Water Potability
In this post, I explore a dataset of water sample properties and their corresponding potability. The dataset is from Aditya Kadiwal on Kaggle. In this analysis, I compare the performance of logistic, decision tree, and xgboost classification models using each model's ROC AUC on a held-out test set. The xgboost model wins.
In this post, I use the R tidymodels framework. Tidymodels aims to unify models and modeling engines to streamline machine learning workflows under a consistent interface. It contains libraries for sampling, feature engineering, model creation, model fitting, model tuning, and result evaluation. Here, I use the following model and engine combinations:
Model | Engine |
---|---|
Logistic | glm |
Decision tree | rpart |
Xgboost | xgboost |
Tidymodels allows me to handle these configurations in similar ways, which makes it easier to compare the results of the models.
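To make that concrete, here is a minimal sketch of the shared parsnip pattern, using the built-in mtcars data rather than the water dataset: declare a model specification, pick an engine, and the same fit() and predict() calls work regardless of which engine sits underneath. Only the model function and engine change between the three models used later.
# Illustration only: fit a logistic model on mtcars with the shared parsnip
# interface (am recoded as a two-level factor outcome).
library(dplyr)
library(parsnip)
cars <- mtcars %>% mutate(am = factor(am))
spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
cars_fit <- spec %>% fit(am ~ mpg + wt, data = cars)
predict(cars_fit, new_data = cars, type = "prob")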
Required libraries and random seed
Required libraries
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(rsample)
library(parsnip)
library(yardstick)
library(tune)
library(recipes)
library(workflows)
library(dials)
library(xgboost)
Random seed
set.seed(123)
Initial data load
The dataset contains 3276 rows, which I split into 2457 rows for training and 819 rows for testing. Each observation has one categorical outcome variable and nine numeric predictor variables.
According to the Kaggle post, the variable meanings are:
Variable | Meaning |
---|---|
pH | pH of water (0 to 14). |
Hardness | Capacity of water to precipitate soap in mg/L. |
Solids | Total dissolved solids in ppm. |
Chloramines | Amount of Chloramines in ppm. |
Sulfate | Amount of Sulfates dissolved in mg/L. |
Conductivity | Electrical conductivity of water in μS/cm. |
Organic_carbon | Amount of organic carbon in ppm. |
Trihalomethanes | Amount of Trihalomethanes in μg/L. |
Turbidity | Measure of light emitting property of water in NTU. |
Potability | Indicates if water is safe for human consumption. Potable=1 and Not potable=0 |
water <- read_csv("data/water_potability.csv")
knitr::kable(head(water))
ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
---|---|---|---|---|---|---|---|---|---|
NA | 204.8905 | 20791.32 | 7.300212 | 368.5164 | 564.3087 | 10.379783 | 86.99097 | 2.963135 | 0 |
3.716080 | 129.4229 | 18630.06 | 6.635246 | NA | 592.8854 | 15.180013 | 56.32908 | 4.500656 | 0 |
8.099124 | 224.2363 | 19909.54 | 9.275884 | NA | 418.6062 | 16.868637 | 66.42009 | 3.055934 | 0 |
8.316766 | 214.3734 | 22018.42 | 8.059332 | 356.8861 | 363.2665 | 18.436525 | 100.34167 | 4.628770 | 0 |
9.092223 | 181.1015 | 17978.99 | 6.546600 | 310.1357 | 398.4108 | 11.558279 | 31.99799 | 4.075075 | 0 |
5.584087 | 188.3133 | 28748.69 | 7.544869 | 326.6784 | 280.4679 | 8.399735 | 54.91786 | 2.559708 | 0 |
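As a quick sanity check of that structure, here is a short base R sketch (not part of the original analysis):
# Confirm the row count and column types described above. Potability is read
# in as numeric 0/1 at this point and converted to a factor later.
dim(water) # expect 3276 rows and 10 columns
sapply(water, class)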
Feature engineering
For this analysis, I made two changes to the dataset.
Visualizing the number of missing values
First, many values in the dataset are missing and need to be filled in. I adapted the approach in the article Missing Value Visualization to visualize the percentage of missing values in the three affected columns.
water_total_rows <- nrow(water)
water_missing <- water %>%
pivot_longer(everything(), names_to = "name", values_to = "value") %>%
mutate(is_missing = is.na(value)) %>%
group_by(name, is_missing) %>%
summarize(num_missing = n()) %>%
filter(is_missing) %>%
transmute(
percent_missing = num_missing / water_total_rows * 100
) %>%
rename(variable = name) %>%
arrange(desc(percent_missing))
knitr::kable(water_missing)
variable | percent_missing |
---|---|
Sulfate | 23.840049 |
ph | 14.987790 |
Trihalomethanes | 4.945055 |
ggplot(water_missing, aes(x = variable, y = percent_missing)) +
geom_col() +
labs(
title = "Percentage of values missing",
subtitle = "Higher is worse"
)
Transform Potability into a factor
Second, I changed the Potability column into a factor with two levels, potable and not_potable. I am training the models to predict when a water sample is drinkable, so potable is the first level and the event of interest.
water_2 <- water %>%
mutate(potable = factor(
case_when(
Potability == 0 ~ "not_potable",
Potability == 1 ~ "potable"
),
levels = c("potable", "not_potable")
)
) %>%
select(-Potability)
knitr::kable(head(water_2))
ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | potable |
---|---|---|---|---|---|---|---|---|---|
NA | 204.8905 | 20791.32 | 7.300212 | 368.5164 | 564.3087 | 10.379783 | 86.99097 | 2.963135 | not_potable |
3.716080 | 129.4229 | 18630.06 | 6.635246 | NA | 592.8854 | 15.180013 | 56.32908 | 4.500656 | not_potable |
8.099124 | 224.2363 | 19909.54 | 9.275884 | NA | 418.6062 | 16.868637 | 66.42009 | 3.055934 | not_potable |
8.316766 | 214.3734 | 22018.42 | 8.059332 | 356.8861 | 363.2665 | 18.436525 | 100.34167 | 4.628770 | not_potable |
9.092223 | 181.1015 | 17978.99 | 6.546600 | 310.1357 | 398.4108 | 11.558279 | 31.99799 | 4.075075 | not_potable |
5.584087 | 188.3133 | 28748.69 | 7.544869 | 326.6784 | 280.4679 | 8.399735 | 54.91786 | 2.559708 | not_potable |
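A quick check, added here for illustration, confirms that potable is the first factor level and therefore the default event of interest for yardstick metrics:
# yardstick treats the first factor level as the event of interest by default.
levels(water_2$potable)
#> [1] "potable" "not_potable"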
Train/test split
I split the data into 75% for training and 25% for testing. I use the same train/test split across all models in this comparison.
water_split <- initial_split(water_2, prop = 0.75, strata = potable)
water_train <- training(water_split)
Filling missing values with k nearest neighbors (kNN)
I use kNN imputation to fill the missing values in the data.
clean_water_recipe <- recipe(potable ~ ., data = water_train) %>%
step_impute_knn(all_numeric(), neighbors = 10)
Exploratory visualization
First, I get a copy of all the data with the missing values filled in to use in the visualizations.
clean_water <- clean_water_recipe %>%
prep(training = water_train) %>%
bake(new_data = water_2)
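As a sanity check (added here, not in the original analysis), the baked data should have no remaining missing values:
# Count missing values per column after kNN imputation; all counts should be zero.
colSums(is.na(clean_water))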
How many observations are in each class?
More of the observations are of non-potable water samples. I suspect this class imbalance made it harder to train the models on examples of potable water.
clean_water_class_count <- clean_water %>%
group_by(potable) %>%
summarize(class_count = n())
knitr::kable(clean_water_class_count)
potable | class_count |
---|---|
potable | 1278 |
not_potable | 1998 |
ggplot(clean_water_class_count, aes(x = potable, y = class_count)) +
geom_col() +
labs(
title = "Class count in entire dataset",
subtitle = "Equal heights are better"
)
Distributions of variables by potability
After filling in the missing values with kNN, I made violin plots comparing the distribution of each variable between potable and non-potable samples.
ggplot(clean_water, aes(x = potable, y = ph)) +
geom_violin() +
ggtitle("pH")
ggplot(clean_water, aes(x = potable, y = Hardness)) +
geom_violin() +
ggtitle("Hardness")
ggplot(clean_water, aes(x = potable, y = Solids)) +
geom_violin() +
ggtitle("Solids")
ggplot(clean_water, aes(x = potable, y = Chloramines)) +
geom_violin() +
ggtitle("Chloramines")
ggplot(clean_water, aes(x = potable, y = Sulfate)) +
geom_violin() +
ggtitle("Sulfate")
ggplot(clean_water, aes(x = potable, y = Conductivity)) +
geom_violin() +
ggtitle("Conductivity")
ggplot(clean_water, aes(x = potable, y = Organic_carbon)) +
geom_violin() +
ggtitle("Organic Carbon")
ggplot(clean_water, aes(x = potable, y = Trihalomethanes)) +
geom_violin() +
ggtitle("Trihalomethanes")
ggplot(clean_water, aes(x = potable, y = Turbidity)) +
geom_violin() +
ggtitle("Turbidity")
Cross-validation and evaluation metrics
I use the same cross-validation folds and evaluation metrics for all models in this post.
k-fold cross-validation
I use 20 folds stratified by the potable outcome variable for cross-validation.
water_folds <- vfold_cv(water_train, v = 20, strata = potable)
Evaluation metrics
I gather ROC AUC, sensitivity, and specificity metrics for each fold and hyperparameter combination during training.
common_metric_set <- metric_set(roc_auc, sens, spec)
Logistic classifier
I fit a logistic classifier first to establish a baseline of performance that the other models should exceed. There are no hyperparameters to configure on the logistic model, so I don’t have a tuning step.
Create and fit the logistic model
logistic_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logistic_last_fit <- logistic_model %>%
last_fit(potable ~ ., split = water_split)
Evaluate logistic performance
The ROC curve lies mainly along the diagonal for the logistic classifier, so its classification performance is about the same as a coin toss (ROC AUC of 0.48). That's pretty bad and won't be hard to exceed with the other models.
logistic_last_fit_metrics <- logistic_last_fit %>%
collect_metrics()
logistic_last_fit_roc_auc <- logistic_last_fit_metrics %>%
filter(.metric == "roc_auc") %>%
pull(.estimate)
knitr::kable(logistic_last_fit_metrics)
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.6177606 | Preprocessor1_Model1 |
roc_auc | binary | 0.4844861 | Preprocessor1_Model1 |
logistic_last_fit_results <- logistic_last_fit %>%
collect_predictions()
logistic_last_fit_results %>%
roc_curve(truth = potable, .pred_potable) %>%
autoplot()
Decision tree classifier
The second classifier I trained was a decision tree. Here, there were cost_complexity
, tree_depth
, and min_n
hyperparameters to tune. I use k-fold cross validation and a simple grid search to tune these hyperparameters.
Decision tree workflow and grid
decision_tree_model <- decision_tree(
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
decision_tree_workflow <- workflow() %>%
add_model(decision_tree_model) %>%
add_recipe(clean_water_recipe)
decision_tree_grid <- grid_random(parameters(decision_tree_model), size = 10)
Tune the decision tree
doParallel::registerDoParallel()
decision_tree_tuning <- decision_tree_workflow %>%
tune_grid(
resamples = water_folds,
grid = decision_tree_grid,
metrics = common_metric_set
)
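Before selecting a configuration, the per-configuration cross-validation metrics can be inspected. A short sketch using tune::show_best(), not part of the original post:
# Show the top tuning configurations ranked by cross-validated ROC AUC.
show_best(decision_tree_tuning, metric = "roc_auc", n = 5)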
Fit the best decision tree
I select the top-performing decision tree based on the ROC AUC metric.
best_decision_tree_model <- decision_tree_tuning %>%
select_best(metric = "roc_auc")
final_decision_tree_workflow <- decision_tree_workflow %>%
finalize_workflow(best_decision_tree_model)
decision_tree_final_fit <- final_decision_tree_workflow %>%
last_fit(split = water_split)
Evaluate decision tree performance
The decision tree’s ROC AUC is 0.62, which, while better than the logistic classifier, is still not very good.
decision_tree_final_fit_metrics <- decision_tree_final_fit %>%
collect_metrics()
decision_tree_final_fit_roc_auc <- decision_tree_final_fit_metrics %>%
filter(.metric == "roc_auc") %>%
pull(.estimate)
knitr::kable(decision_tree_final_fit_metrics)
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.6231707 | Preprocessor1_Model1 |
roc_auc | binary | 0.6220531 | Preprocessor1_Model1 |
decision_tree_final_fit %>%
collect_predictions() %>%
roc_curve(truth = potable, .pred_potable) %>%
autoplot()
Xgboost classifier
The third classifier I trained was an xgboost model. Here, there were trees, min_n, tree_depth, learn_rate, loss_reduction, and sample_size hyperparameters to tune. I tune these hyperparameters with a random grid search.
Define the model and the workflow to go with it.
xgboost_model <-
boost_tree(
trees = tune(),
min_n = tune(),
tree_depth = tune(),
learn_rate = tune(),
loss_reduction = tune(),
sample_size = tune()
) %>%
set_mode("classification") %>%
set_engine("xgboost")
xgboost_workflow <-
workflow() %>%
add_recipe(clean_water_recipe) %>%
add_model(xgboost_model)
xgboost_grid <- grid_random(parameters(xgboost_model), size = 10)
Tune the xgboost
doParallel::registerDoParallel()
xgboost_tuning <- xgboost_workflow %>%
tune_grid(
resamples = water_folds,
grid = xgboost_grid,
metrics = common_metric_set
)
Finalize the best xgboost
best_xgboost_model <- xgboost_tuning %>%
select_best(metric = "roc_auc")
final_xgboost_workflow <- xgboost_workflow %>%
finalize_workflow(best_xgboost_model)
xgboost_final_fit <- final_xgboost_workflow %>%
last_fit(split = water_split)
Evaluate xgboost performance
With a ROC AUC of 0.65, the xgboost model outperforms the decision tree, but not by much.
xgboost_final_fit_metrics <- xgboost_final_fit %>%
collect_metrics()
xgboost_final_fit_roc_auc <- xgboost_final_fit_metrics %>%
filter(.metric == "roc_auc") %>%
pull(.estimate)
knitr::kable(xgboost_final_fit_metrics)
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.6500000 | Preprocessor1_Model1 |
roc_auc | binary | 0.6547125 | Preprocessor1_Model1 |
xgboost_final_fit %>%
collect_predictions() %>%
roc_curve(truth = potable, .pred_potable) %>%
autoplot()
Final comparison
For the final comparison, I plotted the ROC AUC metrics for all three models. The xgboost model performs best, followed by the decision tree and the logistic model.
comparison <- tibble(
model = factor(
c("xgboost", "decision tree", "logistic"),
levels = c("xgboost", "decision tree", "logistic")
),
roc_auc = c(
xgboost_final_fit_roc_auc,
decision_tree_final_fit_roc_auc,
logistic_last_fit_roc_auc
)
) %>%
arrange(desc(roc_auc))
knitr::kable(comparison)
model | roc_auc |
---|---|
xgboost | 0.6547125 |
decision tree | 0.6220531 |
logistic | 0.4844861 |
ggplot(comparison, aes(x = model, y = roc_auc)) +
geom_col() +
labs(
y = "ROC AUC",
title = "Comparison of xgboost and decision tree ROC AUC metrics",
subtitle = "Higher is better"
)
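Beyond the single ROC AUC numbers, the three test-set ROC curves can be overlaid in one plot. This is an added sketch that reuses the fitted objects above:
# Combine test-set predictions from the three final fits and draw one ROC curve
# per model; yardstick respects the group_by() when computing the curves.
bind_rows(
  logistic_last_fit %>% collect_predictions() %>% mutate(model = "logistic"),
  decision_tree_final_fit %>% collect_predictions() %>% mutate(model = "decision tree"),
  xgboost_final_fit %>% collect_predictions() %>% mutate(model = "xgboost")
) %>%
  group_by(model) %>%
  roc_curve(truth = potable, .pred_potable) %>%
  autoplot()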
Conclusion and next steps
My feature engineering included only kNN missing-value imputation. If I were to revisit this analysis, I would explore other feature engineering options to improve the quality of the input data.
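As an illustration of what that might look like, here is a sketch of one possible extended recipe using additional steps from the recipes package; these steps were not evaluated in this post:
# A possible extended recipe: kNN imputation as before, plus a Yeo-Johnson
# transformation, normalization, and a correlation filter on the predictors.
extended_water_recipe <- recipe(potable ~ ., data = water_train) %>%
  step_impute_knn(all_numeric_predictors(), neighbors = 10) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)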