Overview

I was looking for a dataset on which to train a binary classifier, and I found data for dairy milk quality prediction on Kaggle. Posted by Shrijayan Rajendran and located at https://www.kaggle.com/datasets/cpluzshrijayan/milkquality, it has a single outcome variable that classifies milk into three qualities: “low,” “medium,” and “high.” Seven predictor variables accompany these quality ratings. Conveniently, there are no missing values in the dataset.

Ultimately, I converted the three qualities into two categories. I then trained a logistic regression model that achieved the following metrics on the held-out test set:

Metric      Value
accuracy    83%
sensitivity 84%
specificity 82%
ROC AUC     0.92

The rest of this post describes my methods and next steps that might improve the model.

Methods

R libraries I used

I used R with the tidyverse tools for data import and preprocessing, along with tidymodels for creating a logistic regression model and assessing its performance.
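In practice, this amounts to two library() calls, since tidyverse and tidymodels are meta-packages that load the individual packages (readr, dplyr, parsnip, rsample, yardstick, and so on) used below:

library(tidyverse)   # data import and wrangling (readr, dplyr, ggplot2, ...)
library(tidymodels)  # modeling and evaluation (parsnip, rsample, yardstick, ...)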

Initial data load

In the dataset, there are 1,059 observations with eight variables each. There is no missing data.

milk_raw <- read_csv("data/milknew.csv")
## Rows: 1059 Columns: 8
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Grade
## dbl (7): pH, Temprature, Taste, Odor, Fat, Turbidity, Colour
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
knitr::kable(head(milk_raw))
pH Temprature Taste Odor Fat Turbidity Colour Grade
6.6 35 1 0 1 0 254 high
6.6 36 0 1 0 1 253 high
8.5 70 1 1 1 1 246 low
9.5 34 1 1 0 1 255 low
6.6 37 0 0 0 0 255 medium
6.6 37 1 1 1 1 255 high

Preprocessing

My main preprocessing step was to turn the three original classes into a binary outcome. Because a binary classifier needs exactly two classes, I combined the original “high” and “medium” qualities into a new category called “good” and relabeled the original “low” quality as “bad.” While doing so, I also cleaned up two column names: the source data misspells Temperature as “Temprature” and uses the British spelling “Colour.” My positive event is finding milk of “good” quality, so I made “good” the first factor level of the outcome variable.

milk_clean <- milk_raw %>%
  transmute(
    pH,
    Temperature = Temprature,
    Odor,
    Fat,
    Turbidity,
    Taste,
    Color = Colour,
    Quality = factor(
      case_when(
        Grade == "high" ~ "good",
        Grade == "medium" ~ "good",
        Grade == "low" ~ "bad"
      ),
      levels = c("good", "bad")
    )
  )

knitr::kable(head(milk_clean))
pH Temperature Odor Fat Turbidity Taste Color Quality
6.6 35 0 1 0 1 254 good
6.6 36 1 0 1 0 253 good
8.5 70 1 1 1 1 246 bad
9.5 34 1 0 1 1 255 bad
6.6 37 0 0 0 0 255 good
6.6 37 1 1 1 1 255 good
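
A quick sanity check at this point, not shown in the output above, is to tabulate the class balance of the new outcome before splitting:

# count how many rows fall into each of the two new classes
milk_clean %>%
  count(Quality)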

Model setup and training

I split the data 75/25, stratified by Quality, which gave me 793 rows for training and 266 rows for testing. I then set up a logistic regression binary classifier using the glm engine. Finally, I used last_fit() to train the model on the training set and evaluate it on the test set, predicting the class (stored in the variable Quality) from every other variable.

milk_split <- initial_split(milk_clean, prop = 0.75, strata = Quality)

logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

logistic_last_fit <- logistic_model %>%
  last_fit(Quality ~ ., split = milk_split)
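
To see what the model learned, the fitted glm object can be pulled out of the last_fit result and its coefficients tidied (output not shown here):

logistic_last_fit %>%
  extract_fit_parsnip() %>%  # the underlying parsnip/glm model fit
  tidy()                     # one row per coefficient, with estimates and p-values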

Results

ROC AUC and accuracy metrics

logistic_last_fit %>%
  collect_metrics() %>%
  knitr::kable()
.metric .estimator .estimate .config
accuracy binary 0.8308271 Preprocessor1_Model1
roc_auc binary 0.9195382 Preprocessor1_Model1

ROC curve and mosaic plot

logistic_last_fit_results <- logistic_last_fit %>%
  collect_predictions()

logistic_last_fit_results %>%
  roc_curve(truth = Quality, .pred_good) %>%
  autoplot()

logistic_last_fit_results %>%
  conf_mat(truth = Quality, estimate = .pred_class) %>%
  autoplot(type = "mosaic")

The ROC curve visualizes the ROC AUC of 0.92. In the mosaic plot, we can see that the sensitivity of 84% is slightly higher than the specificity of 82%.
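
For just these two numbers, a small yardstick metric set (milk_metric_set is an arbitrary name of my own) can compute them directly from the predictions, rather than reading them off the full confusion matrix summary below:

# accuracy, sensitivity, and specificity from the test-set predictions
milk_metric_set <- metric_set(accuracy, sens, spec)

logistic_last_fit_results %>%
  milk_metric_set(truth = Quality, estimate = .pred_class)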

All metrics

logistic_last_fit_results %>%
  conf_mat(truth = Quality, estimate = .pred_class) %>%
  summary() %>%
  knitr::kable()
.metric .estimator .estimate
accuracy binary 0.8308271
kap binary 0.6528221
sens binary 0.8354430
spec binary 0.8240741
ppv binary 0.8741722
npv binary 0.7739130
mcc binary 0.6537762
j_index binary 0.6595171
bal_accuracy binary 0.8297586
detection_prevalence binary 0.5676692
precision binary 0.8741722
recall binary 0.8354430
f_meas binary 0.8543689

Conclusion and next steps

I was happy with how simple it was to set up the logistic regression model using tidymodels and achieve decent results with such a basic model. However, there is room for improvement: the negative predictive value of 77%, in particular, lags behind the other metrics. The next step would be to try more complex binary classification models to see whether they can achieve better metrics.
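
As a sketch of what that next step could look like, here is how a random forest would drop into the same workflow with only the model specification changed. I haven't run this yet (it assumes the ranger package is installed), but it shows how little code would change:

# hypothetical next model: a random forest via the ranger engine
rf_model <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# reuse the same train/test split and formula as before
rf_last_fit <- rf_model %>%
  last_fit(Quality ~ ., split = milk_split)

rf_last_fit %>%
  collect_metrics()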