R: Deep Learning Organic Chemistry Part 3

Introduction

In my post “R: Deep Learning Organic Chemistry Again,” I trained a convolutional neural network based on VGG16 to recognize a benzene ring diagram, a crucial structure in many organic chemistry molecules. The classification problem I posed to the convnet was a binary classification to separate diagrams of molecules that contain a benzene ring from those that do not. However, near the end of that post, I found images I had mistakenly put in the wrong training and validation folders. My mistake meant I attempted to train and evaluate the convnet on corrupt training data. Oops!

I fixed the training and validation data classifications by moving two images from the non-benzene ring class to the benzene ring class. I then reran the same R as before. With good training and validation data, my model’s performance improved the accuracy, true positive rate (TPR), and false negative rate (FNR) metrics. Interstingly, the true negative rate (TNR) and false positive rate (FPR) became worse.

With corrupt training and validation data:

accuracy	tpr	tnr	fpr	fnr
0.8181818	0.7636364	0.8727273	0.1272727	0.2363636

With fixed training and validation data:

accuracy	tpr	tnr	fpr	fnr
0.8454545	0.877193	0.8113208	0.1886792	0.122807

While I still had a suboptimal number of images, I am happy that I improved the model’s performance.

Below is the R code that trained and evaluated the model. Following the R code, I manually examine misclassifications in the validation set to see what the model got wrong.

Define where the image datasets are

There is a small dataset of 410 images. Because there are so few images, I will just work with a training and validation set, and I will not use a separate test dataset.

dataset	benzene ring images	non benzene ring images
train	152	148
validation	57	53

train_dir <- "data/train"
validation_dir <- "data/validation"

Create the image generators

Use data augmentation on the training set, but not the validation set.

Create a function to create the generators. This function will be called twice. The first time will be during training of the first network. The second time will be training the second network.

create_generators <- function(test_validation_batch_size) {
  validation_datagen <- image_data_generator(rescale = 1/255)
  train_datagen <- image_data_generator(rescale = 1/255)
  
  train_generator <- flow_images_from_directory(
    train_dir,
    train_datagen,
    target_size = c(256, 256),
    batch_size = test_validation_batch_size,
    shuffle = FALSE,
    class_mode = "binary"
  )
  
  validation_generator <- flow_images_from_directory(
    validation_dir,
    validation_datagen,
    target_size = c(256, 256),
    batch_size = test_validation_batch_size,
    shuffle = FALSE,
    class_mode = "binary"
  )
  
  list(train_generator = train_generator, validation_generator = validation_generator)
}

Define the network

Use VGG16 trained on imagenet for the convolutional layers. Freeze the convolutional layers so that their weights are not adjusted during training. The final layers will be a simple binary classifier made with dense layers.

First, make a function that can return an initalized network.

create_network <- function() {
  conv_base <- application_vgg16(
    weights = "imagenet",
    include_top = FALSE,
    input_shape = c(256, 256, 3)
  )
  
  conv_base %>% freeze_weights()
  
  network <- keras_model_sequential() %>%
    conv_base %>%
    layer_flatten() %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1, activation = "sigmoid")
  
  network %>% compile(
    loss = "binary_crossentropy",
    optimizer = optimizer_rmsprop(learning_rate = 0.5e-5),
    metrics = c("accuracy")
  )
  
  network
}

Now use the function to create the network

convnet_1 <- create_network()

## Loaded Tensorflow version 2.7.0

summary(convnet_1)

## Model: "sequential"
## _____________________________________________________________________
##  Layer (type)                  Output Shape               Param #    
## =====================================================================
##  vgg16 (Functional)            (None, 8, 8, 512)          14714688   
##  flatten (Flatten)             (None, 32768)              0          
##  dense_3 (Dense)               (None, 256)                8388864    
##  dense_2 (Dense)               (None, 128)                32896      
##  dense_1 (Dense)               (None, 64)                 8256       
##  dense (Dense)                 (None, 1)                  65         
## =====================================================================
## Total params: 23,144,769
## Trainable params: 8,430,081
## Non-trainable params: 14,714,688
## _____________________________________________________________________

Fit the network using the training and validation data

Checkpoint every model that shows a better val_accuracy than the previous. Stop training when there have been 5 epochs without improvement to val_accuracy.

tensorflow::set_random_seed(0)

callbacks_list <- list(
  callback_model_checkpoint(
    filepath = "organicml_checkpoint.h5",
    monitor = "val_accuracy",
    mode = "max",
    save_best_only = TRUE
  )
)

test_validation_batch_size <- 25
generators_1 <- create_generators(test_validation_batch_size = test_validation_batch_size)

fit_history_1 <- convnet_1 %>%
  fit(
    generators_1$train_generator,
    steps_per_epoch = 300/test_validation_batch_size,
    epochs = 50,
    validation_data = generators_1$validation_generator,
    validation_steps = 110/test_validation_batch_size,
    callbacks = callbacks_list
  )

Analyze training history

plot(fit_history_1)

What epoch had the maximum validation accuracy?

max_val_accuracy_1 <- max(fit_history_1$metrics$val_accuracy)
argmax_val_accuracy_1 <- which.max(fit_history_1$metrics$val_accuracy)
cat("Maximum val_accuracy: ", max_val_accuracy_1, "\n")

## Maximum val_accuracy:  0.86

cat("Epoch of maximum validation accuracy: ", argmax_val_accuracy_1, "\n")

## Epoch of maximum validation accuracy:  49

Recreate generator for evaluation images

batch_size <- 10
validation_dir <- "data/validation"

validation_datagen <- image_data_generator(rescale = 1/255)

## Loaded Tensorflow version 2.7.0

validation_generator <- flow_images_from_directory(
    validation_dir,
    validation_datagen,
    target_size = c(256, 256),
    batch_size = batch_size,
    shuffle = FALSE,
    class_mode = "binary"
  )

Reload and Evaluate

Reload and evaluate the best model that was saved.

network <- load_model_hdf5("organicml_checkpoint.h5")
network %>% evaluate(validation_generator)

##      loss  accuracy 
## 0.3878245 0.8454546

Make a vector of all output from the model

Present each batch of images from the generator (it has already been iterated over once) and obtain the output of the sigmoid activation function that is the last layer of the network.

predicted <- c()
actual <- c()
network_input <- c()
i <- 0
while(TRUE) {
  batch <- generator_next(validation_generator)
  inputs_batch <- batch[[1]]
  labels_batch <- batch[[2]]
  predicted <- c(predicted, round(network %>% predict(inputs_batch)))
  actual <- c(actual, labels_batch)
  network_input <- c(network_input, inputs_batch)
  i <- i + 1
  if (i * batch_size >= validation_generator$samples)
    break
}

Compute the error rate

Assemble a dataframe that lines up predictions alongside the actual labels and the input image filename.

evaluation_df <- tibble(
    image_filename = validation_generator$filepaths,
    actual = actual,
    predicted = predicted
  )

Compute accuracy, true positive rate (tpr), true negative rate (tnr), false negative rate (fnr), and false positive rate (fpr). Using formulas from https://en.wikipedia.org/wiki/Sensitivity_and_specificity.

network_performance <- evaluation_df %>%
  summarize(
    accuracy = sum(actual == predicted) / n(),
    tpr = sum(actual == 1 & predicted == 1) / sum(actual == 1),
    tnr = sum(actual == 0 & predicted == 0) / sum(actual == 0),
    fpr = sum(actual == 0 & predicted == 1) / sum(actual == 0),
    fnr = sum(actual == 1 & predicted == 0) / sum(actual == 1)
  )
knitr::kable(network_performance)

accuracy	tpr	tnr	fpr	fnr
0.8454545	0.877193	0.8113208	0.1886792	0.122807

Create a csv of failed classification attempts

evaluation_df %>%
  filter(actual != predicted) %>%
  write_csv("organicml-evaluation-failures.csv")

Observe the misclassifications

A positive class of 1 means that the compound diagram contains a benzene ring and a negative class of 0 means no benzene ring.

Actual class	Predicted class	What failed
1	0	Substituent groups attached confused the convnet.
1	0	Substituent groups attached confused the convnet.
1	0	Substituent groups attached confused the convnet.
1	0	Substituent groups attached confused the convnet.
1	0	Substituent groups attached confused the convnet.
1	0	Substituent groups attached confused the convnet.
0	1	A prominent and well-positioned hexagon without double bonds confused the convnet.
0	1	Three prominent and well-positioned hexagons without double bonds confused the convnet.
0	1	A prominent and well-positioned hexagon without double bonds confused the convnet. Could the “3” have been misinterpreted as a double bond?
0	1	A prominent and well-positioned hexagon without double bonds confused the convnet. Could the “3” have been misinterpreted as a double bond?
0	1	A prominent and well-positioned hexagon without double bonds confused the convnet…but, there is one double bond in this image, just not in the hexagon.
0	1	The hexagon should have had 3 double bonds to be in the positive class.
0	1	The hexagon should have had 3 double bonds to be in the positive class.
0	1	The convnet probably mistook the triple-bonded C N atoms for a double bond.
0	1	Again, there need to be three double bonds in the hexagon for the postive class.
0	1	A prominent and well-positioned hexagon without double bonds confused the convnet.

Conclusion

To improve the network (yet again) I would provide more training images and, in particular, benzene rings with substitutent groups that are labeled in the positive class.