Aqueous solubility (the ability to dissolve in water) is an essential property of a chemical compound in the laboratory. Can the solubility of a compound be predicted from its chemical structure alone? John Delaney posed this prediction question in 2004 (Delaney 2004) in a paper that has been widely cited in the chemistry literature. This study takes a dataset similar to the one used in that paper and uses linear and random forest regression to predict the compounds’ solubilities.


When I was teaching introductory Python to scientists and engineers getting their start in data science, I would often get the question: “Why aren’t you teaching this course using R?” At the time, I didn’t know any R and had no satisfactory answer, so I gave a hand-wavy response: “Python is more comfortable to integrate into a larger software ecosystem.” Then I would proceed to teach in Python.


Aqueous solubility (the ability to dissolve in water) is an important property of a chemical compound in the laboratory. While it is possible to determine solubilities through physical experiments, let’s assume for this tiny project that such experiments are prohibitively expensive. This presents an interesting predictive modeling problem: given a known chemical structure, can the aqueous solubility of a compound be predicted without physical experiments? Delaney posed this question in 2004 (Delaney, 2004) in a study that took SMILES strings (a text format for storing the structure of chemical compounds), extracted features from them, and fit a simple regression model to predict solubility.


In my previous post, I predicted the aqueous solubility of chemical compounds listed in a public dataset using a basic linear model and an untuned random forest. The random forest performed better of the two models, achieving an RMSE of 0.866 and an R^2 of 0.828. In this post, I train an XGBoost model on the same data.

Load the libraries

    library(tidyverse)
    library(tidymodels)
    library(ranger)
    library(usemodels)
    library(gridExtra)
    library(vip)
    theme_set(theme_classic())

Load, split, and bootstrap sample the data.
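
For readers comparing the RMSE and R^2 figures quoted for the random forest, these are the conventional definitions of the two metrics. This is a reference sketch rather than a statement of how the post computed them. The symbols are assumptions introduced here: y_i is the measured solubility of compound i, \hat{y}_i is the model's prediction, \bar{y} is the mean of the measured values, and n is the number of compounds in the evaluation set.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
\qquad
R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}
```

The excerpt does not say which R^2 variant was reported. Some tools, including the default `rsq()` metric in tidymodels' yardstick package, instead report the squared correlation between measured and predicted values, which can differ from the expression above.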



Alicia Key

I am passionate about data and science.

PhD student

Aurora, Colorado, USA