In my prior aqueous solubility regression study, I did an exploratory data visualization and found intriguing plots of solubility versus other variables in the study. I didn’t perform any experimental modeling of those relationships in that study. Here, I followup by performing a cluster analysis of solubility relationships to help future regression modeling efforts. My question is: do clusters within each of these relationships explain each feature’s effect on solubility?

Table 1 is a sample of some of the compounds in the dataset:

df <- as_tibble(read.csv("data/delaney-processed.csv")) %>%
  select(
    compound = Compound.ID, 
    mw = Molecular.Weight, 
    h_bond_donors = Number.of.H.Bond.Donors, 
    rings = Number.of.Rings, 
    rotatable_bonds = Number.of.Rotatable.Bonds, 
    psa = Polar.Surface.Area, 
    solubility = measured.log.solubility.in.mols.per.litre
)
knitr::kable(head(df))
compound mw h_bond_donors rings rotatable_bonds psa solubility
Amigdalin 457.432 7 3 7 202.32 -0.77
Fenfuram 201.225 1 2 2 42.24 -3.30
citral 152.237 0 0 4 17.07 -2.06
Picene 278.354 0 5 0 0.00 -7.87
Thiophene 84.143 0 1 0 0.00 -1.33
benzothiazole 135.191 0 2 0 12.89 -1.50

Dataset description

See the previous post for a description of the dataset and variables I use in this study.

Review of prior figures

Figure 1 contains histograms and bar plots showing the distributions of variables in the datasets.

p1 <- ggplot(df, aes(x = mw)) +
  geom_histogram(bins = 10) +
  labs(title = "(a)")

p2 <- ggplot(df, aes(x = psa)) +
  geom_histogram(bins = 10) +
  labs(title = "(b)")

p3 <- ggplot(df, aes(x = solubility)) +
  geom_histogram(bins = 10) +
  labs(title = "(c)")

p4 <- ggplot(df, aes(x = h_bond_donors)) +
  geom_bar() +
  labs(title = "(d)")

p5 <- ggplot(df, aes(x = rings)) +
  geom_bar() +
  labs(title = "(e)")

p6 <- ggplot(df, aes(x = rotatable_bonds)) +
  geom_bar() +
  labs(title = "(f)")

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
Histograms and bar plots of variables

Figure 1: Histograms and bar plots of variables

As shown above, the subplots 1a, 1b, 1d, 1e, and 1f show distributions that favor the low end of the distribution. This low-end favorability is essential when extracting relationships for values greater in these distributions. For example, data about what happens to solubility with large ring counts are relatively sparse.

Figure 2 contains the plots of solubility versus other variables I want to explore with clustering. Note that the plots have jittered points to prevent overplotting. Also, note that the solubility is in log(mol/L) because solubility in this dataset spans many orders of magnitude.

alpha <- 0.1

p1 <- ggplot(df, aes(x = mw, y = solubility)) +
  geom_jitter(alpha = alpha, width = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "(a)")

p2 <- ggplot(df, aes(x = psa, y = solubility)) +
  geom_jitter(alpha = alpha) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "(b)")

p3 <- ggplot(df, aes(x = h_bond_donors, y = solubility)) +
  geom_jitter(alpha = alpha, width = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "(c)")

p4 <- ggplot(df, aes(x = rings, y = solubility)) +
  geom_jitter(alpha = alpha, width = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "(d)")

p5 <- ggplot(df, aes(x = rotatable_bonds, y = solubility)) +
  geom_jitter(alpha = alpha, width = 0.1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "(e)")

grid.arrange(p1, p2, p3, p4, p5, nrow = 3)
Relationship of molecular weight to other variables

Figure 2: Relationship of molecular weight to other variables

Each variable has a different trend of its effect on solubility, as shown above. Figures 2b (of polar surface area) and 2c (of h-bond donors) show increasing trends of solubility. Figures 2a (molecular weight), 2d (rings), and 2e (rotatable bonds) show decreasing trends in solubility.

Solubility versus h-bond donors by cluster

h_bond_donor_cluster_df <- hierarchical_clusters(select(df, solubility, h_bond_donors))

cluster_2 <- ggplot(h_bond_donor_cluster_df, aes(x = h_bond_donors, y = solubility, col = cluster_2)) +
  geom_point(alpha = alpha) +
  labs(title = "(a)") +
  theme_minimal()

cluster_3 <- ggplot(h_bond_donor_cluster_df, aes(x = h_bond_donors, y = solubility, col = cluster_3)) +
  geom_point(alpha = alpha) +
  labs(title = "(b)") +
  theme_minimal()

cluster_4 <- ggplot(h_bond_donor_cluster_df, aes(x = h_bond_donors, y = solubility, col = cluster_4)) +
  geom_point(alpha = alpha) +
  labs(title = "(c)") +
  theme_minimal()

grid.arrange(cluster_2, cluster_3, cluster_4, nrow = 2)
Solubility versus h-bond donors by cluster

Figure 4: Solubility versus h-bond donors by cluster

Similar to Figure 3, Figure 4a with two clusters provides no clear break between high and low solubility. Figure 4c shows that the fourth cluster covers a few points a corner of the plot. Once again, Figure 4b shows that three clusters provide a clear break between high and low solubility, with a few points covered by a third cluster. I’ll choose 3 clusters as optimal for h-bond donors.

Solubility versus rings by cluster

rings_cluster_df <- hierarchical_clusters(select(df, solubility, rings))

cluster_2 <- ggplot(rings_cluster_df, aes(x = rings, y = solubility, col = cluster_2)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(a)")

cluster_3 <- ggplot(rings_cluster_df, aes(x = rings, y = solubility, col = cluster_3)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(b)")

cluster_4 <- ggplot(rings_cluster_df, aes(x = rings, y = solubility, col = cluster_4)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(c)")

grid.arrange(cluster_2, cluster_3, cluster_4, nrow = 2)
Solubility versus rings

Figure 5: Solubility versus rings

Figure 5a shows two clusters, each of which straddles high and low solubility. Figure 5b, with three clusters, exhibits the same problem, though the third cluster is generally of low solubility. Figure 5c represents the separation best: cluster 1 is mostly high solubility, cluster 4 is primarily low solubility, and cluster 3 is low solubility. Cluster 2 aggregates between -3 and -5, so it to provides a useful solubility range. I’ll choose four as the optimum number of clusters.

Solubility versus rotatable_bonds by cluster

rotatable_bonds_cluster_df <- hierarchical_clusters(select(df, solubility, rotatable_bonds))

cluster_2 <- ggplot(rotatable_bonds_cluster_df, aes(x = rotatable_bonds, y = solubility, col = cluster_2)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(a)")

cluster_3 <- ggplot(rotatable_bonds_cluster_df, aes(x = rotatable_bonds, y = solubility, col = cluster_3)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(b)")

cluster_4 <- ggplot(rotatable_bonds_cluster_df, aes(x = rotatable_bonds, y = solubility, col = cluster_4)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(c)")

grid.arrange(cluster_2, cluster_3, cluster_4, nrow = 2)
Solubility versus rotatable_bonds by cluster

Figure 6: Solubility versus rotatable_bonds by cluster

Figure 6a shows two clusters of rotatable bond counts that don’t provide a break between high and low solubility. Figure 6c adds the fourth cluster with a wide span of solubilities and, therefore, not much value. Once again, Figure 6b shows three clusters, with reasonably well-defined breaks around solubilities of -3 and -4. I’ll choose three as the optimum number of clusters for rotatable bonds.

Solubility versus molecular weight by cluster

mw_cluster_df <- hierarchical_clusters(select(df, solubility, mw))

cluster_2 <- ggplot(mw_cluster_df, aes(x = mw, y = solubility, col = cluster_2)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(a)") +
  theme_minimal()

cluster_3 <- ggplot(mw_cluster_df, aes(x = mw, y = solubility, col = cluster_3)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(b)") +
  theme_minimal()

cluster_4 <- ggplot(mw_cluster_df, aes(x = mw, y = solubility, col = cluster_4)) +
  geom_point(alpha = alpha, size = size) +
  labs(title = "(c)") +
  theme_minimal()

grid.arrange(cluster_2, cluster_3, cluster_4, nrow = 2)
Solubility versus molecular weight by cluster

Figure 7: Solubility versus molecular weight by cluster

Summary of optimum number of clusters

Variable name Optimal cluster count
polar surface area (PSA) 3
h-bond donors 3
rings 4
rotatable bonds 3
molecular weight 4

Conclusion

When I use hierarchiical clustering to group solubilities, each variable in the dataset needs a different number of clusters to adequately specify its relastionship to solubility.