PDB Data Exploration

The Protein DataBank (PDB) stores files that contain the structure of “proteins, nucleic acids, and complex assemblies.” These structures are essential tools for research in structural biology, biochemistry, and related fields. I was recently browsing Kaggle for datasets and found a scrape of PDB data up to the year 2018 by Shahir. The data included sequence information as well as metadata about those sequences. Kaggle suggests these data as a multi-class classification exercise. For this post, I am interested in exploring the metadata to find the data stored in the database and trends in how researchers collected those data.

Include libraries and set plot theme.

Libraries

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(ggplot2))

Set theme and `dplyr` options

theme_set(theme_minimal())
options(dplyr.summarise.inform = FALSE)

Load the dataset

pdb <- read_csv("data/pdb_data_no_dups.csv.gz", show_col_types = FALSE)

Look at the head of the dataset

knitr::kable(head(pdb))

structureId	classification	experimentalTechnique	macromoleculeType	residueCount	resolution	structureMolecularWeight	crystallizationMethod	crystallizationTempK	densityMatthews	densityPercentSol	pdbxDetails	phValue	publicationYear
100D	DNA-RNA HYBRID	X-RAY DIFFRACTION	DNA/RNA Hybrid	20	1.90	6360.30	VAPOR DIFFUSION, HANGING DROP	NA	1.78	30.89	pH 7.00, VAPOR DIFFUSION, HANGING DROP	7	1994
101D	DNA	X-RAY DIFFRACTION	DNA	24	2.25	7939.35	NA	NA	2.00	38.45	NA	NA	1995
101M	OXYGEN TRANSPORT	X-RAY DIFFRACTION	Protein	154	2.07	18112.80	NA	NA	3.09	60.20	3.0 M AMMONIUM SULFATE, 20 MM TRIS, 1MM EDTA, PH 9.0	9	1999
102D	DNA	X-RAY DIFFRACTION	DNA	24	2.20	7637.17	VAPOR DIFFUSION, SITTING DROP	277	2.28	46.06	pH 7.00, VAPOR DIFFUSION, SITTING DROP, temperature 277.00K	7	1995
102L	HYDROLASE(O-GLYCOSYL)	X-RAY DIFFRACTION	Protein	165	1.74	18926.61	NA	NA	2.75	55.28	NA	NA	1993
102M	OXYGEN TRANSPORT	X-RAY DIFFRACTION	Protein	154	1.84	18010.64	NA	NA	3.09	60.20	3.0 M AMMONIUM SULFATE, 20 MM TRIS, 1MM EDTA, PH 9.0	9	1999

Missing data

When I looked at the .csv in Excel, I found lots of missing data in the dataset. In this section, I quantify how much missing data there is.

Bar graph of missing value counts

pdb_total_rows = nrow(pdb)

pdb_missing <- pdb %>%
  summarize(
    structureIdNa = sum(is.na(structureId)) / pdb_total_rows * 100,
    classificationNa = sum(is.na(classification)) / pdb_total_rows * 100,
    experimentalTechniqueNa = sum(is.na(experimentalTechnique)) / pdb_total_rows * 100,
    macromoleculeTypeNa = sum(is.na(macromoleculeType)) / pdb_total_rows * 100,
    residueCountNa = sum(is.na(residueCount)) / pdb_total_rows * 100,
    resolutionNa = sum(is.na(resolution)) / pdb_total_rows * 100,
    structureMolecularWeightNa = sum(is.na(structureMolecularWeight)) / pdb_total_rows * 100,
    crystallizationMethodNa = sum(is.na(crystallizationMethod)) / pdb_total_rows * 100,
    crystallizationTempK = sum(is.na(crystallizationTempK)) / pdb_total_rows * 100,
    densityMatthewsNa = sum(is.na(densityMatthews)) / pdb_total_rows * 100,
    densityPercentSolNa = sum(is.na(densityPercentSol)) / pdb_total_rows * 100,
    pdbxDetailsNa = sum(is.na(pdbxDetails)) / pdb_total_rows * 100,
    phValueNa = sum(is.na(phValue)) / pdb_total_rows * 100,
    publicationYearNa = sum(is.na(publicationYear)) / pdb_total_rows * 100
  ) %>%
  pivot_longer(everything(), names_to = "column", values_to = "percent_missing") %>%
  arrange(percent_missing)

pdb_missing %>%
  mutate(
    column_sort = factor(
      column,
      levels = pdb_missing$column
    )
  ) %>%
  ggplot(aes(x = column_sort, y = percent_missing)) +
    geom_col() +
    coord_flip() +
    labs(
      title = "Percent missing per column",
      subtitle = "Higher is worse"
    )

What if I dropped all `NA`?

Percentage observations remaining if all rows that have NA were dropped:

pdb_no_na_count <- pdb %>%
  drop_na() %>%
  nrow()

pdb_no_na_count / pdb_total_rows * 100

## [1] 46.60575

Number of observations remaining if all the NAs were dropped.

pdb_total_rows - pdb_no_na_count

## [1] 75500

Plan of action: drop rows with `NA` and filter year range

For this exploration, I will simply drop all rows that have any NA variables. I will also restrict the year range to 2000 to 2018.

pdb_clean <- pdb %>%
  drop_na() %>%
  filter(publicationYear >= 2000, publicationYear < 2018)

Exploratory Visualization and Analysis

Publications per macromolecule type, total up to 2018

Only drop the rows that have no macromolecule or publicationYear type.

Overall, it looks like proteins are by far the most dominant molecules being analyzed.

pdb_publication_count_all_time <- pdb_clean %>%
  group_by(macromoleculeType) %>%
  summarize(publicationCount = n()) %>%
  arrange(desc(publicationCount))

knitr::kable(pdb_publication_count_all_time)

macromoleculeType	publicationCount
Protein	60011
Protein#DNA	2478
Protein#RNA	1041
RNA	518
DNA	430
Protein#DNA#RNA	137
Protein#DNA#DNA/RNA Hybrid	31
DNA/RNA Hybrid	23
RNA#DNA/RNA Hybrid	17
DNA#RNA	12
Protein#DNA/RNA Hybrid	9
Protein#RNA#DNA/RNA Hybrid	4

ggplot(pdb_publication_count_all_time, aes(x = 1, y = publicationCount, fill = macromoleculeType)) +
  geom_col(position = "stack") +
  scale_fill_viridis_d() +
  labs(title = "Number of publications by macromolecule type") +
  ylab("publication count")

Publications per year, 2000 to 2017

pdb_clean %>%
  group_by(publicationYear) %>%
  summarize(publicationCount = n()) %>%
  ggplot(aes(x = publicationYear, y = publicationCount)) +
    geom_line() +
    labs(title = "Publications per year") +
    ylab("publication count")

Experimental techniques

Distribution of experimental techniques

Virtually all the publications in the cleaned dataset were from x-ray diffraction.

pdb_techique_counts <- pdb_clean %>%
  group_by(experimentalTechnique) %>%
  summarize(publicationCount = n()) %>%
  arrange(desc(publicationCount))

knitr::kable(pdb_techique_counts)

experimentalTechnique	publicationCount
X-RAY DIFFRACTION	64678
NEUTRON DIFFRACTION	11
ELECTRON CRYSTALLOGRAPHY	5
NEUTRON DIFFRACTION, X-RAY DIFFRACTION	5
POWDER DIFFRACTION	5
X-RAY DIFFRACTION, EPR	5
EPR, X-RAY DIFFRACTION	1
X-RAY DIFFRACTION, NEUTRON DIFFRACTION	1

Molecular weight by experimental technique

It looks like x-ray diffraction has the widest range of molecular weights…but there are far more x-ray diffraction observations in the cleaned dataset.

ggplot(pdb_clean, aes(x = experimentalTechnique, y = structureMolecularWeight)) +
  geom_boxplot() +
  scale_y_log10() +
  coord_flip() +
  labs(
    title = "Molecular weight by experimental technique"
  ) +
  ylab("log(molecular weight)") +
  xlab("experimental technique")

Other variables

Resolution by experimental technique

ggplot(pdb_clean, aes(x = experimentalTechnique, y = resolution)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Resolution by experimental technique"
  ) +
  ylab("resolution (angstroms)") +
  xlab("experimental technique")

densityMatthews

Mathews density as described in Matthews coefficient probabilities: Improved estimates for unit cell contents of proteins, DNA, and protein–nucleic acid complex crystals.

ggplot(pdb_clean, aes(x = densityMatthews)) +
  geom_histogram(bins = 100)

densityPercentSol

I wonder if these could be tied back to crystallization methods or crystallization temperature.

ggplot(pdb_clean, aes(x = densityPercentSol)) +
  geom_histogram(bins = 100)

Classification column

For this section, only apply cleaning to the classification column.

Top classifications

What are the top 10 classifications?

top_classifications <- pdb_clean %>%
  group_by(classification) %>%
  summarize(classification_count = n()) %>%
  arrange(desc(classification_count))

knitr::kable(head(top_classifications, 5))

classification	classification_count
HYDROLASE	10090
TRANSFERASE	7583
OXIDOREDUCTASE	6109
IMMUNE SYSTEM	2397
LYASE	2290

Conclusion

In the future, it would be interesting to:

Train a model on the sequence data that accompanies this dataset, to see if I could fill in missing values based on sequence data.
Find if there were any systematic features missing for each macromoleculeType. Maybe each macro molecule should get its own schema of attributes that are likely to be filled in.
Mine the text of pdbxDetails. Perhaps these could correlated with things like ph, resolution, experimental method, etc.
Mine the text of crystallizationMethod. Perhaps these could correlated with things like ph, resolution, experimental method, etc.
Mine the text of classification since there are so many unique values in that column.

R: PDB Data Exploration

PDB Data Exploration

Include libraries and set plot theme.

Libraries

Set theme and `dplyr` options

Load the dataset

Look at the head of the dataset

Missing data

Bar graph of missing value counts

What if I dropped all `NA`?

Plan of action: drop rows with `NA` and filter year range

Exploratory Visualization and Analysis

Publications per macromolecule type, total up to 2018

Publications per year, 2000 to 2017

Experimental techniques

Distribution of experimental techniques

Molecular weight by experimental technique

Other variables

Resolution by experimental technique

densityMatthews

densityPercentSol

Classification column

Top classifications

Conclusion

Alicia Key

R: PDB Data Exploration

PDB Data Exploration

Include libraries and set plot theme.

Libraries

Set theme and dplyr options

Load the dataset

Look at the head of the dataset

Missing data

Bar graph of missing value counts

What if I dropped all NA?

Plan of action: drop rows with NA and filter year range

Exploratory Visualization and Analysis

Publications per macromolecule type, total up to 2018

Publications per year, 2000 to 2017

Experimental techniques

Distribution of experimental techniques

Molecular weight by experimental technique

Other variables

Resolution by experimental technique

densityMatthews

densityPercentSol

Classification column

Top classifications

Conclusion

Alicia Key

Set theme and `dplyr` options

What if I dropped all `NA`?

Plan of action: drop rows with `NA` and filter year range