What is fftpipe?

fftpipe is a package of functions that wrap the base R fft() function. It enables workflows built around fft() that use the pipe (%>%) operator. I took inspiration for the fftpipe interface from the Tidyverse and tidymodels packages. Specifically, fftpipe offers waveform generation, FFT and inverse FFT transformation, and plotting of these waveforms and FFTs.

Installation

Install fftpipe from GitHub with devtools.

Continue reading

R: PDB Data Exploration

PDB Data Exploration

The Protein Data Bank (PDB) stores files that contain the structures of “proteins, nucleic acids, and complex assemblies.” These structures are essential tools for research in structural biology, biochemistry, and related fields. I was recently browsing Kaggle for datasets and found a scrape of PDB data through 2018, posted by Shahir. The data include sequence information as well as metadata about those sequences. Kaggle suggests these data as a multi-class classification exercise.

Continue reading

Introduction

In my post “R: Deep Learning Organic Chemistry Again,” I trained a convolutional neural network based on VGG16 to recognize a benzene ring diagram, a crucial structure in many organic chemistry molecules. The classification problem I posed to the convnet was a binary classification to separate diagrams of molecules that contain a benzene ring from those that do not. However, near the end of that post, I found images I had mistakenly put in the wrong training and validation folders.

Continue reading

R: Water Potability

Water Potability

In this post, I explore a dataset with observations of water sample properties and their corresponding drinkability. The dataset is from Aditya Kadiwal on Kaggle. I compare the performance of logistic regression, decision tree, and xgboost classification models by tracking each model’s ROC AUC metric. The xgboost model wins. I use the R tidymodels framework, which aims to unify models and modeling engines under a consistent interface to streamline machine learning workflows.

Continue reading

Overview

I was looking for a dataset on which to train a binary classifier, and I found data for dairy milk quality prediction on Kaggle. Posted by Shrijayan Rajendran and located at https://www.kaggle.com/datasets/cpluzshrijayan/milkquality, it has a single outcome variable that classifies milk into three qualities: “low,” “medium,” and “high.” Seven predictor variables accompany these quality ratings. Conveniently, there are no missing values in the dataset. Ultimately, I converted the three qualities into just two categories.

Continue reading

Introduction

In my post “Python: Deep Learning Organic Chemistry,” I trained a convolutional neural network to recognize a diagram of a benzene ring, which is a crucial structure in many organic chemistry molecules. The classification problem I posed to the convnet was a binary classification to separate diagrams of molecules that contain a benzene ring from those that do not. Using Python, TensorFlow, and Keras, my experiment proceeded in three steps:

Continue reading

R: Solubility Clustering

In my prior aqueous solubility regression study, I did an exploratory data visualization and found intriguing plots of solubility versus other variables in the study. I didn’t perform any experimental modeling of those relationships in that study. Here, I follow up by performing a cluster analysis of solubility relationships to help future regression modeling efforts. My question is: do clusters within each of these relationships explain each feature’s effect on solubility?

Continue reading

Alicia Key

I am passionate about data and science.

PhD student

Aurora, Colorado, USA