Max Kuhn

Max Kuhn, Director - Pfizer

Bio: I am a Ph.D. statistician with experience in several domains: pharmaceutical non-clinical statistics, molecular diagnostic R&D, assay development, manufacturing support/Six Sigma, and clinical statistics (in order of my interests). I have worked both as an individual contributor and as a director of moderately sized groups (currently 2 people, 12 previously). I prefer problems where creativity is key to the solution. For example, predictive modeling (i.e., machine learning) is an area where traditional statistics may be limiting; as long as you can prove that a solution performs well, any idea is on the table. Complex problems interest me. A friend once remarked that I thrive on making order out of chaos and, to some extent, this is true. One of the most overlooked skills in these situations is the ability to remain focused on the core objective. This is especially true when we have access to large amounts of data.

Specialties: Predictive modeling/machine learning/pattern recognition | Computational biology and chemistry | High-dimensional biology | The design and analysis of experiments

Intro to Caret, Model Training and Tuning

Contents: Model Training and Parameter Tuning; An Example; Basic Parameter Tuning; Notes on Reproducibility; Customizing the Tuning Process; Pre-Processing Options; Alternate Tuning Grids; Plotting the Resampling Profile; The trainControl Function; Alternate Performance Metrics; Choosing the Final Model; Extracting Predictions and Class Probabilities; Exploring and Comparing Resampling Distributions; Within-Model; Between-Models; Fitting Models Without Parameter Tuning. 5.1 Model Training and […]

Intro to Caret: Data Splitting

Contents: Simple Splitting Based on the Outcome; Splitting Based on the Predictors; Data Splitting for Time Series; Data Splitting with Important Groups. 4.1 Simple Splitting Based on the Outcome. The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and […]
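The idea of sampling within each class can be illustrated with a small, language-agnostic sketch. The Python below is not caret's implementation (caret is an R package); the function name stratified_split and its parameters p and seed are hypothetical names chosen for this illustration. It only shows, under those assumptions, how drawing the same fraction from every class keeps the class proportions of the training set close to those of the full data.

```python
import random
from collections import defaultdict


def stratified_split(labels, p=0.8, seed=0):
    """Return sorted training indices, sampling a fraction p within each class.

    Hypothetical helper for illustration only; not part of caret.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Draw (approximately) the same fraction from every class.
    train = []
    for indices in by_class.values():
        k = round(p * len(indices))
        train.extend(rng.sample(indices, k))
    return sorted(train)


# Toy outcome vector with unbalanced classes (made-up data).
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
train_idx = stratified_split(labels, p=0.8, seed=42)
train_set = set(train_idx)
test_idx = [i for i in range(len(labels)) if i not in train_set]
```

With this toy vector, the training indices contain 40, 24, and 16 rows of classes a, b, and c respectively, so the 50/30/20 class balance is preserved in both the training and the held-out rows.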

Do Resampling Estimates Have Low Correlation to the Truth?

The Answer May Shock You. One criticism that is often leveled against using resampling methods (such as cross-validation) to measure model performance is that there is no correlation between the CV results and the true error rate. Let’s look at this with some simulated data. While this assertion is often correct, there are a few […]

Intro to Caret: Pre-Processing

Editor’s note: This is the third of a series of posts on the caret package. Contents: Creating Dummy Variables; Zero- and Near Zero-Variance Predictors; Identifying Correlated Predictors; Linear Dependencies; The preProcess Function; Centering and Scaling; Imputation; Transforming Predictors; Putting It All Together; Class Distance Calculations. caret includes several functions to pre-process the predictor data. It assumes that […]

Intro to caret: Visualizations

Editor’s note: This is the second of a series of posts on the caret package. The featurePlot function is a wrapper for different lattice plots to visualize the data. For example, the following figures show the default plot for continuous outcomes generated using the featurePlot function. For classification data sets, the iris data are used for illustration. […]

The caret Package

Editor’s note: This is the first of a long series of posts on the caret package. Introduction: The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for: data splitting, pre-processing, feature selection, model tuning using resampling, […]

Optimizing with Nonlinear Programming

Rafael Ladeira asked on GitHub: I was wondering why it doesn’t implement some others algorithms for search for optimal tuning parameters. What would be the caveats of using a genetic algorithm, for instance, instead of grid or random search? Do you think using some of those powerful optimization algorithms for tuning parameters is a […]

Three Aspects of Predictive Modeling

These slides were originally posted on appliedpredictivemodeling.com, and were kindly contributed to Open Data Science. Link to presentation: Three Aspects of Predictive Modeling. By: Max Kuhn, Ph.D. Presentation Overview: “Predictive modeling” definition; Some example applications; A short overview and example; How is this different from what statisticians already do?; Unmet challenges in applied modeling; Predictive […]