

Machine Learning 101: Predicting Drug Use Using Logistic Regression In R
Machine Learning | Modeling | R | Tools & Languages | Logistic Regression
Posted by Leihua Ye, December 24, 2019

Executive Summary
- Generalized Linear Models (GLM)
- Three types of link function: Logit, Probit, and Complementary log-log (cloglog)
- Building a logistic regression to predict drug use and comparing these three link functions
In Machine Learning 101 courses, stats professors introduce GLMs right after linear regression as the next stepping stone toward becoming a data scientist. GLMs come in several forms, and the most well-known link functions are the logit, probit, and cloglog.
These GLMs are well suited for classification questions: to be or not to be, to vote or not to vote, and to click or not to click.
Basics
Usually, a GLM for binary data can be expressed in the following form:

g(p) = β0 + β1x1 + β2x2 + … + βkxk

where the right-hand side is a linear combination of the predictors, and g is a link function that maps the probability p ∈ [0,1] to ℝ.
There are three ways of linking the components on the left and right.
Logit:

g(p) = log(p / (1 − p))

In words, the log of the odds p / (1 − p).
Probit:

g(p) = Φ⁻¹(p)

In words, the inverse of the cumulative distribution function of the standard normal distribution.
Cloglog:

g(p) = log(−log(1 − p))

In words, the log of the negative log of the probability of the event not happening. Confused? At least I am. The name of this link, complementary log-log, at least describes exactly what it does. The short sketch below shows all three links side by side.
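Here is a quick, purely illustrative sketch (not part of the original analysis) to make the three links concrete; base R already provides the first two as qlogis() and qnorm(), and the cloglog can be written directly:

# apply the three link functions to a few example probabilities
p <- c(0.1, 0.5, 0.9)

logit   <- qlogis(p)           # log(p / (1 - p))
probit  <- qnorm(p)            # inverse CDF of the standard normal
cloglog <- log(-log(1 - p))    # complementary log-log

round(data.frame(p, logit, probit, cloglog), 3)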
OK, let’s move on, build GLM models to predict who is more vulnerable to drug use, and learn to read the comparison plots.
1. Load, clean, and split the dataset
library(readr)
drug_use <- read_csv('drug.csv',
                     col_names = c('ID', 'Age', 'Gender', 'Education', 'Country', 'Ethnicity',
                                   'Nscore', 'Escore', 'Oscore', 'Ascore', 'Cscore', 'Impulsive', 'SS',
                                   'Alcohol', 'Amphet', 'Amyl', 'Benzos', 'Caff', 'Cannabis', 'Choc',
                                   'Coke', 'Crack', 'Ecstasy', 'Heroin', 'Ketamine', 'Legalh', 'LSD',
                                   'Meth', 'Mushrooms', 'Nicotine', 'Semer', 'VSA'))

library(dplyr)
# convert the drug-use columns to ordered factors
drug_use <- drug_use %>% mutate_at(vars(Alcohol:VSA), as.ordered)

# recode the demographic variables as labeled factors
drug_use <- drug_use %>%
  mutate(Gender = factor(Gender, labels = c("Male", "Female"))) %>%
  mutate(Ethnicity = factor(Ethnicity, labels = c("Black", "Asian", "White", "Mixed:White/Black",
                                                  "Other", "Mixed:White/Asian", "Mixed:Black/Asian"))) %>%
  mutate(Country = factor(Country, labels = c("Australia", "Canada", "New Zealand", "Other",
                                              "Ireland", "UK", "USA")))

# create a new factor variable called recent_cannabis_use
drug_use <- drug_use %>%
  mutate(recent_cannabis_use = as.factor(ifelse(Cannabis >= "CL3", "Yes", "No")))

# create a new tibble that includes a subset of the original variables,
# then split the data into training and test sets
drug_use_subset <- drug_use %>% select(Age:SS, recent_cannabis_use)

set.seed(1)
train.indices <- sample(1:nrow(drug_use_subset), 1500)
drug_use_train <- drug_use_subset[train.indices, ]
drug_use_test  <- drug_use_subset[-train.indices, ]

dim(drug_use_train)
[1] 1500 13
dim(drug_use_test)
[1] 385 13
So the training set has dimensions 1500 × 13, and the test set has dimensions 385 × 13.
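As an optional sanity check (my addition, not in the original post), it is worth confirming that the outcome is reasonably balanced in both splits before fitting anything:

# distribution of the outcome in the training and test sets
table(drug_use_train$recent_cannabis_use)
prop.table(table(drug_use_train$recent_cannabis_use))
table(drug_use_test$recent_cannabis_use)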
2. Fit a logistic regression
# use logit as the link function
glm_fit = glm(recent_cannabis_use ~ ., data = drug_use_train, family = binomial(link = "logit"))
summary(glm_fit)

Call:
glm(formula = recent_cannabis_use ~ ., family = binomial(link = "logit"),
    data = drug_use_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0024  -0.5996   0.1512   0.5410   2.7525

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   1.33629    0.64895   2.059 0.039480 *
Age                          -0.77441    0.09123  -8.489  < 2e-16 ***
GenderFemale                 -0.65308    0.15756  -4.145 3.40e-05 ***
Education                    -0.41192    0.08006  -5.145 2.67e-07 ***
CountryCanada                -0.67373    1.23497  -0.546 0.585377
CountryNew Zealand           -1.24256    0.31946  -3.890 0.000100 ***
CountryOther                  0.11062    0.49754   0.222 0.824056
CountryIreland               -0.50841    0.69084  -0.736 0.461773
CountryUK                    -0.88941    0.39042  -2.278 0.022720 *
CountryUSA                   -1.97561    0.20101  -9.828  < 2e-16 ***
EthnicityAsian               -1.19642    0.96794  -1.236 0.216443
EthnicityWhite                0.65189    0.63569   1.025 0.305130
EthnicityMixed:White/Black    0.10814    1.07403   0.101 0.919799
EthnicityOther                0.66571    0.79791   0.834 0.404105
EthnicityMixed:White/Asian    0.48986    0.96724   0.506 0.612535
EthnicityMixed:Black/Asian   13.07740  466.45641   0.028 0.977634
Nscore                       -0.08318    0.09163  -0.908 0.363956
Escore                       -0.11130    0.09621  -1.157 0.247349
Oscore                        0.64932    0.09259   7.013 2.33e-12 ***
Ascore                        0.09697    0.08235   1.178 0.238990
Cscore                       -0.30243    0.09179  -3.295 0.000984 ***
Impulsive                    -0.14213    0.10381  -1.369 0.170958
SS                            0.70960    0.11793   6.017 1.78e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of the coefficients is straightforward, and the significant variables include age, gender (female), education, country (New Zealand, UK, and USA), Oscore, Cscore, and SS.
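Since the logit coefficients live on the log-odds scale, a natural follow-up (not shown in the original post) is to exponentiate them into odds ratios, together with Wald confidence intervals:

# convert log-odds coefficients to odds ratios
exp(coef(glm_fit))

# approximate 95% Wald confidence intervals on the odds-ratio scale
exp(confint.default(glm_fit))

For example, the Age coefficient of about -0.77 corresponds to an odds ratio of exp(-0.77) ≈ 0.46, i.e., the odds of recent cannabis use roughly halve for each one-unit increase in Age as coded in this dataset.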
3. Probit and cloglog
# probit link function
glm_fit_probit = glm(recent_cannabis_use ~ ., data = drug_use_train, family = binomial(link = "probit"))
prob_training_probit = predict(glm_fit_probit, type = "response")

# cloglog link function
glm_fit_clog = glm(recent_cannabis_use ~ ., data = drug_use_train, family = binomial(link = "cloglog"))
prob_training_clog = predict(glm_fit_clog, type = "response")
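Before comparing the fitted values graphically, a quick numerical comparison (an addition on my part) is to look at the in-sample AIC of the three fits:

# compare the three fits by AIC (lower is better)
AIC(glm_fit, glm_fit_probit, glm_fit_clog)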
4. Compare the three models in plots
# compare logit and probit
# fitted values for the logit model were not computed above, so compute them here
prob_training_logit = predict(glm_fit, type = "response")

plot(prob_training_logit, prob_training_probit,
     xlab = "Fitted Values of Logit Model",
     ylab = "Fitted Values of Probit Model",
     main = "Plot 1: Fitted Values for Logit and Probit Regressions",
     pch = 19, cex = 0.2)
abline(a = 0, b = 1, col = "red")

As is well known, the probit and logit models predict almost the same values: their fitted values align closely along the 45-degree line. The only noticeable difference lies in the middle range, roughly between 0.5 and 0.8, where the probit model predicts values slightly below the reference line.
# compare logit and cloglog
plot(prob_training_logit, prob_training_clog,
     xlab = "Fitted Values of Logit Model",
     ylab = "Fitted Values of Cloglog Model",
     main = "Plot 2: Fitted Values for Logit and Cloglog Regressions",
     pch = 19, cex = 0.2)
abline(a = 0, b = 1, col = "red")

This is an interesting plot. The cloglog model generates higher predictions than the logit model at the low end of the fitted values, followed by lower and more dispersed predictions in the middle range.
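To push the comparison beyond the training data, the sketch below (my addition, assuming the train/test split created earlier) scores all three models on the held-out test set using classification accuracy at a 0.5 cutoff:

# illustrative sketch: test-set accuracy for the three link functions
predict_class <- function(model, newdata, cutoff = 0.5) {
  probs <- predict(model, newdata = newdata, type = "response")
  ifelse(probs >= cutoff, "Yes", "No")
}

models <- list(logit = glm_fit, probit = glm_fit_probit, cloglog = glm_fit_clog)

sapply(models, function(m) {
  pred <- predict_class(m, drug_use_test)
  mean(pred == drug_use_test$recent_cannabis_use)
})

In practice the three link functions tend to give very similar accuracy on data like this, which is consistent with how closely their fitted values track each other in the plots above.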
In machine learning, logistic regression serves as the 101-level technique that every data scientist should be able to apply.
Originally Posted Here