# Machine Learning 101: Predicting Drug Use Using Logistic Regression In R

Machine Learning | Modeling | R | Tools & Languages | Logistic Regression | posted by Leihua Ye, December 24, 2019

Executive Summary: Generalized Linear Models (GLM); the three link functions, logit, probit, and complementary log-log (cloglog); and building a logistic regression model in R.
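For orientation, the three link functions differ only in how they map a probability in (0, 1) onto the real line. A quick base-R sketch (not from the original post):

```r
# Compare the three binomial link functions on a grid of probabilities.
p <- c(0.1, 0.5, 0.9)
logit   <- log(p / (1 - p))     # log-odds (symmetric around p = 0.5)
probit  <- qnorm(p)             # inverse standard-normal CDF (symmetric)
cloglog <- log(-log(1 - p))     # complementary log-log (asymmetric)
round(cbind(p, logit, probit, cloglog), 3)
```

Logit and probit are both symmetric around p = 0.5, while cloglog is not, which is why their fitted values tend to track each other more closely than either tracks cloglog.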


```
library(readr)
library(dplyr)

# drug_use is assumed to be loaded already (the UCI drug consumption data).
# Treat the drug-use columns as ordered factors.
drug_use <- drug_use %>% mutate_at(vars(Alcohol:VSA), as.ordered)
drug_use <- drug_use %>%
  mutate(Gender = factor(Gender, labels = c("Male", "Female"))) %>%
  mutate(Ethnicity = factor(Ethnicity, labels = c("Black", "Asian", "White",
                                                  "Mixed:White/Black", "Other",
                                                  "Mixed:White/Asian",
                                                  "Mixed:Black/Asian"))) %>%
  mutate(Country = factor(Country, labels = c("Australia", "Canada", "New Zealand",
                                              "Other", "Ireland", "UK", "USA")))

# create a new factor variable called recent_cannabis_use
drug_use <- drug_use %>%
  mutate(recent_cannabis_use = as.factor(ifelse(Cannabis >= "CL3", "Yes", "No")))

# create a new tibble with a subset of the original variables,
# then split the data into training and test sets
drug_use_subset <- drug_use %>% select(Age:SS, recent_cannabis_use)
set.seed(1)
train.indices <- sample(1:nrow(drug_use_subset), 1500)
drug_use_train <- drug_use_subset[train.indices, ]
drug_use_test  <- drug_use_subset[-train.indices, ]
dim(drug_use_train)
[1] 1500   13
dim(drug_use_test)
[1] 385  13
```
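Because the split above is a simple random sample, it is worth confirming that the outcome balance carries over to both sets. A minimal self-contained sketch, using a stand-in data frame since `drug_use` is loaded separately:

```r
# Stand-in for drug_use_subset: 1885 rows with a binary outcome.
set.seed(1)
df <- data.frame(y = factor(sample(c("Yes", "No"), 1885, replace = TRUE)))

train.indices <- sample(1:nrow(df), 1500)
train <- df[train.indices, , drop = FALSE]
test  <- df[-train.indices, , drop = FALSE]

# Outcome proportions should be similar across the two sets.
round(prop.table(table(train$y)), 2)
round(prop.table(table(test$y)), 2)
```

With the post's objects, the same check would be `prop.table(table(drug_use_train$recent_cannabis_use))` against the test-set equivalent.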
```
# use logit as the link function
glm_fit <- glm(recent_cannabis_use ~ ., data = drug_use_train,
               family = binomial(link = "logit"))
summary(glm_fit)

Call:
glm(formula = recent_cannabis_use ~ ., family = binomial(link = "logit"),
    data = drug_use_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0024  -0.5996   0.1512   0.5410   2.7525

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   1.33629    0.64895   2.059 0.039480 *
Age                          -0.77441    0.09123  -8.489  < 2e-16 ***
GenderFemale                 -0.65308    0.15756  -4.145 3.40e-05 ***
Education                    -0.41192    0.08006  -5.145 2.67e-07 ***
CountryNew Zealand           -1.24256    0.31946  -3.890 0.000100 ***
CountryOther                  0.11062    0.49754   0.222 0.824056
CountryIreland               -0.50841    0.69084  -0.736 0.461773
CountryUK                    -0.88941    0.39042  -2.278 0.022720 *
CountryUSA                   -1.97561    0.20101  -9.828  < 2e-16 ***
EthnicityAsian               -1.19642    0.96794  -1.236 0.216443
EthnicityWhite                0.65189    0.63569   1.025 0.305130
EthnicityMixed:White/Black    0.10814    1.07403   0.101 0.919799
EthnicityOther                0.66571    0.79791   0.834 0.404105
EthnicityMixed:White/Asian    0.48986    0.96724   0.506 0.612535
EthnicityMixed:Black/Asian   13.07740  466.45641   0.028 0.977634
Nscore                       -0.08318    0.09163  -0.908 0.363956
Escore                       -0.11130    0.09621  -1.157 0.247349
Oscore                        0.64932    0.09259   7.013 2.33e-12 ***
Ascore                        0.09697    0.08235   1.178 0.238990
Cscore                       -0.30243    0.09179  -3.295 0.000984 ***
Impulsive                    -0.14213    0.10381  -1.369 0.170958
SS                            0.70960    0.11793   6.017 1.78e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
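The estimates above are on the log-odds scale, so exponentiating a coefficient gives an odds ratio. A minimal sketch on the built-in `mtcars` data, since `drug_use` is not bundled with this post (`am` stands in as a hypothetical binary outcome):

```r
# Fit a small logistic model and convert coefficients to odds ratios.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))
exp(cbind(OR = coef(fit), confint.default(fit)))  # odds ratios with Wald CIs
```

Applied to the model above, `exp(coef(glm_fit))` would, for example, turn the SS estimate of 0.70960 into an odds ratio of about 2.03 per unit increase in sensation-seeking.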
```
# fit the same model with the probit and cloglog link functions
glm_fit_probit <- glm(recent_cannabis_use ~ ., data = drug_use_train,
                      family = binomial(link = "probit"))
glm_fit_clog   <- glm(recent_cannabis_use ~ ., data = drug_use_train,
                      family = binomial(link = "cloglog"))

# fitted probabilities on the training set for each link
prob_training_logit  <- predict(glm_fit, type = "response")
prob_training_probit <- predict(glm_fit_probit, type = "response")
prob_training_clog   <- predict(glm_fit_clog, type = "response")
```
```
# compare logit and probit
plot(prob_training_logit, prob_training_probit,
     xlab = "Fitted Values of Logit Model",
     ylab = "Fitted Values of Probit Model",
     main = "Plot 1: Fitted Values for Logit and Probit Regressions",
     pch = 19, cex = 0.2)
abline(a = 0, b = 1, col = "red")
```
```
# compare logit and cloglog
plot(prob_training_logit, prob_training_clog,
     xlab = "Fitted Values of Logit Model",
     ylab = "Fitted Values of Cloglog Model",
     main = "Plot 2: Fitted Values for Logit and Cloglog Regressions",
     pch = 19, cex = 0.2)
abline(a = 0, b = 1, col = "red")
```


Originally Posted Here

## Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at UC Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include:

1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data
2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control
3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment
4. Causal Graphical Models, User Engagement, Optimization, and Data Visualization
5. Python, R, and SQL

Connect here:

1. http://www.linkedin.com/in/leihuaye
2. https://twitter.com/leihua_ye
3. https://medium.com/@leihua_ye
