Machine Learning 101: Predicting Drug Use Using Logistic Regression in R

Executive Summary
1. Generalized Linear Models (GLM)
2. Three types of link function: logit, probit, and complementary log-log (cloglog)
3. Building a logistic regression model in R
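Before fitting anything, it helps to see what these three link functions actually do: each maps a probability p in (0, 1) onto the whole real line. As a quick self-contained sketch (not part of the original analysis), base R's `make.link()` exposes all three:

```r
# the three binomial link functions, via base R's make.link()
logit   <- make.link("logit")    # g(p) = log(p / (1 - p))
probit  <- make.link("probit")   # g(p) = qnorm(p), the standard normal quantile
cloglog <- make.link("cloglog")  # g(p) = log(-log(1 - p))

p <- 0.5
c(logit   = logit$linkfun(p),    # 0: logit is symmetric around p = 0.5
  probit  = probit$linkfun(p),   # 0: so is probit
  cloglog = cloglog$linkfun(p))  # about -0.37: cloglog is asymmetric
```

Logit and probit are both symmetric around p = 0.5, so they tend to give nearly identical fitted probabilities; cloglog is asymmetric and can differ more in the tails.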




library(readr)
# read the data; the file has no header row, so supply column names
drug_use <- read_csv("drug.csv",
                     col_names = c("ID", "Age", "Gender", "Education", "Country",
                                   "Ethnicity", "Nscore", "Escore", "Oscore",
                                   "Ascore", "Cscore", "Impulsive", "SS",
                                   "Alcohol", "Amphet", "Amyl", "Benzos", "Caff",
                                   "Cannabis", "Choc", "Coke", "Crack", "Ecstasy",
                                   "Heroin", "Ketamine", "Legalh", "LSD", "Meth",
                                   "Mushrooms", "Nicotine", "Semer", "VSA"))
library(dplyr)
# the drug-use columns are ordinal (CL0 < CL1 < ... < CL6),
# so coerce them to ordered factors
drug_use <- drug_use %>% mutate_at(vars(Alcohol:VSA), as.ordered)
drug_use <- drug_use %>%
  mutate(Gender = factor(Gender, labels = c("Male", "Female"))) %>%
  mutate(Ethnicity = factor(Ethnicity, labels = c("Black", "Asian", "White",
                                                  "Mixed:White/Black", "Other",
                                                  "Mixed:White/Asian",
                                                  "Mixed:Black/Asian"))) %>%
  mutate(Country = factor(Country, labels = c("Australia", "Canada",
                                              "New Zealand", "Other", "Ireland",
                                              "UK", "USA")))

# create a new factor variable called recent_cannabis_use:
# "Yes" when the ordered Cannabis level is CL3 or higher
drug_use <- drug_use %>%
  mutate(recent_cannabis_use = as.factor(ifelse(Cannabis >= "CL3", "Yes", "No")))

# create a new tibble that includes a subset of the original variables
# split the data into training and test sets
drug_use_subset <- drug_use %>% select(Age:SS, recent_cannabis_use)
set.seed(1)
train.indices <- sample(1:nrow(drug_use_subset), 1500)
drug_use_train <- drug_use_subset[train.indices, ]
drug_use_test  <- drug_use_subset[-train.indices, ]
dim(drug_use_train)
[1] 1500   13
dim(drug_use_test)
[1] 385  13
# use logit as the link function
glm_fit <- glm(recent_cannabis_use ~ ., data = drug_use_train,
               family = binomial(link = "logit"))
# fitted probabilities on the training set (used in the plots below)
prob_training_logit <- predict(glm_fit, type = "response")
summary(glm_fit)

Call:
glm(formula = recent_cannabis_use ~ ., family = binomial(link = "logit"), 
    data = drug_use_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0024  -0.5996   0.1512   0.5410   2.7525  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                1.33629    0.64895   2.059 0.039480 *  
Age                       -0.77441    0.09123  -8.489  < 2e-16 ***
GenderFemale              -0.65308    0.15756  -4.145 3.40e-05 ***
Education                 -0.41192    0.08006  -5.145 2.67e-07 ***
CountryCanada             -0.67373    1.23497  -0.546 0.585377    
CountryNew Zealand        -1.24256    0.31946  -3.890 0.000100 ***
CountryOther               0.11062    0.49754   0.222 0.824056    
CountryIreland            -0.50841    0.69084  -0.736 0.461773    
CountryUK                 -0.88941    0.39042  -2.278 0.022720 *  
CountryUSA                -1.97561    0.20101  -9.828  < 2e-16 ***
EthnicityAsian            -1.19642    0.96794  -1.236 0.216443    
EthnicityWhite             0.65189    0.63569   1.025 0.305130    
EthnicityMixed:White/Black 0.10814    1.07403   0.101 0.919799    
EthnicityOther             0.66571    0.79791   0.834 0.404105    
EthnicityMixed:White/Asian 0.48986    0.96724   0.506 0.612535    
EthnicityMixed:Black/Asian 13.07740 466.45641   0.028 0.977634    
Nscore                    -0.08318    0.09163  -0.908 0.363956    
Escore                    -0.11130    0.09621  -1.157 0.247349    
Oscore                     0.64932    0.09259   7.013 2.33e-12 ***
Ascore                     0.09697    0.08235   1.178 0.238990    
Cscore                    -0.30243    0.09179  -3.295 0.000984 ***
Impulsive                 -0.14213    0.10381  -1.369 0.170958    
SS                         0.70960    0.11793   6.017 1.78e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
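The summary above describes in-sample fit; a natural next step is to turn fitted probabilities into class predictions on the held-out test set. The sketch below is not from the original post: a small simulated dataset stands in for `drug_use_train`/`drug_use_test` so it runs on its own, and the conventional 0.5 cutoff is assumed.

```r
# sketch: thresholding logistic-regression probabilities into class labels
set.seed(1)
n <- 500
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(-0.5 + 1.5 * x), "Yes", "No"))
train <- data.frame(x = x[1:400],  y = y[1:400])
test  <- data.frame(x = x[401:n], y = y[401:n])

fit <- glm(y ~ x, data = train, family = binomial(link = "logit"))

# predict(type = "response") returns P(y = "Yes"); 0.5 is the usual cutoff
prob_test <- predict(fit, newdata = test, type = "response")
pred_test <- ifelse(prob_test >= 0.5, "Yes", "No")

# confusion matrix and test accuracy
table(predicted = pred_test, actual = test$y)
mean(pred_test == test$y)
```

With the real objects from the post, the same three steps — `predict(glm_fit, newdata = drug_use_test, type = "response")`, an `ifelse()` threshold, and `table()` — give the test-set confusion matrix.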
# probit link function
glm_fit_probit <- glm(recent_cannabis_use ~ ., data = drug_use_train,
                      family = binomial(link = "probit"))
prob_training_probit <- predict(glm_fit_probit, type = "response")

# complementary log-log (cloglog) link function
glm_fit_clog <- glm(recent_cannabis_use ~ ., data = drug_use_train,
                    family = binomial(link = "cloglog"))
prob_training_clog <- predict(glm_fit_clog, type = "response")
# compare logit and probit
plot(prob_training_logit, prob_training_probit,
     xlab = "Fitted Values of Logit Model",
     ylab = "Fitted Values of Probit Model",
     main = "Plot 1: Fitted Values for Logit and Probit Regressions",
     pch = 19, cex = 0.2)
abline(a = 0, b = 1, col = "red")

# compare logit and cloglog
plot(prob_training_logit, prob_training_clog,
     xlab = "Fitted Values of Logit Model",
     ylab = "Fitted Values of Cloglog Model",
     main = "Plot 2: Fitted Values for Logit and Cloglog Regressions",
     pch = 19, cex = 0.2)
abline(a = 0, b = 1, col = "red")
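The y = x reference line in both plots suggests the three links largely agree. One way to quantify that agreement, sketched here on simulated data standing in for `drug_use_train` (names like `p_logit` are mine, not from the post), is to correlate the fitted probabilities:

```r
# fit the same simulated binary outcome under all three links
set.seed(1)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.3 + 1.2 * x))

p_logit   <- fitted(glm(y ~ x, family = binomial("logit")))
p_probit  <- fitted(glm(y ~ x, family = binomial("probit")))
p_cloglog <- fitted(glm(y ~ x, family = binomial("cloglog")))

# correlations and the largest pointwise disagreement with the logit fit
cor(p_logit, p_probit)
cor(p_logit, p_cloglog)
max(abs(p_logit - p_probit))
```

Logit and probit fits typically track each other almost perfectly; cloglog diverges most where the probabilities are extreme, which matches the spread visible in Plot 2.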



Originally Posted Here

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at UC Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include:

1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data
2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control
3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment
4. Causal Graphical Models, User Engagement, Optimization, and Data Visualization
5. Python, R, and SQL

Connect here:
1. http://www.linkedin.com/in/leihuaye
2. https://twitter.com/leihua_ye
3. https://medium.com/@leihua_ye
