

WHO Tuberculosis Data & ggplot2
BlogData VisualizationRposted by Eugene Joh August 2, 2017 Eugene Joh

So it has been a while since my previous post on some data taken from the UNHCR database. This post we’ll bring it back to the topic of infectious diseases (check out my other posts on the SIR model and MRSA typing). For this post, as similar to previous ones, I give a guide through the process I took to organize the data, a nifty function I created to search the documentation and a visualization of the data using ggplot2.

What is Tuberculosis?
Tuberculosis (TB) is a nasty disease caused by the bacterium Mycobacterium tuberculosis. It was first described by Hippocrates and has a history concurrent with human civilization. Individuals infected with TB usually do not develop active disease but if they develop active TB, they can on average infect 8-10 people. Most of the time when we hear of TB, we think of the infection within the lungs, but I learned during my time volunteering in a medical clinic in Peru that TB can manifest in other sites of the body (extra-pulmonary TB). This bacterium has infected roughly a third of the world’s population and is a top 10 cause of death around the globe.
There are two major modern concerns with TB…
- TB and HIV co-infection, where there is a disproportionate mortality rate by TB with HIV infected individuals. HIV infects CD4+ T-cells, which are essential for an appropriate immune response to combat TB. This relationship has been reported in patients.
- Multi-drug resistant TB (MDR-TB) is a growing public health concern that threatens the control of TB infection throughout the world. There have been reports of extensive-drug resistant TB (XDR-TB) which is resistant to the standard treatment of isoniazid and rifampin alongside any fluoroquinolone and at least one secondary-treatment drug. Fewer treatment options put patients under rigorous and strenuous regimens and present a risk of transmission to others.
Data Retrieval
The data I used for this post is taken from the World Health Organization from this link to the TB data. Specifically I used the data dictionary and TB burden estimates. When you click these links, they generate a comma-delimited file (.csv) and date-stamp the file. I downloaded these files back in February 2016. Just remember where you save these in your local directory so you can read them into R for the near future.
Getting Started
The code I use for my posts can be found on my GitHub account, open for free use and adaptation. I’m always open to comments and suggestions on how to improve my code or even a completely different method to reach the same goal. Optimization of R programming is something I am working towards bit by bit.
After setting up your working directory, read your comma-delimited files using the read.csv()function. The WHO files are conveniently formatted so there is no need for elaborate arguments for headers, field separators or assigning missing values.
1 2 | TB.burden <- read.csv ( "TB_burden_countries_2016-02-18.csv" ) #TB burden dataset TB.dic <- read.csv ( "TB_data_dictionary_2016-02-18.csv" ) #TB documentation dictionary |
Customized Search Function
While completing an R assignment for one of my classes, I ran into the readline() function which allows for some interactive functionality within R. The reason I created this interactive custom search function is because of the fact that the column variables in the TB burden .csv file are not completely straightforward and it’s annoying to manually search through the file itself in Microsoft Excel. So instead of memorizing all of them (only to forget after I write 10 lines of code), I decided to create this function whenever I start to write code for a new plot and need to remind myself the exact description of the variable I am working with…
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | TB.search <- function (x){ # look <- names (x) #assign the names/column names of the subsetted data message ( "Matching variable names found in the documentation files" ) TB.d <- as.character (TB.dic[,4]) # conversion to character to have "clean" output out <- look[look % in % TB.dic[,1]] #assignment of vector: which names of the selected subset are in the documentation file print (out) #check which variable names from dictionary match the input message ( "Above includes all names for variable names found in the documentation files" ) #message prompt for output z <- readline ( "Search Variable Definition (CASE-SENSITIVE): " ) #interactive read-in of variable name if (z % in % out){ # NEED TO MAKE THE CONDITION TO SEPARATE integer(0) from real integer values print (TB.d[ grep ( paste0 ( "^(" ,z, "){1}?" ),TB.dic[,1])]) #print the definition found in TB.d } else { repeat { #repeat the line below for readline() if exact match of doesn't exist y<- readline ( "Please re-input (press 'ESC' to exit search): " ) #second prompt if (y % in % out) break # the condition that ends the repeat, that the input matches a name in the documentation } print (TB.d[ grep ( paste0 ( "^" ,y, "{1}?" ),TB.dic[,1])]) #print the definition found in documentation file TB.dic } } |
I’m only going to highlight a couple things in the code above. I converted the definitions found in the documentation file because the original data type was a factor. This was a problem with the print output because it would show all the unnecessary information on the number of factors and values of the factors I had no interest in. Secondly I ended up using a repeat in my if statement to prompt the user if they incorrectly typed in a variable name in the console. This link provides some helpful information on the repeat loop. Other then that, I used regular expression to specify the search to match what was actually in the documentation file.
Cleaning Data
As with all data, it doesn’t come in the form that you always want it in. Luckily the WHO TB data wasn’t in bad shape and I just converted all integer valued columns into numeric types using a simple loop. (Comment if you know how to do this using the apply family!)
1 2 3 4 5 | for (k in 1: length ( names (TB.burden))){ #convert all the rows with type 'integer' to 'numeric if ( is.integer (TB.burden[,k])){ TB.burden[,k] <- as.numeric (TB.burden[,k]) } } |
After that, I used the mapvalues function from the plyr package to change the levels in the country variable. I used this because the basic functions in R require me to specify all the levels in the factor and using 219 country names seemed tedious. Quite simply, I just listed the old names I wanted to change and created a respective character vector of the the new country names.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | library (plyr) TB.burden$country <- ( mapvalues ((TB.burden$country), from = c ( "Bolivia (Plurinational State of)" , "Bonaire, Saint Eustatius and Saba" , "China, Hong Kong SAR" , "China, Macao SAR" , "Democratic People's Republic of Korea" , "Democratic Republic of the Congo" , "Iran (Islamic Republic of)" , "Lao People's Democratic Republic" , "Micronesia (Federated States of)" , "Republic of Korea" , "Saint Vincent and the Grenadines" , "Sint Maarten (Dutch part)" , "The Former Yugoslav Republic of Macedonia" , "United Kingdom of Great Britain and Northern Ireland" , "United Republic of Tanzania" , "Venezuela (Bolivarian Republic of)" , "West Bank and Gaza Strip" ) , to = c ( "Bolivia" , "Caribbean Netherlands" , "Hong Kong" , "Macao" , "North Korea" , "DRC" , "Iran" , "Laos" , "Micronesia" , "South Korea" , "St. Vincent &amp;amp;amp; Grenadines" , "Sint Maarten" , "Former Yugoslav (Macedonia)" , "UK &amp;amp;amp; Northern Ireland" , "Tanzania" , "Venezuela" , "West Bank and Gaza" ) )) |
I threw in some random subsetting into the code to pull out values for specific questions (ie. what was the estimated prevalence per 100,000 people of TB in Peru during 2014). You can see it in the actual code in the GitHub repository.
Visualizing Data
I was interested in what estimated rates of TB were by WHO region, so I ran a simple loop to assign new variable names to each region (this is not the best practice IMO but it does the trick).
1 2 3 4 5 6 | for (i in levels (TB.burden$g_whoregion)){ #selection of the WHO regions (AFR, AMR, EMR, EUR, SEA, WPR) nam <- paste0 (i, "_whoregion" ) #creation of the name based on acronyms from levels assign (nam, subset (TB.burden,TB.burden$g_whoregion % in % paste (i))) #assigning each level a subset based on the region } ls () #just to show what variables are stored... #look for AFR_whoregion, AMR_whoregion, EMR_whoregion, EUR_whoregion, SEA_whoregion, WPR_whoregion |
Now we set up the themes we want to use for ggplot2. I set the year scale for the legends, customized the text on the axes and assigned dark_t to be a dark, black-grey background and light_t as a simple, white-grey background theme.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 | ctext <- theme (axis.ticks = element_blank (), plot.title= element_text (family= "Verdana" ), axis.text.x= element_text (size=9,family= "Verdana" ), #http://www.cookbook-r.com/Graphs/Fonts/#table-of-fonts axis.title.x= element_text (size=10) axis.text.y= element_text (size=7.5,family= "Verdana" ), axis.title.y- element_text (size=10,family= "Verdana" ), legend.title= element_text (size=9,family= "Verdana" ,vjust=0.5), #specify legend title color legend.text= element_text (size=7,family= "Verdana" ,face= "italic" ), ) dark_t <- theme_minimal ()+ theme ( plot.background = element_rect (fill= "#191919" ), #plot background panel.border = element_blank (), #removes border panel.background = element_rect (fill = "#000000" ,colour= "#000000" ,size=2), #panel background and border panel.grid = element_line (colour = 1 |