Regular Expression & Treemaps to Visualize Emergency Department Visits Regular Expression & Treemaps to Visualize Emergency Department Visits
It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the... Regular Expression & Treemaps to Visualize Emergency Department Visits

It’s been a while since my last post on some TB WHO data. A lot has happened since then, including the opportunity to attend the Open Data Science Conference (ODSC) East held in Boston, MA. Over a two day period I had the opportunity to listen to a number of leaders in various industries and fields. It was inspiring to learn about the wide variety of data science applications ranging from finance and marketing to genomics and even the refugee crisis.

One of the workshops at ODSC was text analytics, which includes basic text processing, dendrograms, natural language processing and sentiment analysis. This gave me the thought of applying some text analytics to visualize some data I was working on last summer. In this post I’m going to walk through how I used regular expression to label classification codes in a large dataset (NHAMCS) representing emergency department visits in the United States and eventually visualize the data.

The NHAMCS and ICD-9 Codes

Without going into too much detail, the National Health Ambulatory Medical Care Survey (NHAMCS) is large sample survey used to provide an estimate of visits to outpatient and emergency departments in general/short-stay hospital at a national scale for the United States. Those who are well versed in complex probability-sample survey designs (or interested in it), you can read about how it’s set up here. For this post I only using the emergency department data, not outpatient. The main thing to keep in mind moving forward is that the survey’s basic unit of measurement are visits or encounters made in the United States to emergency departments.

Diagnostic information on each visit is collected and coded in the NHAMCS by using the International Classification of Diseases (ICD) codes maintained by the World Health Organization. Specifically the NHAMCS public data files (1992-2014) use the ICD, 9th revision, clinical modification (ICD-9-CM). The 10th revision being currently used throughout the world since 2015 and the 11th revision is in-process, most likely it will be disseminated in 2018. You can check out this package I recently came across that validates and searches co-morbidities of ICD-9 and ICD-10 codes in R.

The Data Game Plan

In this post, the code is only for the 2005 emergency department (ED) NHAMCS dataset. If you interested in getting the original data you can go to this page. There is extensive documentation on the NHAMCS website that helps you download the data for use in SAS, Stata and SPSS. From my current knowledge (hours searching the vast interweb) there is no method in R to download these files. If someone wants to take the challenge of developing a method of pulling down the data using R, go for it! For this post, the data I’m using were Stata files and I used an R package to read-in those files. I’ve included subsetted data for ED NHAMCS from 2005 to 2009 in my GitHub repository if anyone is interested in using those files with my code.

To achieve our goal of visualizing diagnoses in ED visits we need to obtain, clean, and then appropriately setup our data. First we’re going to parse the text in official ICD-9-CM documentation using regular expression to create our own ICD dictionary/lookup table. Next, we’ll take the 2005 ED NHAMCS data, apply the survey design to get the correct estimates per ED visit. Then we’ll use our ICD dictionary to define each code, set it all up using the data.table() package to create a Treemap visualization. I’ll be explaining a fair amount on data cleaning and regular expression, if you want to see the treemap you can scroll down to the end of the post.

Using Regex on ICD-9 Codes

Regular expression or regex is not unique just to R but is used across other types of programming languages (Java, Python, Perl). Simply put, regex is used to extract patterns of characters in a string. There is a good tutorial video here on regex and another video here on using regex in R. Disclaimer: I am not an expert on regex! I am still learning and this post is based on my current knowledge I’ve gathered so far. A good and helpful source I’ve been using  is Rexegg. Regardless, let’s march forward towards our goal!

I’m going to take the 3-digit grouping ICD-9-CM documentation from the CDC’s Health Statistics website, where I downloaded the DINDEX11.zip file. Because of the annoying nature of the rich-text files (RTF), I just copied the entire document into a basic UTF-8 text file called Dc_3d10.txt. My entire code is in my GitHub, but I’ll be referring to particular sections in this post. So moving forward, I’ll read this into R using the readLines function and this output…

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
icd9.3 <- readLines("Dc_3d10.txt") > head(icd9.3,15)
 [1] "Appendix Eu2028List of Three-Digit Categories"
 [2] "1.tINFECTIOUS AND PARASITIC DISEASES"
 [3] "Intestinal infectious diseases (001-009)"
 [4] "001tCholera"
 [5] "002tTyphoid and paratyphoid fevers"
 [6] "003tOther salmonella infections"
 [7] "004tShigellosis"
 [8] "005tOther food poisoning (bacterial)"
 [9] "006tAmebiasis"
[10] "007tOther protozoal intestinal diseases"
[11] "008tIntestinal infections due to other organisms"
[12] "009tIll-defined intestinal infections"
[13] "Tuberculosis (010-018)"
[14] "010tPrimary tuberculous infection"
[15] "011tPulmonary tuberculosis"

It looks like there is a header/title at [1], numeric grouping  at [2] “1.tINFECTIOUS AND PARASITIC DISEASES”,  subgrouping by ICD-9 code ranges, at [3] “Intestinal infectious diseases (001-009)” and then 3-digit ICD-9 codes followed by a specific diagnosis, at [10] “007tOther protozoal intestinal diseases”. At the end we want to produce three separate data frames that we’ll categorize as:

  1. Groups: the title which contains the general diagnosis grouping
  2. Subgroups: the range of ICD-9 codes that contain a certain diagnosis subgroup
  3. Classification: the specific 3-digit ICD-9 code that corresponds with a diagnosis

First we’ll use a simple operation to remove the first line:

1
icd9.3 <- icd9.3[-1]

Next we’ll use regex to extract the Groups using this…

1
icd9.3[grep("^[0-9]+\.",icd9.3)]

The grep() function is used to extract patterns based on regex. To explain the regex, I am selecting lines that start with a digit of any length which is immediately followed by a period. The carat “^” initiates the pattern matching from the start of the string (going left to right). The [0-9] represents any digit from 0 through 9 and the plus sign “+” denotes at least one repeat of any digit. The two backslashes “\” is required because periods in regex are special meta-characters that represent any character (number, letter, symbol). To summarize, this identifies strings that start with “1.”, “2.”, “3.”, …, “17.”, and beyond.

Next I use the gsub() function, which replaces or substitutes based on regex to replace all the “t” in the group names and replace it with two underscore “__” which we’ll use later as an identifier to split the string in half to make the data frame (if this isn’t clear right now, hopefully it will be when we do the splitting).

1
2
3
4
icd9.3g <- gsub("\.t","__",icd9.3g) > icd9.3g[1:3]
[1] "1__INFECTIOUS AND PARASITIC DISEASES"
[2] "2__NEOPLASMS"
[3] "3__ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS"

This regex removes all “t” specifically when it follows a period. Here we use the “\” again to identify a period.

Now we’ll split the string into separate columns within a data frame using the str_split_fixedfunction in the stringr package. I set stringsAsFactors as FALSE so that we maintain the data type of character with the strings. You can read about how this code has caused headaches for many users here. Here I assign Group.n for the number

1
2
3
4
5
6
7
8
9
10
diag.g <- as.data.frame(stringr::str_split_fixed(icd9.3g,"__",2),stringsAsFactors = F)
names(diag.g) <- c("Group.n","Group")
diag.g <- diag.g[c("Group","Group.n")] #swap order of columns > head(diag.g)
                                                                  Group Group.n
1                                     INFECTIOUS AND PARASITIC DISEASES     1
2                                                             NEOPLASMS     2
3 ENDOCRINE, NUTRITIONAL AND METABOLIC DISEASES, AND IMMUNITY DISORDERS     3
4                            DISEASES OF BLOOD AND BLOOD-FORMING ORGANS     4
5                                                      MENTAL DISORDERS     5
6                       DISEASES OF THE NERVOUS SYSTEM AND SENSE ORGANS     6

Now we have a two-column data frame that contains the group number and name of that group so we can refer to for the future. However, later on (I realized when I got there) I needed to create additional columns to account for the range of 3-digit values that correspond to each Group. To do this I used the grep() function to first get the names of each Group and the index of the Group titles in the text file by changing the value argument in grep().

1
2
groupn <- grep("\.t",icd9.3,value=T) #names
groupi <- grep("\.t",icd9.3,value=F) #index

Afterwards I find the lower and upper limit of each Group range by using the index and simple vector arithmetic in combination of gsub() to extract only numbers. For the upper limit and added the maximum value of 999 (only 3-digits remember!) so that the lengths match, since I didn’t extract it from the code above. Then create the final dictionary for the Groups.

1
2
3
4
5
6
low.g <- gsub("[^0-9]",",icd9.3[groupi+2]) #lower limit of range
up.g <- c(gsub("[^0-9]",",icd9.3[groupi-1]),"999") #upper limit of range
diag.g <-data.frame(diag.g,low.g,up.g, stringsAsFactors = F)
names(diag.g) <- c("Group","Group.n","Start","End")

We move on to the Subgroups which we will extract based on the fact that all the corresponding lines have a parentheses contain a pattern of… 3 digits, a dash then another 3 digits. Ie. “Fracture of skull (800-804)” The appropriate regex is as follows…

1
icd9.3sg <- icd9.3[grep("\([0-9]{3}-[0-9]{3}\)",icd9.3)]

The double backslashes are used for parentheses (just like for periods) and I specify the pattern of any 3 digits using the “{}” curly brackets. Next we remove any lingering “t” that might exist and replace with a space ” “. After that we’ll insert the “__” split identifier we used above. This is specified by using regex to match the pattern of 3-digit parentheses that follows directly after a space, which is denoted as “\s”.

1
2
icd9.3sg <- gsub("t"," ",icd9.3sg)
icd9.3sg <- gsub("\s\(([0-9]{3}-[0-9]{3})\)","__\1",icd9.3sg)

Like we did for the Group above, we’ll split the based on the “__” split identifier.

1
2
3
diag.sg <- as.data.frame(
  stringr::str_split_fixed(icd9.3sg,"__",2),
  stringsAsFactors = F)

After that, we even want to split the ranges by the dash “-“ so we can get the ‘start’ and ‘end’ of the ICD-9 code range for each Subgroup. Then we’ll assign the names of the columns in the newly created data frame.

1
2
3
4
5
6
7
8
9
10
11
diag.sg <- as.data.frame(
  cbind(
    diag.sg,
    stringr::str_split_fixed(diag.sg[,2],"-",2),
  stringsAsFactors = F))
names(diag.sg) <- c("Subgroup","Range","Start","End") > head(diag.sg,4)
                        Subgroup &nbsp; Range Start End
1 Intestinal infectious diseases 001-009   001 009
2                   Tuberculosis 010-018   010 018
3    Zoonotic bacterial diseases 020-027   020 027
4       Other bacterial diseases 030-041   030 041

Now we’ll extract the Classifications. We want to generate a regex that captures a pattern of:

  1. A 3-digit sequence followed by a “t”,
  2. A capital “V” followed by a 2-digit sequence and then by a “t”
  3. A capital “E” followed by a 2-digit sequence and then by a “t”

I ended up ignoring the V’s and E’s after the data.table manipulation I use later on! We can use the “|” or the OR operator in regex to capture these 3 patterns. Using similar regex from the Groups and Subgroups we generate this…

1
icd9.3c <- icd9.3[grep("(^[0-9]{3}t)|(^V[0-9]{2}t)|(E[0-9]{3}t)",icd9.3)]

And then we can do the same method as the Group to remove any lingering “t” and split the string to get an output of a data frame. Also

1
2
3
4
5
6
7
8
9
10
11
icd9.3c <- gsub("t","__",icd9.3c)
diag.c <- as.data.frame(stringr::str_split_fixed(icd9.3c,"__",2),stringsAsFactors = F)
names(diag.c) <- c("Code","Classification")
diag.c <- diag.c[c("Classification","Code")] > head(diag.c)
                    Classification Code
1                          Cholera  001
2   Typhoid and paratyphoid fevers  002
3      Other salmonella infections  003
4                      Shigellosis  004
5 Other food poisoning (bacterial)  005
6                        Amebias

Eugene Joh

Eugene Joh

A detail-oriented MS and MPH graduate with a strong quantitative background, motivated to use his current and developing skill set to improve the health of vulnerable populations in local and global settings.


Notice: Undefined variable: logo_content in /home/odsc1733/public_html/opendatascience.com/wp-content/plugins/mobile-menu/includes/class-wp-mobile-menu-core.php on line 310