fbpx
A Data Pattern with an R data.table Solution. A Data Pattern with an R data.table Solution.
Summary: This blog examines a loading pattern seen often with government-generated, web-accessible data. The data comprise millions of records across multiple... A Data Pattern with an R data.table Solution.

Summary: This blog examines a loading pattern seen often with government-generated, web-accessible data. The data comprise millions of records across multiple text or csv files, generally demarcated by time. The files may present different, but overlapping, attributes, while much of the data has a character representation, posing the challenge of how to optimally manifest it in R (or Python-Pandas) — either as character or “factor”. In the analysis that follows, I consider an R data.table example. The to-dos are to load files that don’t all possess the same attributes, and to optimize storage for character variables. The main finding is that factor representations of character variables should probably be preferred in cases like this one. Those with an intermediate level of R programming skill will best benefit from the code that follows.

[More from the writer: Creating if/elseif/else Variables in Python/Pandas]

I’m working currently with a large data set from the Center for Medicare and Medicaid Services (CMS). A few years ago, I analyzed other CMS data that presented a similar pattern:
1) Multiple csv/text files delineated by time.
2) Preponderance of character/string data types, along with a smattering of dates, booleans, and numerics.
3) The various csv/text files may or may not have identical attributes, though there’s much commonality.
4) The data from the union of all columns is to be maintained, with NA’s assigned for missings.
5) The combined data is often quite large by notebook standards — in excess of 5 GB.
6) And since R/Python-Pandas are memory intensive, it’s critical to optimize storage demands.

The challenge I take on below is to efficiently load and store the data from the six large open payments text files into a single R data.table, accommodating an overlapping but not identical attribute structure. In addition, I contrast storing character data attributes as either character or factor data types.

The factor is an R data type used to represent categories of data. The categories (levels) themselves may be any data type but are most often character or text. Factor values on the other hand are denoted by a vector of integers that “point” to the categories. An R example would be variable gender, consisting of levels c(“Female”,”Male”), represented by values c(1,2). One obvious advantage of factors over text data is that the labels are stored just once, potentially saving storage. A second advantage is that factor ordering is flexible based on how they’re constructed.

Customary guidance on the use of factors revolves on smallish data sizes (100,000 records) and counsels to use factors for attributes with cardinality (# of distinct values) in the hundreds or less. With the much larger sizes here, I can let the data do the talking.

The supporting platform is a Wintel 10 notebook with 128 GB RAM, along with software JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, pryr, plyr, and knitr packages are featured, as well as functions from my personal stash, detailed below.

Set options, load libraries, and import personal functions. Most of these functions are used in this exercise and revolve on gernerating metadata and counts/frequencies of R data.tables.

In [1]:
options(warn=-1) 
options(scipen = 20)
options(datatable.print.topn=100)
options(datatable.showProgress=FALSE)
options(stringsAsFactors=TRUE)
usualsuspects <- c(
    'tidyverse', 'data.table', 'pryr', 'plyr','readxl', 'bit',     
    'rvest', 'magrittr','lubridate','rlist', 'tictoc', 'skimr',
    'fst','feather',
    'knitr', 'kableExtra',  
    'ggplot2','RColorBrewer'
) 
suppressMessages(invisible(lapply(usualsuspects, library, character.only = TRUE)))
funcsdir <- "/steve/r/functions"
funcsfile <- "rfunctions.r"
setwd(funcsdir)
source(funcsfile)
blanks(1)
lsf.str()
blanks(2)
allfreqs : function (dtn, catlim = 100)
blanks : function (howmany)  
colsize : function (dt)  
dtmeta : function (df)  
freqsdt : function (DTstr, xstr, percent = TRUE)  
meta : function (df, data = FALSE, dict = TRUE)  
mksumfreq : function (freqalldt)  
mksumfreq2 : function (dt)  
mykab : function (dt)  
obj_sz : function (obj)  
prhead : function (df, howmany = 6)
Set the working directory and identify files for loading. These files were previously downloaded from the CMS website.
In [2]:
dname <- "c:/bigdata/raw/medicare/cms"
setwd(dname)
fname <- c("OP_DTL_GNRL_PGYR2013_P01172020.csv","OP_DTL_GNRL_PGYR2014_P01172020.csv",
           "OP_DTL_GNRL_PGYR2015_P01172020.csv","OP_DTL_GNRL_PGYR2016_P01172020.csv",
           "OP_DTL_GNRL_PGYR2017_P01172020.csv","OP_DTL_GNRL_PGYR2018_P01172020.csv")
blanks(2)
Take a peek at 100 records from each file to determine variable names. The first three files have an identical 65 attributes; the last three have an identical 75 attributes. The union of columns from all files numbers 91, which, after adding one for the file identifier, represents the number of attributes in the aggregate data.table comprised below.
In [3]:
frdfnct <- function(f) tolower(names(fread(f,nrows=100)))
nmes <- lapply(fname,frdfnct)
lapply(nmes,length)
blanks(1)
setequal(nmes[[1]],nmes[[2]])
setequal(nmes[[1]],nmes[[3]])
setequal(nmes[[1]],nmes[[4]])
setequal(nmes[[4]],nmes[[5]])
setequal(nmes[[4]],nmes[[6]])
blanks(1)
length(nmes[[1]]) + length(setdiff(nmes[[4]],nmes[[1]]))
length(setdiff(nmes[[4]],nmes[[1]])) + length(setdiff(nmes[[1]],nmes[[4]])) + 
    length(intersect(nmes[[4]],nmes[[1]]))
blanks(2)
  1. 65
  2. 65
  3. 65
  4. 75
  5. 75
  6. 75
TRUE
TRUE
FALSE
TRUE
TRUE
91
91

Load the data into a data.table with character attributes stored as factors. Over 60M records with 92 attributes totaling 22 GB RAM. 81 of the attributes are factors. Much of the load cycle is devoted to updating blanks to NA’s. A column present in files 1-3 but not in 4-6 generates NA’s for the missing values. Much of what follows details metadata about the loaded data.table.

In [4]:
tic()
cmsgf <- rbindlist(lapply(fname, 
            fread, stringsAsFactors=TRUE,na.strings=""),use.names=TRUE, fill=TRUE, idcol="file") %>% 
            setnames(.,tolower(names(.)))
cmsgf[,c("date_of_payment","payment_publication_date"):=
      .(mdy(as.character(date_of_payment)),mdy(as.character(payment_publication_date)))]
dtmeta(cmsgf)
toc()
blanks(2)
|name |class                        |rows    |columns|size    |
|:----|:----------------------------|:-------|:------|:-------|
|cmsgf|c("data.table", "data.frame")|60658797|92     |22.04 GB|

|type     |N  |
|:--------|:--|
|integer  |7  |
|factor   |81 |
|integer64|1  |
|numeric  |1  |
|Date     |2  |
421.12 sec elapsed
Tabulate frequencies for each of the 92 attributes, storing the results in a single data.table, limiting the number of categories displayed to 5000/variable.
In [5]:
tic()
fcmsgf <- allfreqs("cmsgf",catlim=5000)
meta(fcmsgf,data=TRUE)
blanks(1)
toc()
blanks(2)
|name  |class                        |rows  |columns|size     |
|:-----|:----------------------------|:-----|:------|:--------|
|fcmsgf|c("data.table", "data.frame")|146059|4      |358.45 MB|

|vname|value|frequency|percent  |
|:----|:----|:--------|:--------|
|file |4    |11564445 |19.064745|
|file |3    |11554363 |19.048124|
|file |2    |11316275 |18.655621|
|file |5    |11235934 |18.523173|
|file |6    |10815629 |17.830273|
|file |1    |4172151  |6.878064 |

|vname                              |value       |frequency|percent |
|:----------------------------------|:-----------|:--------|:-------|
|associated_drug_or_biological_ndc_5|0234-0575-17|1        |0.000002|
|associated_drug_or_biological_ndc_5|0234-0575-16|1        |0.000002|
|associated_drug_or_biological_ndc_5|50419-171-03|1        |0.000002|
|associated_drug_or_biological_ndc_5|0008-0100-01|1        |0.000002|
|associated_drug_or_biological_ndc_5|58809-888-01|1        |0.000002|
|associated_drug_or_biological_ndc_5|0310-6225-60|1        |0.000002|
Classes 'data.table' and 'data.frame':	146059 obs. of  4 variables:
 $ vname    : chr  "file" "file" "file" "file" ...
 $ value    : Factor w/ 4127783 levels "1","2","3","4",..: 4 3 2 5 6 1 7 9 10 8 ...
 $ frequency: int  11564445 11554363 11316275 11235934 10815629 4172151 60614943 38051 5418 385 ...
 $ percent  : num  19.1 19 18.7 18.5 17.8 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL
72.34 sec elapsed
 
Summarize the frequencies — one record per attribute.
In [6]:
prhead(mksumfreq(fcmsgf))
blanks(2)
                    vname    N  sumfreq    sumpct I
1:                   file    6 60658797 100.00000 1
2:            change_type    4 60658797 100.00000 2
3: covered_recipient_type    2 60658797 100.00000 3
4:  teaching_hospital_ccn 1289 60658797 100.00000 4
5:   teaching_hospital_id 5000 60654422  99.99279 5
6: teaching_hospital_name 3817 60658797 100.00000 6
                                                       vname    N  sumfreq
1:                       associated_drug_or_biological_ndc_4  481 60658797
2:                         covered_or_noncovered_indicator_5    3 60658797
3: indicate_drug_or_biological_or_device_or_medical_supply_5    5 60658797
4:                    product_category_or_therapeutic_area_5  476 60658797
5:  name_of_drug_or_biological_or_device_or_medical_supply_5 2126 60658797
6:                       associated_drug_or_biological_ndc_5  278 60658797
   sumpct  I
1:    100 87
2:    100 88
3:    100 89
4:    100 90
5:    100 91
6:    100 92

List the first 6 attributes present in files 1-3 but not in 4-6. Then identify the first 6 attributes present in files 4-6 but not in 1-3.

In [7]:
head(setdiff(nmes[[1]],nmes[[4]]))
blanks(1)
head(setdiff(nmes[[4]],nmes[[1]]))
blanks(2)
  1. ‘product_indicator’
  2. ‘name_of_associated_covered_drug_or_biological1’
  3. ‘name_of_associated_covered_drug_or_biological2’
  4. ‘name_of_associated_covered_drug_or_biological3’
  5. ‘name_of_associated_covered_drug_or_biological4’
  6. ‘name_of_associated_covered_drug_or_biological5’
  1. ‘related_product_indicator’
  2. ‘covered_or_noncovered_indicator_1’
  3. ‘indicate_drug_or_biological_or_device_or_medical_supply_1’
  4. ‘product_category_or_therapeutic_area_1’
  5. ‘name_of_drug_or_biological_or_device_or_medical_supply_1’
  6. ‘associated_drug_or_biological_ndc_1’

Confirm NA’s by file for non-common columns using the flexible freqsdt function. Check.

In [8]:
mykab(freqsdt("cmsgf","file,product_indicator")[order(file)])
blanks(1)
mykab(freqsdt("cmsgf","file,related_product_indicator")[order(file)])
blanks(2)
|file|product_indicator|frequency|percent  |
|:---|:----------------|:--------|:--------|
|1   |Covered          |3827241  |6.309457 |
|1   |None             |184494   |0.304150 |
|1   |Non-Covered      |136125   |0.224411 |
|1   |Combination      |24291    |0.040045 |
|2   |Covered          |10380620 |17.113132|
|2   |None             |528144   |0.870680 |
|2   |Non-Covered      |357677   |0.589654 |
|2   |Combination      |49834    |0.082155 |
|3   |Covered          |10651317 |17.559394|
|3   |None             |531260   |0.875817 |
|3   |Non-Covered      |307937   |0.507654 |
|3   |Combination      |63849    |0.105259 |
|4   |NA               |11564445 |19.064745|
|5   |NA               |11235934 |18.523173|
|6   |NA               |10815629 |17.830273|
|file|related_product_indicator|frequency|percent  |
|:---|:------------------------|:--------|:--------|
|1   |NA                       |4172151  |6.878064 |
|2   |NA                       |11316275 |18.655621|
|3   |NA                       |11554363 |19.048124|
|4   |Yes                      |10986341 |18.111703|
|4   |No                       |578104   |0.953042 |
|5   |Yes                      |10698759 |17.637605|
|5   |No                       |537175   |0.885568 |
|6   |Yes                      |10283818 |16.953548|
|6   |No                       |531811   |0.876725 |
Now conduct the same loading exercise storing the attributes as character rather than factor. Note that elapsed load time is more than double the previous for factor, and the RAM demands, at 40 GB, are almost double.
In [9]:
tic()
cmsgs <- rbindlist(lapply(fname, 
            fread, stringsAsFactors=FALSE,na.strings=""),use.names=TRUE, fill=TRUE, idcol="file") %>% 
            setnames(.,tolower(names(.)))
cmsgs[,c("date_of_payment","payment_publication_date"):=
      .(mdy(date_of_payment),mdy(payment_publication_date))]
dtmeta(cmsgs)
toc()
blanks(2)
|name |class                        |rows    |columns|size    |
|:----|:----------------------------|:-------|:------|:-------|
|cmsgs|c("data.table", "data.frame")|60658797|92     |40.31 GB|

|type     |N  |
|:--------|:--|
|integer  |7  |
|character|81 |
|integer64|1  |
|numeric  |1  |
|Date     |2  |
999.27 sec elapsed
 

Tally frequencies for each of the 92 attributes in the character strings version of the data.table. The elapsed time for this operation is double that of the factor version.

In [10]:
tic()
fcmsgs <- allfreqs("cmsgs",catlim=5000)
meta(fcmsgs,data=TRUE)
blanks(1)
toc()
blanks(2)
|name  |class                        |rows  |columns|size    |
|:-----|:----------------------------|:-----|:------|:-------|
|fcmsgs|c("data.table", "data.frame")|146059|4      |10.92 MB|

|vname|value|frequency|percent  |
|:----|:----|:--------|:--------|
|file |4    |11564445 |19.064745|
|file |3    |11554363 |19.048124|
|file |2    |11316275 |18.655621|
|file |5    |11235934 |18.523173|
|file |6    |10815629 |17.830273|
|file |1    |4172151  |6.878064 |

|vname                              |value       |frequency|percent |
|:----------------------------------|:-----------|:--------|:-------|
|associated_drug_or_biological_ndc_5|0234-0575-17|1        |0.000002|
|associated_drug_or_biological_ndc_5|0234-0575-16|1        |0.000002|
|associated_drug_or_biological_ndc_5|50419-171-03|1        |0.000002|
|associated_drug_or_biological_ndc_5|0008-0100-01|1        |0.000002|
|associated_drug_or_biological_ndc_5|58809-888-01|1        |0.000002|
|associated_drug_or_biological_ndc_5|0310-6225-60|1        |0.000002|
Classes 'data.table' and 'data.frame':	146059 obs. of  4 variables:
 $ vname    : chr  "file" "file" "file" "file" ...
 $ value    : Factor w/ 103193 levels "1","2","3","4",..: 4 3 2 5 6 1 10 8 9 7 ...
 $ frequency: int  11564445 11554363 11316275 11235934 10815629 4172151 60614943 38051 5418 385 ...
 $ percent  : num  19.1 19 18.7 18.5 17.8 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL
160.92 sec elapsed
 
Summarize the frequencies by attribute name.
In [11]:
prhead(mksumfreq(fcmsgs))
blanks(2)
                    vname    N  sumfreq    sumpct I
1:                   file    6 60658797 100.00000 1
2:            change_type    4 60658797 100.00000 2
3: covered_recipient_type    2 60658797 100.00000 3
4:  teaching_hospital_ccn 1289 60658797 100.00000 4
5:   teaching_hospital_id 5000 60654422  99.99279 5
6: teaching_hospital_name 3817 60658797 100.00000 6
                                                       vname    N  sumfreq
1:                       associated_drug_or_biological_ndc_4  481 60658797
2:                         covered_or_noncovered_indicator_5    3 60658797
3: indicate_drug_or_biological_or_device_or_medical_supply_5    5 60658797
4:                    product_category_or_therapeutic_area_5  476 60658797
5:  name_of_drug_or_biological_or_device_or_medical_supply_5 2126 60658797
6:                       associated_drug_or_biological_ndc_5  278 60658797
   sumpct  I
1:    100 87
2:    100 88
3:    100 89
4:    100 90
5:    100 91
6:    100 92

 
Next, write a function to denote the storage allocations for each attribute of a data.table. This will allow us to compare the size of character attributes against the corresponding factors on an attribute basis. colsize deploys a functional coding metaphor.
In [12]:
colsize <- function(dt)
{
    f <- function(nm) data.table(cname=nm,type=class(dt[[nm]]),cats=length(unique(dt[[nm]])),sz=obj_sz(dt[[nm]]),osz=object_size(dt[[nm]]))
    map_df(names(dt), f)[,c("cumosz","I"):=.(cumsum(osz),.I)]  
    
}
blanks(2)

First up, the factor data.table version. For each column, there’s the name, data type, # distinct categories, size, size in bytes, and cumulative size in bytes.

In [13]:
tic()
colszf = colsize(cmsgf)
setnames(colszf,paste0(names(colszf),"_f"))
prhead(colszf)
toc()
blanks(2)
                  cname_f  type_f cats_f      sz_f     osz_f   cumosz_f I_f
1:                   file integer      6 231.39 MB 242635240  242635240   1
2:            change_type  factor      4  231.4 MB 242635888  485271128   2
3: covered_recipient_type  factor      2  231.4 MB 242635816  727906944   3
4:  teaching_hospital_ccn integer   1289 231.39 MB 242635240  970542184   4
5:   teaching_hospital_id integer   6679 231.39 MB 242635240 1213177424   5
6: teaching_hospital_name  factor   3817 231.73 MB 242991336 1456168760   6
                                                     cname_f type_f cats_f
1:                       associated_drug_or_biological_ndc_4 factor    481
2:                         covered_or_noncovered_indicator_5 factor      3
3: indicate_drug_or_biological_or_device_or_medical_supply_5 factor      5
4:                    product_category_or_therapeutic_area_5 factor    476
5:  name_of_drug_or_biological_or_device_or_medical_supply_5 factor   2126
6:                       associated_drug_or_biological_ndc_5 factor    278
        sz_f     osz_f    cumosz_f I_f
1: 231.43 MB 242670184 22465400912  87
2:  231.4 MB 242635760 22708036672  88
3:  231.4 MB 242635896 22950672568  89
4: 231.43 MB 242673936 23193346504  90
5: 231.56 MB 242805336 23436151840  91
6: 231.41 MB 242655568 23678807408  92
84.96 sec elapsed
 
Now the character attribute version. Note the order of magnitude difference in execution times.
In [15]:
tic()
colszs = colsize(cmsgs)
setnames(colszs,paste0(names(colszs),"_s"))
prhead(colszs)
toc()
blanks(2)
                  cname_s    type_s cats_s      sz_s     osz_s   cumosz_s I_s
1:                   file   integer      6 231.39 MB 242635240  242635240   1
2:            change_type character      4 462.79 MB 485270656  727905896   2
3: covered_recipient_type character      2 462.79 MB 485270600 1213176496   3
4:  teaching_hospital_ccn   integer   1289 231.39 MB 242635240 1455811736   4
5:   teaching_hospital_id   integer   6679 231.39 MB 242635240 1698446976   5
6: teaching_hospital_name character   3817  463.1 MB 485595664 2184042640   6
                                                     cname_s    type_s cats_s
1:                       associated_drug_or_biological_ndc_4 character    481
2:                         covered_or_noncovered_indicator_5 character      3
3: indicate_drug_or_biological_or_device_or_medical_supply_5 character      5
4:                    product_category_or_therapeutic_area_5 character    476
5:  name_of_drug_or_biological_or_device_or_medical_supply_5 character   2126
6:                       associated_drug_or_biological_ndc_5 character    278
        sz_s     osz_s    cumosz_s I_s
1: 462.82 MB 485301200 40871210024  87
2: 462.79 MB 485270600 41356480624  88
3: 462.79 MB 485270720 41841751344  89
4: 462.82 MB 485304992 42327056336  90
5: 462.94 MB 485423192 42812479528  91
6: 462.81 MB 485288208 43297767736  92
1498.73 sec elapsed
 
Join the two data.tables and display. What’s striking is the difference in storage between character and factor, with a marked advantage for factor — even among attributes that assume tens or hundreds of thousands of distinct values. The difference in memory between the character data.table version (40 GB) and the factor version (22 GB) is striking.
In [16]:
colszall <- inner_join(colszf,colszs,by=c("cname_f"="cname_s")) %>% data.table   
meta(colszall)
blanks(2)
|name    |class                        |rows|columns|size    |
|:-------|:----------------------------|:---|:------|:-------|
|colszall|c("data.table", "data.frame")|92  |13     |23.26 KB|
Classes 'data.table' and 'data.frame':	92 obs. of  13 variables:
 $ cname_f : chr  "file" "change_type" "covered_recipient_type" "teaching_hospital_ccn" ...
 $ type_f  : chr  "integer" "factor" "factor" "integer" ...
 $ cats_f  : int  6 4 2 1289 6679 3817 1032290 120912 108105 378249 ...
 $ sz_f    : chr  "231.39 MB" "231.4 MB" "231.4 MB" "231.39 MB" ...
 $ osz_f   : num  242635240 242635888 242635816 242635240 242635240 ...
 $ cumosz_f: num  242635240 485271128 727906944 970542184 1213177424 ...
 $ I_f     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ type_s  : chr  "integer" "character" "character" "integer" ...
 $ cats_s  : int  6 4 2 1289 6679 3817 1032290 120912 108105 378249 ...
 $ sz_s    : chr  "231.39 MB" "462.79 MB" "462.79 MB" "231.39 MB" ...
 $ osz_s   : num  242635240 485270656 485270600 242635240 242635240 ...
 $ cumosz_s: num  242635240 727905896 1213176496 1455811736 1698446976 ...
 $ I_s     : int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr> 
NULL
 
In [17]:
print(mykab(colszall[,.(I_f,cname_f,cats_f,type_f,sz_f,type_s,sz_s)]))
blanks(2)
|I_f|cname_f                                                          |cats_f  |type_f   |sz_f     |type_s   |sz_s     |
|:--|:----------------------------------------------------------------|:-------|:--------|:--------|:--------|:--------|
|1  |file                                                             |6       |integer  |231.39 MB|integer  |231.39 MB|
|2  |change_type                                                      |4       |factor   |231.4 MB |character|462.79 MB|
|3  |covered_recipient_type                                           |2       |factor   |231.4 MB |character|462.79 MB|
|4  |teaching_hospital_ccn                                            |1289    |integer  |231.39 MB|integer  |231.39 MB|
|5  |teaching_hospital_id                                             |6679    |integer  |231.39 MB|integer  |231.39 MB|
|6  |teaching_hospital_name                                           |3817    |factor   |231.73 MB|character|463.1 MB |
|7  |physician_profile_id                                             |1032290 |integer  |231.39 MB|integer  |231.39 MB|
|8  |physician_first_name                                             |120912  |factor   |239.11 MB|character|469.58 MB|
|9  |physician_middle_name                                            |108105  |factor   |238.32 MB|character|468.89 MB|
|10 |physician_last_name                                              |378249  |factor   |256.02 MB|character|484.53 MB|
|11 |physician_name_suffix                                            |271     |factor   |231.41 MB|character|462.8 MB |
|12 |recipient_primary_business_street_address_line1                  |1536978 |factor   |355.83 MB|character|575.5 MB |
|13 |recipient_primary_business_street_address_line2                  |271707  |factor   |252.64 MB|character|481.97 MB|
|14 |recipient_city                                                   |35417   |factor   |233.79 MB|character|464.91 MB|
|15 |recipient_state                                                  |65      |factor   |231.4 MB |character|462.79 MB|
|16 |recipient_zip_code                                               |366610  |factor   |256.37 MB|character|484.96 MB|
|17 |recipient_country                                                |53      |factor   |231.4 MB |character|462.79 MB|
|18 |recipient_province                                               |169     |factor   |231.41 MB|character|462.8 MB |
|19 |recipient_postal_code                                            |443     |factor   |231.42 MB|character|462.81 MB|
|20 |physician_primary_type                                           |7       |factor   |231.4 MB |character|462.79 MB|
|21 |physician_specialty                                              |563     |factor   |231.48 MB|character|462.87 MB|
|22 |physician_license_state_code1                                    |61      |factor   |231.4 MB |character|462.79 MB|
|23 |physician_license_state_code2                                    |59      |factor   |231.4 MB |character|462.79 MB|
|24 |physician_license_state_code3                                    |57      |factor   |231.4 MB |character|462.79 MB|
|25 |physician_license_state_code4                                    |53      |factor   |231.4 MB |character|462.79 MB|
|26 |physician_license_state_code5                                    |48      |factor   |231.4 MB |character|462.79 MB|
|27 |submitting_applicable_manufacturer_or_applicable_gpo_name        |2513    |factor   |231.6 MB |character|462.98 MB|
|28 |applicable_manufacturer_or_applicable_gpo_making_payment_id      |2390    |integer64|462.79 MB|integer64|462.79 MB|
|29 |applicable_manufacturer_or_applicable_gpo_making_payment_name    |2705    |factor   |231.62 MB|character|462.99 MB|
|30 |applicable_manufacturer_or_applicable_gpo_making_payment_state   |51      |factor   |231.4 MB |character|462.79 MB|
|31 |applicable_manufacturer_or_applicable_gpo_making_payment_country |51      |factor   |231.4 MB |character|462.79 MB|
|32 |total_amount_of_payment_usdollars                                |270670  |numeric  |462.79 MB|numeric  |462.79 MB|
|33 |date_of_payment                                                  |1995    |Date     |462.79 MB|Date     |462.79 MB|
|34 |number_of_payments_included_in_total_amount                      |163     |integer  |231.39 MB|integer  |231.39 MB|
|35 |form_of_payment_or_transfer_of_value                             |7       |factor   |231.4 MB |character|462.79 MB|
|36 |nature_of_payment_or_transfer_of_value                           |15      |factor   |231.4 MB |character|462.79 MB|
|37 |city_of_travel                                                   |25492   |factor   |233.12 MB|character|464.32 MB|
|38 |state_of_travel                                                  |62      |factor   |231.4 MB |character|462.79 MB|
|39 |country_of_travel                                                |158     |factor   |231.41 MB|character|462.8 MB |
|40 |physician_ownership_indicator                                    |3       |factor   |231.4 MB |character|462.79 MB|
|41 |third_party_payment_recipient_indicator                          |3       |factor   |231.4 MB |character|462.79 MB|
|42 |name_of_third_party_entity_receiving_payment_or_transfer_of_value|54991   |factor   |236.04 MB|character|467.01 MB|
|43 |charity_indicator                                                |3       |factor   |231.4 MB |character|462.79 MB|
|44 |third_party_equals_covered_recipient_indicator                   |3       |factor   |231.4 MB |character|462.79 MB|
|45 |contextual_information                                           |1310350 |factor   |373.01 MB|character|594.4 MB |
|46 |delay_in_publication_indicator                                   |1       |factor   |231.4 MB |character|462.79 MB|
|47 |record_id                                                        |60658797|integer  |231.39 MB|integer  |231.39 MB|
|48 |dispute_status_for_publication                                   |2       |factor   |231.4 MB |character|462.79 MB|
|49 |product_indicator                                                |5       |factor   |231.4 MB |character|462.79 MB|
|50 |name_of_associated_covered_drug_or_biological1                   |4284    |factor   |231.73 MB|character|463.09 MB|
|51 |name_of_associated_covered_drug_or_biological2                   |1514    |factor   |231.5 MB |character|462.88 MB|
|52 |name_of_associated_covered_drug_or_biological3                   |1044    |factor   |231.47 MB|character|462.86 MB|
|53 |name_of_associated_covered_drug_or_biological4                   |633     |factor   |231.44 MB|character|462.83 MB|
|54 |name_of_associated_covered_drug_or_biological5                   |363     |factor   |231.42 MB|character|462.81 MB|
|55 |ndc_of_associated_covered_drug_or_biological1                    |7626    |factor   |231.92 MB|character|463.26 MB|
|56 |ndc_of_associated_covered_drug_or_biological2                    |1237    |factor   |231.48 MB|character|462.87 MB|
|57 |ndc_of_associated_covered_drug_or_biological3                    |902     |factor   |231.46 MB|character|462.84 MB|
|58 |ndc_of_associated_covered_drug_or_biological4                    |574     |factor   |231.43 MB|character|462.82 MB|
|59 |ndc_of_associated_covered_drug_or_biological5                    |322     |factor   |231.42 MB|character|462.81 MB|
|60 |name_of_associated_covered_device_or_medical_supply1             |11113   |factor   |232.31 MB|character|463.62 MB|
|61 |name_of_associated_covered_device_or_medical_supply2             |2728    |factor   |231.6 MB |character|462.98 MB|
|62 |name_of_associated_covered_device_or_medical_supply3             |1692    |factor   |231.52 MB|character|462.91 MB|
|63 |name_of_associated_covered_device_or_medical_supply4             |996     |factor   |231.47 MB|character|462.86 MB|
|64 |name_of_associated_covered_device_or_medical_supply5             |545     |factor   |231.44 MB|character|462.83 MB|
|65 |program_year                                                     |6       |integer  |231.39 MB|integer  |231.39 MB|
|66 |payment_publication_date                                         |1       |Date     |462.79 MB|Date     |462.79 MB|
|67 |related_product_indicator                                        |3       |factor   |231.4 MB |character|462.79 MB|
|68 |covered_or_noncovered_indicator_1                                |3       |factor   |231.4 MB |character|462.79 MB|
|69 |indicate_drug_or_biological_or_device_or_medical_supply_1        |5       |factor   |231.4 MB |character|462.79 MB|
|70 |product_category_or_therapeutic_area_1                           |3758    |factor   |231.69 MB|character|463.05 MB|
|71 |name_of_drug_or_biological_or_device_or_medical_supply_1         |22888   |factor   |233.2 MB |character|464.42 MB|
|72 |associated_drug_or_biological_ndc_1                              |2107    |factor   |231.54 MB|character|462.92 MB|
|73 |covered_or_noncovered_indicator_2                                |3       |factor   |231.4 MB |character|462.79 MB|
|74 |indicate_drug_or_biological_or_device_or_medical_supply_2        |5       |factor   |231.4 MB |character|462.79 MB|
|75 |product_category_or_therapeutic_area_2                           |1166    |factor   |231.49 MB|character|462.87 MB|
|76 |name_of_drug_or_biological_or_device_or_medical_supply_2         |7009    |factor   |231.92 MB|character|463.26 MB|
|77 |associated_drug_or_biological_ndc_2                              |1097    |factor   |231.47 MB|character|462.86 MB|
|78 |covered_or_noncovered_indicator_3                                |3       |factor   |231.4 MB |character|462.79 MB|
|79 |indicate_drug_or_biological_or_device_or_medical_supply_3        |5       |factor   |231.4 MB |character|462.79 MB|
|80 |product_category_or_therapeutic_area_3                           |872     |factor   |231.46 MB|character|462.85 MB|
|81 |name_of_drug_or_biological_or_device_or_medical_supply_3         |5201    |factor   |231.78 MB|character|463.14 MB|
|82 |associated_drug_or_biological_ndc_3                              |791     |factor   |231.45 MB|character|462.84 MB|
|83 |covered_or_noncovered_indicator_4                                |3       |factor   |231.4 MB |character|462.79 MB|
|84 |indicate_drug_or_biological_or_device_or_medical_supply_4        |5       |factor   |231.4 MB |character|462.79 MB|
|85 |product_category_or_therapeutic_area_4                           |631     |factor   |231.44 MB|character|462.83 MB|
|86 |name_of_drug_or_biological_or_device_or_medical_supply_4         |3295    |factor   |231.64 MB|character|463.01 MB|
|87 |associated_drug_or_biological_ndc_4                              |481     |factor   |231.43 MB|character|462.82 MB|
|88 |covered_or_noncovered_indicator_5                                |3       |factor   |231.4 MB |character|462.79 MB|
|89 |indicate_drug_or_biological_or_device_or_medical_supply_5        |5       |factor   |231.4 MB |character|462.79 MB|
|90 |product_category_or_therapeutic_area_5                           |476     |factor   |231.43 MB|character|462.82 MB|
|91 |name_of_drug_or_biological_or_device_or_medical_supply_5         |2126    |factor   |231.56 MB|character|462.94 MB|
|92 |associated_drug_or_biological_ndc_5                              |278     |factor   |231.41 MB|character|462.81 MB|
 

The findings above strongly suggest that for handling large data.tables in R factors should be preferred to character for text attributes. Not only is there a significant storage savings — even with high cardinality columns — but likely a performance benefit as well. Add to that the flexibility of different orderings for factors and the case becomes quite compelling.

In [ ]:
Steve Miller

Steve Miller

Steve is a data scientist focusing on research methods, data integration, statistics/ML, and computational analytics in R/python. Follow him on LinkedIn: https://www.linkedin.com/in/steve-miller-58ab881/

1