

A Data Pattern with an R data.table Solution.
Posted by Steve Miller, April 17, 2020

Summary: This blog examines a loading pattern seen often with government-generated, web-accessible data. The data comprise millions of records across multiple text or csv files, generally demarcated by time. The files may present different, but overlapping, attributes, while much of the data has a character representation, posing the challenge of how to optimally manifest it in R (or Python-Pandas) — either as character or “factor”. In the analysis that follows, I consider an R data.table example. The to-dos are to load files that don’t all possess the same attributes, and to optimize storage for character variables. The main finding is that factor representations of character variables should probably be preferred in cases like this one. Those with an intermediate level of R programming skill will best benefit from the code that follows.
I’m working currently with a large data set from the Centers for Medicare & Medicaid Services (CMS). A few years ago, I analyzed other CMS data that presented a similar pattern:
1) Multiple csv/text files delineated by time.
2) Preponderance of character/string data types, along with a smattering of dates, booleans, and numerics.
3) The various csv/text files may or may not have identical attributes, though there’s much commonality.
4) The data from the union of all columns is to be maintained, with NA’s assigned for missing values.
5) The combined data is often quite large by notebook standards — in excess of 5 GB.
6) And since R/Python-Pandas are memory intensive, it’s critical to optimize storage demands.
The challenge I take on below is to efficiently load and store the data from the six large Open Payments text files into a single R data.table, accommodating an overlapping but not identical attribute structure. In addition, I contrast storing character data attributes as either character or factor data types.
The factor is an R data type used to represent categories of data. The categories (levels) themselves may be any data type but are most often character or text. Factor values, on the other hand, are denoted by a vector of integers that “point” to the categories. An R example would be a variable gender, consisting of levels c("Female","Male"), represented by values c(1,2). One obvious advantage of factors over text data is that the labels are stored just once, potentially saving storage. A second advantage is that level ordering is flexible, determined by how the factor is constructed.
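To make the mechanics concrete, a minimal sketch with made-up data:
gender <- factor(c("Female","Male","Male","Female"), levels=c("Female","Male"))
levels(gender)       # "Female" "Male" -- the labels, stored once
as.integer(gender)   # 1 2 2 1 -- integer codes that point to the levels
table(gender)        # output honors the level ordering given at construction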
Customary guidance on the use of factors revolves around smallish data sizes (on the order of 100,000 records) and counsels using factors for attributes with cardinality (number of distinct values) in the hundreds or less. With the much larger sizes here, I can let the data do the talking.
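Before turning to the CMS data, a quick check on synthetic vectors (made-up labels; a 64-bit R build assumed) hints at what to expect:
x <- sample(c("alpha","beta","gamma","delta"), 1e6, replace=TRUE)
object.size(x)           # ~8 MB: an 8-byte pointer per element into R's string cache
object.size(factor(x))   # ~4 MB: 4-byte integer codes plus the four level labels
y <- sprintf("id%06d", sample(1e5, 1e6, replace=TRUE))   # high cardinality: 100,000 distinct values
object.size(y)           # pointers plus 100,000 cached strings
object.size(factor(y))   # integer codes plus the 100,000-element levels vector -- still smaller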
The supporting platform is a Wintel 10 notebook with 128 GB RAM, along with software JupyterLab 1.2.4 and R 3.6.2. The R data.table, tidyverse, pryr, plyr, and knitr packages are featured, as well as functions from my personal stash, detailed below.
Set options, load libraries, and import personal functions. Most of these functions are used in this exercise and revolve around generating metadata and counts/frequencies of R data.tables.
options(warn=-1)                       # suppress warning messages
options(scipen = 20)                   # discourage scientific notation in printing
options(datatable.print.topn=100)      # print up to 100 rows from each end of a data.table
options(datatable.showProgress=FALSE)  # quiet fread progress meters
options(stringsAsFactors=TRUE)         # default character-to-factor conversion
usualsuspects <- c(
'tidyverse', 'data.table', 'pryr', 'plyr','readxl', 'bit',
'rvest', 'magrittr','lubridate','rlist', 'tictoc', 'skimr',
'fst','feather',
'knitr', 'kableExtra',
'ggplot2','RColorBrewer'
)
suppressMessages(invisible(lapply(usualsuspects, library, character.only = TRUE)))
funcsdir <- "/steve/r/functions"
funcsfile <- "rfunctions.r"
setwd(funcsdir)
source(funcsfile)
blanks(1)
lsf.str()
blanks(2)
dname <- "c:/bigdata/raw/medicare/cms"
setwd(dname)
fname <- c("OP_DTL_GNRL_PGYR2013_P01172020.csv","OP_DTL_GNRL_PGYR2014_P01172020.csv",
"OP_DTL_GNRL_PGYR2015_P01172020.csv","OP_DTL_GNRL_PGYR2016_P01172020.csv",
"OP_DTL_GNRL_PGYR2017_P01172020.csv","OP_DTL_GNRL_PGYR2018_P01172020.csv")
blanks(2)
frdfnct <- function(f) tolower(names(fread(f, nrows=100)))   # read just enough rows to capture the column names
nmes <- lapply(fname, frdfnct)
lapply(nmes, length)
blanks(1)
# files 1-3 share one attribute layout; files 4-6 share another
setequal(nmes[[1]],nmes[[2]])
setequal(nmes[[1]],nmes[[3]])
setequal(nmes[[1]],nmes[[4]])
setequal(nmes[[4]],nmes[[5]])
setequal(nmes[[4]],nmes[[6]])
blanks(1)
# two equivalent counts of the union of column names across the two layouts
length(nmes[[1]]) + length(setdiff(nmes[[4]],nmes[[1]]))
length(setdiff(nmes[[4]],nmes[[1]])) + length(setdiff(nmes[[1]],nmes[[4]])) +
   length(intersect(nmes[[4]],nmes[[1]]))
blanks(2)
Load the data into a data.table with character attributes stored as factors. Over 60M records with 92 attributes totaling 22 GB RAM. 81 of the attributes are factors. Much of the load cycle is devoted to updating blanks to NA’s. A column present in files 1-3 but not in 4-6 generates NA’s for the missing values. Much of what follows details metadata about the loaded data.table.
tic()
# stack the six files, aligning columns by name and filling columns absent
# from a file with NA; stringsAsFactors=TRUE stores character data as factors
cmsgf <- rbindlist(lapply(fname,
   fread, stringsAsFactors=TRUE, na.strings=""), use.names=TRUE, fill=TRUE, idcol="file") %>%
   setnames(., tolower(names(.)))
# the date columns load as factors, so convert through character to Date
cmsgf[, c("date_of_payment","payment_publication_date") :=
   .(mdy(as.character(date_of_payment)), mdy(as.character(payment_publication_date)))]
dtmeta(cmsgf)
toc()
blanks(2)
tic()
fcmsgf <- allfreqs("cmsgf",catlim=5000)
meta(fcmsgf,data=TRUE)
blanks(1)
toc()
blanks(2)
prhead(mksumfreq(fcmsgf))
blanks(2)
List the first 6 attributes present in files 1-3 but not in 4-6. Then identify the first 6 attributes present in files 4-6 but not in 1-3.
head(setdiff(nmes[[1]],nmes[[4]]))
blanks(1)
head(setdiff(nmes[[4]],nmes[[1]]))
blanks(2)
Confirm NA’s by file for non-common columns using the flexible freqsdt function. Check.
mykab(freqsdt("cmsgf","file,product_indicator")[order(file)])
blanks(1)
mykab(freqsdt("cmsgf","file,related_product_indicator")[order(file)])
blanks(2)
Load the files a second time, now holding character attributes as strings (stringsAsFactors=FALSE) so the two representations can be compared.
tic()
cmsgs <- rbindlist(lapply(fname,
   fread, stringsAsFactors=FALSE, na.strings=""), use.names=TRUE, fill=TRUE, idcol="file") %>%
   setnames(., tolower(names(.)))
cmsgs[, c("date_of_payment","payment_publication_date") :=
   .(mdy(date_of_payment), mdy(payment_publication_date))]
dtmeta(cmsgs)
toc()
blanks(2)
Tally frequencies for each of the 92 attributes in the character strings version of the data.table. The elapsed time for this operation is double that of the factor version.
tic()
fcmsgs <- allfreqs("cmsgs",catlim=5000)
meta(fcmsgs,data=TRUE)
blanks(1)
toc()
blanks(2)
prhead(mksumfreq(fcmsgs))
blanks(2)
colsize <- function(dt)
{
   # for each column: name, class, distinct-value count, pretty-printed size
   # (obj_sz, from the personal function stash), and size in bytes (pryr::object_size);
   # then append cumulative bytes and a row index
   f <- function(nm) data.table(cname=nm, type=class(dt[[nm]]), cats=length(unique(dt[[nm]])),
      sz=obj_sz(dt[[nm]]), osz=object_size(dt[[nm]]))
   map_df(names(dt), f)[, c("cumosz","I") := .(cumsum(osz), .I)]
}
blanks(2)
First up, the factor data.table version. For each column, there’s the name, data type, number of distinct categories, pretty-printed size, size in bytes, and cumulative size in bytes.
tic()
colszf = colsize(cmsgf)
setnames(colszf,paste0(names(colszf),"_f"))
prhead(colszf)
toc()
blanks(2)
Next, the same accounting for the character-string version.
tic()
colszs = colsize(cmsgs)
setnames(colszs,paste0(names(colszs),"_s"))
prhead(colszs)
toc()
blanks(2)
colszall <- inner_join(colszf,colszs,by=c("cname_f"="cname_s")) %>% data.table
meta(colszall)
blanks(2)
print(mykab(colszall[,.(I_f,cname_f,cats_f,type_f,sz_f,type_s,sz_s)]))
blanks(2)
The findings above strongly suggest that for handling large data.tables in R, factors should be preferred to character for text attributes. Not only is there a significant storage savings, even with high-cardinality columns, but likely a performance benefit as well. Add to that the flexibility of different level orderings for factors and the case becomes quite compelling.
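Finally, for a table already loaded with character columns, conversion to factors in place is straightforward with data.table’s set(). A minimal sketch, with dt standing in for a hypothetical data.table like cmsgs:
# assumes dt is a data.table; converts each character column to factor in place, without copying
for (col in names(dt))
   if (is.character(dt[[col]])) set(dt, j=col, value=factor(dt[[col]]))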