Julia for Management/Analysis of Johns Hopkins COVID Data

Like many analytics geeks, I’ve been tracking data on the Covid pandemic since early spring. My source is the Center for Systems Science and Engineering at Johns Hopkins University, with files for download made available at midnight Central time. I’ve established a pretty significant R infrastructure in JupyterLab to summarize daily developments. The combination of R data.table and tidyverse programming capabilities has made the computation straightforward.

There are a few quirks with the data, though, that make it a suitable challenge for less mature data management platforms. The Hopkins data consists of two CSV files, the first detailing COVID cases by date in the U.S., the second U.S. COVID fatalities. Both files are organized similarly. The granularity is sub-state, generally representing counties or other administrative geographies. Each day's data adds a new column depicting cumulative cases/fatalities for the day. The downloaded data must first be pivoted or melted to transform the date columns into rows. Split/apply/combine computations within the administrative geography key, uid, then produce daily cases and fatalities. The cases and fatalities dataframes are finally joined to produce a final dataframe that's the source of analysis.
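In outline, the wrangling pattern looks like this. A minimal sketch with toy data and hypothetical values, assuming the DataFrames stack/groupby/combine API exercised later in the notebook:

using DataFrames
# toy wide file: one row per geography, one column per date holding cumulative counts
toycases = DataFrame(:uid => [1,2], Symbol("1/22/20") => [0,1],
                     Symbol("1/23/20") => [2,1], Symbol("1/24/20") => [5,4])
# melt the date columns into rows
toym = stack(toycases, 2:4, variable_name=:date, value_name=:cumcases)
sort!(toym, [:uid, :date])
# split/apply/combine within uid: daily counts are first differences of cumulatives
slug = combine(groupby(toym, :uid), :cumcases => (v -> [missing; diff(v)]) => :dailycases)
toym = hcat(toym, slug[:, [:dailycases]])
# an analogous fatalities dataframe would then be joined on (uid, date)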

For this blog, I investigate the handling of my Covid data in Julia, primarily using its DataFrames package. In subsequent posts, I'll consider Python/Pandas and the new Python datatable. The focus of this exercise is data analysis, consisting of basic wrangling plus the display of computations in both list and graphical modes. Heavier statistical/ML exercises will come in future blogs.

The code that follows first reads and munges daily cases and fatalities CSV files downloaded from the Hopkins Covid data portal, computing additional attributes such as running sums and moving averages. In the end, I display stats with both listings and graphics.

The results of this inquiry are encouraging. True, I ran into a few snarls along the programming journey, but I found Julia 1.5.2 much easier to work with than its predecessors, the result of both functionality enhancement/stabilization and documentation availability, especially tips on Stack Overflow. That's a welcome sign. The performance of Julia is acceptable, actually better than I'd experienced previously, though still lagging R and Python in JupyterLab. I suspect at least part of this deficit reflects that Windows and Jupyter are not ideal Julia staging platforms. In addition, I believe Julia shines more in intensive algorithmic computation than it does in data management/analysis.

[Related article: Jupyter Notebook: Python or R—Or Both?]

The software used below is JupyterLab 2.1.2, Julia 1.5.2, Python 3.7.5, and R 4.0.2. I'll continue investigating Julia for data analysis next time, with a special focus on interoperability with R and Python.

In [1]:
############################################################################################################
############################################################################################################
Define a few simple timing/printing macros/functions.
In [2]:
macro timend()
    # reads the global `start` set at the top of each timed cell; elapsed time
    # is computed when the macro expands, which suits sequential notebook execution
    stop = time()
    elapsed = stop-start
    println(elapsed)
end
println();println()
In [3]:
function pr(howmany)   
    for i = 1:howmany
        println()
    end   
end
pr(2)
In [4]:
function showln(obj)  
    show(obj)
    pr(2)
end
pr(2)

Load pertinent packages.

In [5]:
using DataFrames
using PyCall
using RCall
using RDatasets
using CSV
using DataFramesMeta
using Base
using Dates
using Feather
using Gadfly
using DataTables
using RollingFunctions
using Printf
pr(2)
[1] "C:/R/4.0.2/library"                 "C:/Program Files/R/R-4.0.2/library"


Define dataframe metadata and frequencies functions, plus a few convenience ones.

In [6]:
function metaj(df;data=false)
    # print the dataframe's type, dimensions, and column name/type pairs;
    # with data=true, also show the first and last 6 records
    println(DataFrames.typeof(df))
    println(DataFrames.size(df))
    println(hcat(names(df),eltypes(df)))
    if data
        print(DataFrames.first(df,6))
        pr(1)
        print(DataFrames.last(df,6))
    end
    pr(2)
    
end
pr(2)
In [7]:
function frequencies(df,vars)
    # frequency table: count and percent by the grouping vars, sorted descending by count
    freqs = combine(DataFrames.groupby(df, vars), nrow)
    freqs = DataFrames.rename(freqs, :nrow => :count);
    freqs[!,:percent] = 100*freqs.count/sum(freqs.count);
    sort!(freqs, [DataFrames.order(:count, rev = true)]);
    
    return(freqs)
    
end
pr(2)
In [8]:
# sum a vector, ignoring missing values
function sumna(vec) return(sum(skipmissing(vec))) end
pr(2)
In [9]:
function mknames(df)
    # normalize column names: lowercase, strip blanks and underscores
    from = names(df);
    to = [Symbol(replace(lowercase(string(f))," " => "")) for f in from];
    to = [Symbol(replace(lowercase(string(t)),"_" => "")) for t in to];
    return(to)
end
pr(2)

Download the latest U.S. Johns Hopkins COVID data. Note the repeating-group structure: each daily date column holds cumulative cases for the geography identified by the key UID.

In [10]:
start = time()
burl = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/";
fname = "csse_covid_19_time_series/time_series_covid19_confirmed_US.csv";
covidcases = CSV.read(download(string(burl,fname)));
metaj(covidcases)
@timend
pr(2)
DataFrame
(3340, 296)
Any["UID" Int64; "iso2" String; "iso3" String; "code3" Int64; "FIPS" Union{Missing, Float64}; "Admin2" Union{Missing, String}; "Province_State" String; "Country_Region" String; "Lat" Float64; "Long_" Float64; "Combined_Key" String; "1/22/20" Int64; "1/23/20" Int64; "1/24/20" Int64; "1/25/20" Int64; "1/26/20" Int64; "1/27/20" Int64; "1/28/20" Int64; "1/29/20" Int64; "1/30/20" Int64; "1/31/20" Int64; "2/1/20" Int64; "2/2/20" Int64; "2/3/20" Int64; "2/4/20" Int64; "2/5/20" Int64; "2/6/20" Int64; "2/7/20" Int64; "2/8/20" Int64; "2/9/20" Int64; "2/10/20" Int64; "2/11/20" Int64; "2/12/20" Int64; "2/13/20" Int64; "2/14/20" Int64; "2/15/20" Int64; "2/16/20" Int64; "2/17/20" Int64; "2/18/20" Int64; "2/19/20" Int64; "2/20/20" Int64; "2/21/20" Int64; "2/22/20" Int64; "2/23/20" Int64; "2/24/20" Int64; "2/25/20" Int64; "2/26/20" Int64; "2/27/20" Int64; "2/28/20" Int64; "2/29/20" Int64; "3/1/20" Int64; "3/2/20" Int64; "3/3/20" Int64; "3/4/20" Int64; "3/5/20" Int64; "3/6/20" Int64; "3/7/20" Int64; "3/8/20" Int64; "3/9/20" Int64; "3/10/20" Int64; "3/11/20" Int64; "3/12/20" Int64; "3/13/20" Int64; "3/14/20" Int64; "3/15/20" Int64; "3/16/20" Int64; "3/17/20" Int64; "3/18/20" Int64; "3/19/20" Int64; "3/20/20" Int64; "3/21/20" Int64; "3/22/20" Int64; "3/23/20" Int64; "3/24/20" Int64; "3/25/20" Int64; "3/26/20" Int64; "3/27/20" Int64; "3/28/20" Int64; "3/29/20" Int64; "3/30/20" Int64; "3/31/20" Int64; "4/1/20" Int64; "4/2/20" Int64; "4/3/20" Int64; "4/4/20" Int64; "4/5/20" Int64; "4/6/20" Int64; "4/7/20" Int64; "4/8/20" Int64; "4/9/20" Int64; "4/10/20" Int64; "4/11/20" Int64; "4/12/20" Int64; "4/13/20" Int64; "4/14/20" Int64; "4/15/20" Int64; "4/16/20" Int64; "4/17/20" Int64; "4/18/20" Int64; "4/19/20" Int64; "4/20/20" Int64; "4/21/20" Int64; "4/22/20" Int64; "4/23/20" Int64; "4/24/20" Int64; "4/25/20" Int64; "4/26/20" Int64; "4/27/20" Int64; "4/28/20" Int64; "4/29/20" Int64; "4/30/20" Int64; "5/1/20" Int64; "5/2/20" Int64; "5/3/20" Int64; "5/4/20" Int64; "5/5/20" Int64; "5/6/20" Int64; "5/7/20" Int64; "5/8/20" Int64; "5/9/20" Int64; "5/10/20" Int64; "5/11/20" Int64; "5/12/20" Int64; "5/13/20" Int64; "5/14/20" Int64; "5/15/20" Int64; "5/16/20" Int64; "5/17/20" Int64; "5/18/20" Int64; "5/19/20" Int64; "5/20/20" Int64; "5/21/20" Int64; "5/22/20" Int64; "5/23/20" Int64; "5/24/20" Int64; "5/25/20" Int64; "5/26/20" Int64; "5/27/20" Int64; "5/28/20" Int64; "5/29/20" Int64; "5/30/20" Int64; "5/31/20" Int64; "6/1/20" Int64; "6/2/20" Int64; "6/3/20" Int64; "6/4/20" Int64; "6/5/20" Int64; "6/6/20" Int64; "6/7/20" Int64; "6/8/20" Int64; "6/9/20" Int64; "6/10/20" Int64; "6/11/20" Int64; "6/12/20" Int64; "6/13/20" Int64; "6/14/20" Int64; "6/15/20" Int64; "6/16/20" Int64; "6/17/20" Int64; "6/18/20" Int64; "6/19/20" Int64; "6/20/20" Int64; "6/21/20" Int64; "6/22/20" Int64; "6/23/20" Int64; "6/24/20" Int64; "6/25/20" Int64; "6/26/20" Int64; "6/27/20" Int64; "6/28/20" Int64; "6/29/20" Int64; "6/30/20" Int64; "7/1/20" Int64; "7/2/20" Int64; "7/3/20" Int64; "7/4/20" Int64; "7/5/20" Int64; "7/6/20" Int64; "7/7/20" Int64; "7/8/20" Int64; "7/9/20" Int64; "7/10/20" Int64; "7/11/20" Int64; "7/12/20" Int64; "7/13/20" Int64; "7/14/20" Int64; "7/15/20" Int64; "7/16/20" Int64; "7/17/20" Int64; "7/18/20" Int64; "7/19/20" Int64; "7/20/20" Int64; "7/21/20" Int64; "7/22/20" Int64; "7/23/20" Int64; "7/24/20" Int64; "7/25/20" Int64; "7/26/20" Int64; "7/27/20" Int64; "7/28/20" Int64; "7/29/20" Int64; "7/30/20" Int64; "7/31/20" Int64; "8/1/20" Int64; "8/2/20" Int64; "8/3/20" Int64; "8/4/20" Int64; "8/5/20" Int64; "8/6/20" Int64; "8/7/20" Int64; 
"8/8/20" Int64; "8/9/20" Int64; "8/10/20" Int64; "8/11/20" Int64; "8/12/20" Int64; "8/13/20" Int64; "8/14/20" Int64; "8/15/20" Int64; "8/16/20" Int64; "8/17/20" Int64; "8/18/20" Int64; "8/19/20" Int64; "8/20/20" Int64; "8/21/20" Int64; "8/22/20" Int64; "8/23/20" Int64; "8/24/20" Int64; "8/25/20" Int64; "8/26/20" Int64; "8/27/20" Int64; "8/28/20" Int64; "8/29/20" Int64; "8/30/20" Int64; "8/31/20" Int64; "9/1/20" Int64; "9/2/20" Int64; "9/3/20" Int64; "9/4/20" Int64; "9/5/20" Int64; "9/6/20" Int64; "9/7/20" Int64; "9/8/20" Int64; "9/9/20" Int64; "9/10/20" Int64; "9/11/20" Int64; "9/12/20" Int64; "9/13/20" Int64; "9/14/20" Int64; "9/15/20" Int64; "9/16/20" Int64; "9/17/20" Int64; "9/18/20" Int64; "9/19/20" Int64; "9/20/20" Int64; "9/21/20" Int64; "9/22/20" Int64; "9/23/20" Int64; "9/24/20" Int64; "9/25/20" Int64; "9/26/20" Int64; "9/27/20" Int64; "9/28/20" Int64; "9/29/20" Int64; "9/30/20" Int64; "10/1/20" Int64; "10/2/20" Int64; "10/3/20" Int64; "10/4/20" Int64; "10/5/20" Int64; "10/6/20" Int64; "10/7/20" Int64; "10/8/20" Int64; "10/9/20" Int64; "10/10/20" Int64; "10/11/20" Int64; "10/12/20" Int64; "10/13/20" Int64; "10/14/20" Int64; "10/15/20" Int64; "10/16/20" Int64; "10/17/20" Int64; "10/18/20" Int64; "10/19/20" Int64; "10/20/20" Int64; "10/21/20" Int64; "10/22/20" Int64; "10/23/20" Int64; "10/24/20" Int64; "10/25/20" Int64; "10/26/20" Int64; "10/27/20" Int64; "10/28/20" Int64; "10/29/20" Int64; "10/30/20" Int64; "10/31/20" Int64; "11/1/20" Int64]
10.269000053405762
In [11]:
print(last(covidcases[:,[1,2,3,4,5,6,7,8]],15))
15×8 DataFrame
│ Row │ UID      │ iso2   │ iso3   │ code3 │ FIPS     │ Admin2     │ Province_State │ Country_Region │
│     │ Int64    │ String │ String │ Int64 │ Float64? │ String?    │ String         │ String         │
├─────┼──────────┼────────┼────────┼───────┼──────────┼────────────┼────────────────┼────────────────┤
│ 1   │ 84056021 │ US     │ USA    │ 840   │ 56021.0  │ Laramie    │ Wyoming        │ US             │
│ 2   │ 84056023 │ US     │ USA    │ 840   │ 56023.0  │ Lincoln    │ Wyoming        │ US             │
│ 3   │ 84056025 │ US     │ USA    │ 840   │ 56025.0  │ Natrona    │ Wyoming        │ US             │
│ 4   │ 84056027 │ US     │ USA    │ 840   │ 56027.0  │ Niobrara   │ Wyoming        │ US             │
│ 5   │ 84080056 │ US     │ USA    │ 840   │ 80056.0  │ Out of WY  │ Wyoming        │ US             │
│ 6   │ 84056029 │ US     │ USA    │ 840   │ 56029.0  │ Park       │ Wyoming        │ US             │
│ 7   │ 84056031 │ US     │ USA    │ 840   │ 56031.0  │ Platte     │ Wyoming        │ US             │
│ 8   │ 84056033 │ US     │ USA    │ 840   │ 56033.0  │ Sheridan   │ Wyoming        │ US             │
│ 9   │ 84056035 │ US     │ USA    │ 840   │ 56035.0  │ Sublette   │ Wyoming        │ US             │
│ 10  │ 84056037 │ US     │ USA    │ 840   │ 56037.0  │ Sweetwater │ Wyoming        │ US             │
│ 11  │ 84056039 │ US     │ USA    │ 840   │ 56039.0  │ Teton      │ Wyoming        │ US             │
│ 12  │ 84056041 │ US     │ USA    │ 840   │ 56041.0  │ Uinta      │ Wyoming        │ US             │
│ 13  │ 84090056 │ US     │ USA    │ 840   │ 90056.0  │ Unassigned │ Wyoming        │ US             │
│ 14  │ 84056043 │ US     │ USA    │ 840   │ 56043.0  │ Washakie   │ Wyoming        │ US             │
│ 15  │ 84056045 │ US     │ USA    │ 840   │ 56045.0  │ Weston     │ Wyoming        │ US             │

Pivot this initial dataframe so that dates/cumulative cases are represented in rows rather than columns. Change the “date” attribute from string to date and sort the dataframe by (uid,date).

In [12]:
start = time()
nc = ncol(covidcases);
covidcasesm = DataFrames.stack(covidcases,12:nc);
covidcasesm = DataFrames.rename(covidcasesm, :variable => :date);
covidcasesm = DataFrames.rename(covidcasesm, :value => :cumcases);
rename!(covidcasesm, mknames(covidcasesm)) 
df = DateFormat("mm/dd/yy");
# note: the two-digit "yy" parses as year 0020 (visible in the dates of later output);
# adding Dates.Year(2000) to each parsed date would shift them to 2020
covidcasesm[:date] = [Date(string(d),df) for d in covidcasesm[:date]]
sort!(covidcasesm, [:uid, :date]);
metaj(covidcasesm);
@timend
pr(2)
DataFrame
(951900, 13)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "date" Date; "cumcases" Int64]

5.414999961853027

Commence split/apply/combine processing, first grouping the cases dataframe by uid.

In [13]:
start = time()
grpvars = [:uid]
gdf = DataFrames.groupby(covidcasesm,grpvars);
@timend
pr(2)
0.6519999504089355
Define a simple function that computes dailycases as the difference in consecutive cumulative counts.
In [14]:
function diffsm(a)
   return cat([missing],diff(a),dims=1)
end
pr(2)
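A quick sanity check of diffsm on a toy cumulative vector (hypothetical numbers): cumulatives 5, 8, 12 should yield dailies missing, 3, 4.

showln(diffsm([5, 8, 12]))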

Compute dailycases and concatenate the new column to the just-melted dataframe by applying the diffsm function within uid groups.

In [15]:
start = time()
slug = combine(gdf,:cumcases => diffsm => :dailycases)
metaj(slug)
@timend
pr(2)
DataFrame
(951900, 2)
Any["uid" Int64; "dailycases" Union{Missing, Int64}]
2.8910000324249268

Bind the dailycases column to the dataframe.

In [16]:
start = time()
covidcasesm = hcat(covidcasesm[:,1:13],slug[[:dailycases]]);
metaj(covidcasesm)
@timend
pr(2)
DataFrame
(951900, 14)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "date" Date; "cumcases" Int64; "dailycases" Union{Missing, Int64}]
0.2049999237060547

Quick check on the total number of cases.

In [17]:
print(sumna(covidcasesm[:dailycases]))
pr(2)
9206974
Now perform the same sequence of steps with fatalities. First, load the unnormalized data.
In [18]:
start = time()
burl = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/";
fname = "time_series_covid19_deaths_US.csv";
coviddeaths = CSV.read(download(string(burl,fname)))
metaj(coviddeaths)
@timend
pr(2)
DataFrame
(3340, 297)
Any["UID" Int64; "iso2" String; "iso3" String; "code3" Int64; "FIPS" Union{Missing, Float64}; "Admin2" Union{Missing, String}; "Province_State" String; "Country_Region" String; "Lat" Float64; "Long_" Float64; "Combined_Key" String; "Population" Int64; "1/22/20" Int64; "1/23/20" Int64; "1/24/20" Int64; "1/25/20" Int64; "1/26/20" Int64; "1/27/20" Int64; "1/28/20" Int64; "1/29/20" Int64; "1/30/20" Int64; "1/31/20" Int64; "2/1/20" Int64; "2/2/20" Int64; "2/3/20" Int64; "2/4/20" Int64; "2/5/20" Int64; "2/6/20" Int64; "2/7/20" Int64; "2/8/20" Int64; "2/9/20" Int64; "2/10/20" Int64; "2/11/20" Int64; "2/12/20" Int64; "2/13/20" Int64; "2/14/20" Int64; "2/15/20" Int64; "2/16/20" Int64; "2/17/20" Int64; "2/18/20" Int64; "2/19/20" Int64; "2/20/20" Int64; "2/21/20" Int64; "2/22/20" Int64; "2/23/20" Int64; "2/24/20" Int64; "2/25/20" Int64; "2/26/20" Int64; "2/27/20" Int64; "2/28/20" Int64; "2/29/20" Int64; "3/1/20" Int64; "3/2/20" Int64; "3/3/20" Int64; "3/4/20" Int64; "3/5/20" Int64; "3/6/20" Int64; "3/7/20" Int64; "3/8/20" Int64; "3/9/20" Int64; "3/10/20" Int64; "3/11/20" Int64; "3/12/20" Int64; "3/13/20" Int64; "3/14/20" Int64; "3/15/20" Int64; "3/16/20" Int64; "3/17/20" Int64; "3/18/20" Int64; "3/19/20" Int64; "3/20/20" Int64; "3/21/20" Int64; "3/22/20" Int64; "3/23/20" Int64; "3/24/20" Int64; "3/25/20" Int64; "3/26/20" Int64; "3/27/20" Int64; "3/28/20" Int64; "3/29/20" Int64; "3/30/20" Int64; "3/31/20" Int64; "4/1/20" Int64; "4/2/20" Int64; "4/3/20" Int64; "4/4/20" Int64; "4/5/20" Int64; "4/6/20" Int64; "4/7/20" Int64; "4/8/20" Int64; "4/9/20" Int64; "4/10/20" Int64; "4/11/20" Int64; "4/12/20" Int64; "4/13/20" Int64; "4/14/20" Int64; "4/15/20" Int64; "4/16/20" Int64; "4/17/20" Int64; "4/18/20" Int64; "4/19/20" Int64; "4/20/20" Int64; "4/21/20" Int64; "4/22/20" Int64; "4/23/20" Int64; "4/24/20" Int64; "4/25/20" Int64; "4/26/20" Int64; "4/27/20" Int64; "4/28/20" Int64; "4/29/20" Int64; "4/30/20" Int64; "5/1/20" Int64; "5/2/20" Int64; "5/3/20" Int64; "5/4/20" Int64; "5/5/20" Int64; "5/6/20" Int64; "5/7/20" Int64; "5/8/20" Int64; "5/9/20" Int64; "5/10/20" Int64; "5/11/20" Int64; "5/12/20" Int64; "5/13/20" Int64; "5/14/20" Int64; "5/15/20" Int64; "5/16/20" Int64; "5/17/20" Int64; "5/18/20" Int64; "5/19/20" Int64; "5/20/20" Int64; "5/21/20" Int64; "5/22/20" Int64; "5/23/20" Int64; "5/24/20" Int64; "5/25/20" Int64; "5/26/20" Int64; "5/27/20" Int64; "5/28/20" Int64; "5/29/20" Int64; "5/30/20" Int64; "5/31/20" Int64; "6/1/20" Int64; "6/2/20" Int64; "6/3/20" Int64; "6/4/20" Int64; "6/5/20" Int64; "6/6/20" Int64; "6/7/20" Int64; "6/8/20" Int64; "6/9/20" Int64; "6/10/20" Int64; "6/11/20" Int64; "6/12/20" Int64; "6/13/20" Int64; "6/14/20" Int64; "6/15/20" Int64; "6/16/20" Int64; "6/17/20" Int64; "6/18/20" Int64; "6/19/20" Int64; "6/20/20" Int64; "6/21/20" Int64; "6/22/20" Int64; "6/23/20" Int64; "6/24/20" Int64; "6/25/20" Int64; "6/26/20" Int64; "6/27/20" Int64; "6/28/20" Int64; "6/29/20" Int64; "6/30/20" Int64; "7/1/20" Int64; "7/2/20" Int64; "7/3/20" Int64; "7/4/20" Int64; "7/5/20" Int64; "7/6/20" Int64; "7/7/20" Int64; "7/8/20" Int64; "7/9/20" Int64; "7/10/20" Int64; "7/11/20" Int64; "7/12/20" Int64; "7/13/20" Int64; "7/14/20" Int64; "7/15/20" Int64; "7/16/20" Int64; "7/17/20" Int64; "7/18/20" Int64; "7/19/20" Int64; "7/20/20" Int64; "7/21/20" Int64; "7/22/20" Int64; "7/23/20" Int64; "7/24/20" Int64; "7/25/20" Int64; "7/26/20" Int64; "7/27/20" Int64; "7/28/20" Int64; "7/29/20" Int64; "7/30/20" Int64; "7/31/20" Int64; "8/1/20" Int64; "8/2/20" Int64; "8/3/20" Int64; "8/4/20" Int64; "8/5/20" Int64; "8/6/20" 
Int64; "8/7/20" Int64; "8/8/20" Int64; "8/9/20" Int64; "8/10/20" Int64; "8/11/20" Int64; "8/12/20" Int64; "8/13/20" Int64; "8/14/20" Int64; "8/15/20" Int64; "8/16/20" Int64; "8/17/20" Int64; "8/18/20" Int64; "8/19/20" Int64; "8/20/20" Int64; "8/21/20" Int64; "8/22/20" Int64; "8/23/20" Int64; "8/24/20" Int64; "8/25/20" Int64; "8/26/20" Int64; "8/27/20" Int64; "8/28/20" Int64; "8/29/20" Int64; "8/30/20" Int64; "8/31/20" Int64; "9/1/20" Int64; "9/2/20" Int64; "9/3/20" Int64; "9/4/20" Int64; "9/5/20" Int64; "9/6/20" Int64; "9/7/20" Int64; "9/8/20" Int64; "9/9/20" Int64; "9/10/20" Int64; "9/11/20" Int64; "9/12/20" Int64; "9/13/20" Int64; "9/14/20" Int64; "9/15/20" Int64; "9/16/20" Int64; "9/17/20" Int64; "9/18/20" Int64; "9/19/20" Int64; "9/20/20" Int64; "9/21/20" Int64; "9/22/20" Int64; "9/23/20" Int64; "9/24/20" Int64; "9/25/20" Int64; "9/26/20" Int64; "9/27/20" Int64; "9/28/20" Int64; "9/29/20" Int64; "9/30/20" Int64; "10/1/20" Int64; "10/2/20" Int64; "10/3/20" Int64; "10/4/20" Int64; "10/5/20" Int64; "10/6/20" Int64; "10/7/20" Int64; "10/8/20" Int64; "10/9/20" Int64; "10/10/20" Int64; "10/11/20" Int64; "10/12/20" Int64; "10/13/20" Int64; "10/14/20" Int64; "10/15/20" Int64; "10/16/20" Int64; "10/17/20" Int64; "10/18/20" Int64; "10/19/20" Int64; "10/20/20" Int64; "10/21/20" Int64; "10/22/20" Int64; "10/23/20" Int64; "10/24/20" Int64; "10/25/20" Int64; "10/26/20" Int64; "10/27/20" Int64; "10/28/20" Int64; "10/29/20" Int64; "10/30/20" Int64; "10/31/20" Int64; "11/1/20" Int64]

0.5740001201629639

Next, pivot columns to rows as above.

In [19]:
nc = ncol(coviddeaths);
coviddeathsm = DataFrames.stack(coviddeaths,13:nc);
coviddeathsm = DataFrames.rename(coviddeathsm, :variable => :date);
coviddeathsm = DataFrames.rename(coviddeathsm, :value => :cumdeaths);
rename!(coviddeathsm, mknames(coviddeathsm)) 
df = DateFormat("mm/dd/yy");
# same two-digit year caveat as with cases
coviddeathsm[:date] = [Date(string(d),df) for d in coviddeathsm[:date]]
sort!(coviddeathsm, [:uid, :date]);
metaj(coviddeathsm)
pr(2)
DataFrame
(951900, 14)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "population" Int64; "date" Date; "cumdeaths" Int64]

As with cases, group by uid and compute row differences of cumulative deaths by date within each uid to instantiate dailydeaths.
In [20]:
grpvars = [:uid]
gdf = DataFrames.groupby(coviddeathsm,grpvars);
pr(2)
In [21]:
slug = combine(gdf,:cumdeaths => diffsm => :dailydeaths)
metaj(slug)
pr(2)
DataFrame
(951900, 2)
Any["uid" Int64; "dailydeaths" Union{Missing, Int64}]

Column bind.

In [22]:
coviddeathsm = hcat(coviddeathsm[:,1:14],slug[[:dailydeaths]]);
metaj(coviddeathsm)
pr(2)
DataFrame
(951900, 15)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "population" Int64; "date" Date; "cumdeaths" Int64; "dailydeaths" Union{Missing, Int64}]

The latest cumulative fatalities according to Johns Hopkins COVID Data ...
In [23]:
print(sumna(coviddeathsm.dailydeaths))
pr(2)
230995

Join the cases and deaths dataframes by (uid,date).

In [24]:
start = time()
covidfinal = innerjoin(covidcasesm,coviddeathsm[[:uid,:date,:cumdeaths,:dailydeaths]], on = [:uid => :uid, :date => :date]);
metaj(covidfinal)
@timend
pr(2)
DataFrame
(951900, 16)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "date" Date; "cumcases" Int64; "dailycases" Union{Missing, Int64}; "cumdeaths" Int64; "dailydeaths" Union{Missing, Int64}]
2.496000051498413
Confirm total cases and fatalities.
In [25]:
println(sumna(covidfinal.dailycases))
print(sumna(covidfinal[:dailydeaths]))
pr(2)
9206974
230995


Define a function to compute rolling or moving averages with dataframe count columns. I’ll consider alternatives later.

In [26]:
function masm_1(a;window=14)
    # trailing moving average: the first window-1 positions are missing; thereafter,
    # the rounded mean of the current and prior window-1 values
    a = convert(Array{Float64,1}, a)
    lena = length(a);
    b = similar(a)
    for j in lena:-1:window
        summ = 0
        for jj in j:-1:(j-window+1)
            summ += a[jj]
        end
        b[j] = round(summ/window)
    end
    return(cat(fill(missing,window-1),b[window:lena],dims=1))
end
pr(2)
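A quick illustration of masm_1 with hypothetical numbers, using a 3-day window:

# expect [missing, missing, 2.0, 3.0, 4.0]
showln(masm_1([1,2,3,4,5], window=3))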
Use the “by” statement to create a summarized covidfinal dataframe by date, adding day of the week, along with cumulative sums and moving averages of daily cases and fatalities. The last 14 days are ominous.
In [27]:
ustotal = by(covidfinal, [:date], :dailycases => sumna => :dailycases, :dailydeaths => sumna => :dailydeaths);
ustotal["dayofweek"] = Dates.format.(ustotal.date,"e")
ustotal[:cumcases] = cumsum(ustotal.dailycases)
ustotal[:cumdeaths] = cumsum(ustotal.dailydeaths)
ustotal[:macases7] = masm_1(ustotal[:dailycases],window=7);
ustotal[:madeaths7] = masm_1(ustotal[:dailydeaths],window=7);
ustotal[:macases14] = masm_1(ustotal[:dailycases],window=14);
ustotal[:madeaths14] = masm_1(ustotal[:dailydeaths],window=14);
metaj(ustotal)
pr(2)
print(last(ustotal,14))
pr(2)
DataFrame
(285, 10)
Any["date" Date; "dailycases" Int64; "dailydeaths" Int64; "dayofweek" String; "cumcases" Int64; "cumdeaths" Int64; "macases7" Union{Missing, Float64}; "madeaths7" Union{Missing, Float64}; "macases14" Union{Missing, Float64}; "madeaths14" Union{Missing, Float64}]

14×10 DataFrame
│ Row │ date       │ dailycases │ dailydeaths │ dayofweek │ cumcases │ cumdeaths │ macases7 │ madeaths7 │ macases14 │ madeaths14 │
│     │ Date       │ Int64      │ Int64       │ String    │ Int64    │ Int64     │ Float64? │ Float64?  │ Float64?  │ Float64?   │
├─────┼────────────┼────────────┼─────────────┼───────────┼──────────┼───────────┼──────────┼───────────┼───────────┼────────────┤
│ 1   │ 0020-10-19 │ 58395      │ 445         │ Mon       │ 8214508  │ 220122    │ 58425.0  │ 719.0     │ 53991.0   │ 710.0      │
│ 2   │ 0020-10-20 │ 60355      │ 933         │ Tue       │ 8274863  │ 221055    │ 59554.0  │ 738.0     │ 55305.0   │ 726.0      │
│ 3   │ 0020-10-21 │ 62776      │ 1125        │ Wed       │ 8337639  │ 222180    │ 60017.0  │ 758.0     │ 56194.0   │ 741.0      │
│ 4   │ 0020-10-22 │ 71701      │ 856         │ Thu       │ 8409340  │ 223036    │ 61166.0  │ 763.0     │ 57299.0   │ 734.0      │
│ 5   │ 0020-10-23 │ 83747      │ 943         │ Fri       │ 8493087  │ 223979    │ 63249.0  │ 772.0     │ 59178.0   │ 730.0      │
│ 6   │ 0020-10-24 │ 83730      │ 914         │ Sat       │ 8576817  │ 224893    │ 66990.0  │ 801.0     │ 61255.0   │ 751.0      │
│ 7   │ 0020-10-25 │ 60807      │ 340         │ Sun       │ 8637624  │ 225233    │ 68787.0  │ 794.0     │ 62412.0   │ 747.0      │
│ 8   │ 0020-10-26 │ 66798      │ 477         │ Mon       │ 8704422  │ 225710    │ 69988.0  │ 798.0     │ 64206.0   │ 759.0      │
│ 9   │ 0020-10-27 │ 73632      │ 986         │ Tue       │ 8778054  │ 226696    │ 71884.0  │ 806.0     │ 65719.0   │ 772.0      │
│ 10  │ 0020-10-28 │ 78358      │ 989         │ Wed       │ 8856412  │ 227685    │ 74110.0  │ 786.0     │ 67064.0   │ 772.0      │
│ 11  │ 0020-10-29 │ 88521      │ 971         │ Thu       │ 8944933  │ 228656    │ 76513.0  │ 803.0     │ 68840.0   │ 783.0      │
│ 12  │ 0020-10-30 │ 99321      │ 1030        │ Fri       │ 9044254  │ 229686    │ 78738.0  │ 815.0     │ 70994.0   │ 793.0      │
│ 13  │ 0020-10-31 │ 81227      │ 862         │ Sat       │ 9125481  │ 230548    │ 78381.0  │ 808.0     │ 72685.0   │ 804.0      │
│ 14  │ 0020-11-01 │ 81493      │ 447         │ Sun       │ 9206974  │ 230995    │ 81336.0  │ 823.0     │ 75062.0   │ 808.0      │
Why the interest in rolling/moving case/fatality averages? Summarize by day of the week and note the lower numbers on Sun-Mon. This reflects both a lag in reporting and weekend undercounts. Better to smooth over these rough edges with moving averages.
In [28]:
usdayofweek = by(ustotal, [:dayofweek], :dailycases => sumna => :dailycases, :dailydeaths => sumna => :dailydeaths);
usdayofweek["pctcases"] = round.(100*usdayofweek.dailycases/sumna(usdayofweek.dailycases);digits=2);
usdayofweek["pctdeaths"] = round.(100*usdayofweek.dailydeaths/sumna(usdayofweek.dailydeaths);digits=2);
print(hcat(frequencies(ustotal,["dayofweek"])[1:2],usdayofweek[[2,4,3,5]]))
pr(2)
7×6 DataFrame
│ Row │ dayofweek │ count │ dailycases │ pctcases │ dailydeaths │ pctdeaths │
│     │ String    │ Int64 │ Int64      │ Float64  │ Int64       │ Float64   │
├─────┼───────────┼───────┼────────────┼──────────┼─────────────┼───────────┤
│ 1   │ Wed       │ 41    │ 1316918    │ 14.3     │ 41693       │ 18.05     │
│ 2   │ Thu       │ 41    │ 1428144    │ 15.51    │ 38222       │ 16.55     │
│ 3   │ Fri       │ 41    │ 1529602    │ 16.61    │ 37455       │ 16.21     │
│ 4   │ Sat       │ 41    │ 1380229    │ 14.99    │ 31917       │ 13.82     │
│ 5   │ Sun       │ 41    │ 1178217    │ 12.8     │ 19792       │ 8.57      │
│ 6   │ Mon       │ 40    │ 1127424    │ 12.25    │ 22645       │ 9.8       │
│ 7   │ Tue       │ 40    │ 1246440    │ 13.54    │ 39271       │ 17.0      │

Save covidfinal to a portable feather file for subsequent use in Julia, Python, and R.

In [29]:
wdir = "/data/covid19/julia"
cd(wdir)
fname = "covidfinal.feather"
Feather.write(fname,covidfinal)
pr(2)
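As a quick sanity check, the file can be read back with Feather.read to confirm the round trip. A sketch; the dimensions should match covidfinal above:

covidcheck = Feather.read(fname)
println(size(covidcheck))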

Perform calculations used in subsequent graphs.

In [30]:
tit = @sprintf "%s%s to %s\nCases: %s; Deaths: %s\n" "Covid 19 in the U.S.\n" first(ustotal).date last(ustotal).date sumna(ustotal.dailycases) sumna(ustotal.dailydeaths)
annodate = last(ustotal).date + Dates.Day(15)
annodatestr = string(last(ustotal).date)
annocases = last(ustotal).macases7
annocasesstr = last(ustotal).macases7
annodeaths = last(ustotal).madeaths14
annodeathsstr = last(ustotal).madeaths14
pr(2)

Produce Gadfly graphs of case and fatality moving averages.

In [31]:
set_default_plot_size(12cm, 12cm)
Gadfly.push_theme(:dark)
pr(2)
In [32]:
g = Gadfly.plot(ustotal, x=:date, y=:macases7,
    Gadfly.Geom.line,
    Gadfly.Scale.y_continuous(minvalue=0,format=:plain),
    Guide.XLabel("Month"),
    Guide.YLabel("7 day MA cases"),
    Guide.Title(tit))
g
Out[32]:
[Gadfly line chart: 7 day MA cases vs. Month. Title: Covid 19 in the U.S., 0020-01-22 to 0020-11-01; Cases: 9206974; Deaths: 230995]
In [33]:
g = Gadfly.plot(ustotal, x=:date, y=:madeaths14, 
    Gadfly.Geom.line,
    Gadfly.Scale.y_continuous(minvalue=0,format=:plain),
    Guide.XLabel("Month"),
    Guide.YLabel("14 day MA deaths"),
    Guide.Title(tit))
g
Out[33]:
[Gadfly line chart: 14 day MA deaths vs. Month. Title: Covid 19 in the U.S., 0020-01-22 to 0020-11-01; Cases: 9206974; Deaths: 230995]

Consider an alternative moving average function adapted from the package RollingFunctions. It’s generally preferable to reuse rather than build from scratch.

In [34]:
function masm_2(vec,win)
    return(cat(fill(missing,win-1),RollingFunctions.rollmean(vec,win),dims=1))
end
pr(2)

Check that the two moving average implementations agree.

In [35]:
print(last(DataFrame(macases14_1=round.(masm_2(ustotal.dailycases,14);digits=0),macases14_2=ustotal.macases14),14))
pr(2)
14×2 DataFrame
│ Row │ macases14_1 │ macases14_2 │
│     │ Float64?    │ Float64?    │
├─────┼─────────────┼─────────────┤
│ 1   │ 53991.0     │ 53991.0     │
│ 2   │ 55305.0     │ 55305.0     │
│ 3   │ 56194.0     │ 56194.0     │
│ 4   │ 57299.0     │ 57299.0     │
│ 5   │ 59178.0     │ 59178.0     │
│ 6   │ 61255.0     │ 61255.0     │
│ 7   │ 62412.0     │ 62412.0     │
│ 8   │ 64206.0     │ 64206.0     │
│ 9   │ 65719.0     │ 65719.0     │
│ 10  │ 67064.0     │ 67064.0     │
│ 11  │ 68840.0     │ 68840.0     │
│ 12  │ 70994.0     │ 70994.0     │
│ 13  │ 72685.0     │ 72685.0     │
│ 14  │ 75062.0     │ 75062.0     │

That’s it for now. Next time I’ll explore interoperating Julia with Python and R.
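As a preview, a minimal sketch of that interop using the RCall and PyCall packages loaded earlier. It assumes working R and Python installations, with pandas/pyarrow available on the Python side:

# hand the final dataframe to R and size it there
@rput covidfinal
R"dim(covidfinal)"
# read the feather file from Python's pandas
pd = pyimport("pandas")
println(pd.read_feather("covidfinal.feather").shape)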



Steve Miller

Steve is a data scientist focusing on research methods, data integration, statistics/ML, and computational analytics in R/Python. Follow him on LinkedIn: https://www.linkedin.com/in/steve-miller-58ab881/
