

Julia for Management/Analysis of Johns Hopkins COVID Data
Modeling | Coronavirus | COVID-19 | Julia
posted by Steve Miller, November 3, 2020

Like many analytics geeks, I’ve been tracking data on the Covid pandemic since early spring. My source is the Center for Systems Science and Engineering at Johns Hopkins University, with files for download made available at midnight Central time. I’ve established a pretty significant R infrastructure in JupyterLab to summarize daily developments. The combination of R data.table and tidyverse programming capabilities has made the computation straightforward.
There are a few quirks with the data, though, that make it a suitable challenge for less mature data management platforms. The Hopkins data consists of two CSV files, the first detailing COVID cases by date in the U.S., the second U.S. COVID fatalities. Both files are organized similarly. The granularity is sub-state, generally representing counties or other administrative geographies. Each day's data adds a new column depicting cumulative cases/fatalities for that day. The downloaded data must first be pivoted or melted to transform the date columns into rows. Split/apply/combine computations within the administrative geography key, uid, then produce daily cases and fatalities from the cumulative figures. The cases and fatalities dataframes are finally joined to produce a final dataframe that serves as the source for analysis.
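To make the shape of that pipeline concrete before tackling the real files, here's a minimal sketch on fabricated data (two geographies, three days; the frame and helper names are illustrative, and diffsm mirrors the function defined later in this post):

using DataFrames

# Fabricated wide frame in the Hopkins shape: one row per geography (uid),
# one cumulative-count column per day.
wide = DataFrame("uid" => [1, 2],
                 "1/22/20" => [0, 1],
                 "1/23/20" => [2, 1],
                 "1/24/20" => [5, 4])

# 1) Melt the daily date columns into rows.
long = stack(wide, 2:4, variable_name=:date, value_name=:cumcases)
sort!(long, [:uid, :date])

# 2) Difference the cumulative counts within each uid to recover daily counts;
#    the first day of each group has no predecessor, hence the leading missing.
diffsm(a) = vcat(missing, diff(a))
daily = combine(groupby(long, :uid), :date, :cumcases => diffsm => :dailycases)

# 3) The fatalities file gets the same melt/diff treatment, after which the
#    two long frames are innerjoin'ed on (uid, date).

The real files add identifier columns (iso codes, FIPS, state, coordinates) and a date-parsing step, but the melt, group-difference, and join skeleton is the same.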
For this blog, I investigate the handling of my COVID data in Julia, using primarily its DataFrames package. In subsequent posts, I'll consider Python/Pandas and the new Python datatable. The focus of this exercise is data analysis, consisting of basic wrangling plus the display of computations, in both list and graphical modes. Heavier statistical/ML exercises will come in future blogs.
The code that follows first reads and munges daily cases and fatalities CSV files downloaded from the Hopkins Covid data portal, computing additional attributes such as running sums and moving averages. In the end, I display stats with both listings and graphics.
The results of this inquiry are encouraging. I did run into a few snarls along the way, but I found Julia 1.5.2 much easier to work with than its predecessors, the result of both functionality enhancement/stabilization and documentation availability, especially tips on Stack Overflow. That's a welcome sign. Julia's performance is acceptable, better than I've experienced previously, though still lagging R and Python on JupyterLab. I suspect at least part of this deficit is that Windows and Jupyter are not ideal Julia staging platforms. In addition, I believe Julia shines more in intensive algorithmic computation than it does in data management/analysis.
[Related article: Jupyter Notebook: Python or R—Or Both?]
The software used below is JupyterLab 2.1.2, Julia 1.5.2, Python 3.7.5, and R 4.0.2. I’ll continue investigating Julia for data analysis next time, with a special focus then on divining interoperability with R and Python.
macro timend()
    quote
        stop = time()
        elapsed = stop - start
        println(elapsed)
    end
end

println(); println()
function pr(howmany)
    for i = 1:howmany
        println()
    end
end

pr(2)
function showln(obj)
    show(obj)
    pr(2)
end

pr(2)
Load pertinent packages.
using DataFrames, PyCall, RCall, RDatasets, CSV, DataFramesMeta, Base
using Dates, Feather, Gadfly, DataTables, RollingFunctions, Printf

pr(2)

[1] "C:/R/4.0.2/library" "C:/Program Files/R/R-4.0.2/library"
Define dataframe meta data and frequencies functions, plus a few convenience ones.
function metaj(df; data=false)
    println(DataFrames.typeof(df))
    println(DataFrames.size(df))
    println(hcat(names(df), eltypes(df)))
    if data
        print(DataFrames.first(df, 6))
        pr(1)
        print(DataFrames.last(df, 6))
    end
    pr(2)
end

pr(2)
function frequencies(df, vars)
    freqs = combine(DataFrames.groupby(df, vars), nrow)
    freqs = DataFrames.rename(freqs, :nrow => :count);
    freqs[!, :percent] = 100*freqs.count/sum(freqs.count);
    sort!(freqs, [DataFrames.order(:count, rev = true)]);
    return (freqs)
end

pr(2)
function sumna(vec)
    return (sum(skipmissing(vec)))
end

pr(2)
function mknames(df)
    from = names(df);
    to = [Symbol(replace(lowercase(string(f)), " " => "")) for f in from];
    to = [Symbol(replace(lowercase(string(t)), "_" => "")) for t in to];
    return (to)
end

pr(2)
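A quick sanity check of mknames on a toy frame with Hopkins-style column names (the values are placeholders):

df = DataFrame("Province_State" => ["Wyoming"], "Long_" => [-107.5])
println(mknames(df))    # [:provincestate, :long]
rename!(df, mknames(df))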
Download the latest U.S. Johns Hopkins COVID data. Note the repeating-group structure of daily date columns, each holding cumulative cases for the geographies identified by the key UID.
start = time()

burl = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/";
fname = "csse_covid_19_time_series/time_series_covid19_confirmed_US.csv";

covidcases = CSV.read(download(string(burl, fname)));
metaj(covidcases)

@timend

pr(2)
(3340, 296)
Any["UID" Int64; "iso2" String; "iso3" String; "code3" Int64; "FIPS" Union{Missing, Float64}; "Admin2" Union{Missing, String}; "Province_State" String; "Country_Region" String; "Lat" Float64; "Long_" Float64; "Combined_Key" String; "1/22/20" Int64 … "11/1/20" Int64]
(285 daily date columns, all Int64, truncated)
10.269000053405762
print(last(covidcases[:,[1,2,3,4,5,6,7,8]],15))
15×8 DataFrame
│ Row │ UID      │ iso2   │ iso3   │ code3 │ FIPS     │ Admin2     │ Province_State │ Country_Region │
│     │ Int64    │ String │ String │ Int64 │ Float64? │ String?    │ String         │ String         │
├─────┼──────────┼────────┼────────┼───────┼──────────┼────────────┼────────────────┼────────────────┤
│ 1   │ 84056021 │ US     │ USA    │ 840   │ 56021.0  │ Laramie    │ Wyoming        │ US             │
│ 2   │ 84056023 │ US     │ USA    │ 840   │ 56023.0  │ Lincoln    │ Wyoming        │ US             │
│ 3   │ 84056025 │ US     │ USA    │ 840   │ 56025.0  │ Natrona    │ Wyoming        │ US             │
│ 4   │ 84056027 │ US     │ USA    │ 840   │ 56027.0  │ Niobrara   │ Wyoming        │ US             │
│ 5   │ 84080056 │ US     │ USA    │ 840   │ 80056.0  │ Out of WY  │ Wyoming        │ US             │
│ 6   │ 84056029 │ US     │ USA    │ 840   │ 56029.0  │ Park       │ Wyoming        │ US             │
│ 7   │ 84056031 │ US     │ USA    │ 840   │ 56031.0  │ Platte     │ Wyoming        │ US             │
│ 8   │ 84056033 │ US     │ USA    │ 840   │ 56033.0  │ Sheridan   │ Wyoming        │ US             │
│ 9   │ 84056035 │ US     │ USA    │ 840   │ 56035.0  │ Sublette   │ Wyoming        │ US             │
│ 10  │ 84056037 │ US     │ USA    │ 840   │ 56037.0  │ Sweetwater │ Wyoming        │ US             │
│ 11  │ 84056039 │ US     │ USA    │ 840   │ 56039.0  │ Teton      │ Wyoming        │ US             │
│ 12  │ 84056041 │ US     │ USA    │ 840   │ 56041.0  │ Uinta      │ Wyoming        │ US             │
│ 13  │ 84090056 │ US     │ USA    │ 840   │ 90056.0  │ Unassigned │ Wyoming        │ US             │
│ 14  │ 84056043 │ US     │ USA    │ 840   │ 56043.0  │ Washakie   │ Wyoming        │ US             │
│ 15  │ 84056045 │ US     │ USA    │ 840   │ 56045.0  │ Weston     │ Wyoming        │ US             │
Pivot this initial dataframe so that dates/cumulative cases are represented in rows rather than columns. Change the “date” attribute from string to date and sort the dataframe by (uid,date).
start = time()

nc = ncol(covidcases);
covidcasesm = DataFrames.stack(covidcases, 12:nc);
covidcasesm = DataFrames.rename(covidcasesm, :variable => :date);
covidcasesm = DataFrames.rename(covidcasesm, :value => :cumcases);
rename!(covidcasesm, mknames(covidcasesm))

df = DateFormat("mm/dd/yy");
covidcasesm[:date] = [Date(string(d), df) for d in covidcasesm[:date]]
sort!(covidcasesm, [:uid, :date]);

metaj(covidcasesm);

@timend

pr(2)
DataFrame
(951900, 13)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "date" Date; "cumcases" Int64]
5.414999961853027
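One wrinkle worth flagging: DateFormat("mm/dd/yy") takes the Hopkins two-digit years literally, so the parsed dates land in year 0020 rather than 2020, as the listings further down show. Ordering and date arithmetic still behave, but if calendar-correct years matter, a Year(2000) shift repairs them. A minimal illustration:

d = Date("10/19/20", DateFormat("mm/dd/yy"))
println(d)                      # 0020-10-19
println(d + Dates.Year(2000))   # 2020-10-19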
Commence split/apply/combine processing, first grouping the cases dataframe by uid.
start = time()

grpvars = [:uid]
gdf = DataFrames.groupby(covidcasesm, grpvars);

@timend

pr(2)
function diffsm(a)
    return cat([missing], diff(a), dims=1)
end

pr(2)
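A quick worked check: diffsm([0, 2, 5]) returns [missing, 2, 3], so the first observation of each group carries a missing daily value rather than a spurious difference against the prior group.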
Compute dailycases by applying the diffsm function within uid groups of the just-melted dataframe; the new column is concatenated to that dataframe in the following step.
start = time()

slug = combine(gdf, :cumcases => diffsm => :dailycases)
metaj(slug)

@timend

pr(2)
(951900, 2)
Any["uid" Int64; "dailycases" Union{Missing, Int64}]
2.8910000324249268
Bind the dailycases column to the dataframe.
start = time()

covidcasesm = hcat(covidcasesm[:, 1:13], slug[[:dailycases]]);
metaj(covidcasesm)

@timend

pr(2)
DataFrame
(951900, 14)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "date" Date; "cumcases" Int64; "dailycases" Union{Missing, Int64}]
0.2049999237060547
Quick check on the total number of cases.
print(sumna(covidcasesm[:dailycases]))

pr(2)

9206974
start = time()

burl = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/";
fname = "time_series_covid19_deaths_US.csv";

coviddeaths = CSV.read(download(string(burl, fname)))
metaj(coviddeaths)

@timend

pr(2)
DataFrame
(3340, 297)
Any["UID" Int64; "iso2" String; "iso3" String; "code3" Int64; "FIPS" Union{Missing, Float64}; "Admin2" Union{Missing, String}; "Province_State" String; "Country_Region" String; "Lat" Float64; "Long_" Float64; "Combined_Key" String; "Population" Int64; "1/22/20" Int64 … "11/1/20" Int64]
(285 daily date columns, all Int64, truncated)
0.5740001201629639
Next, pivot columns to rows as above.
nc = ncol(coviddeaths);
coviddeathsm = DataFrames.stack(coviddeaths, 13:nc);
coviddeathsm = DataFrames.rename(coviddeathsm, :variable => :date);
coviddeathsm = DataFrames.rename(coviddeathsm, :value => :cumdeaths);
rename!(coviddeathsm, mknames(coviddeathsm))

df = DateFormat("mm/dd/yy");
coviddeathsm[:date] = [Date(string(d), df) for d in coviddeathsm[:date]]
sort!(coviddeathsm, [:uid, :date]);

metaj(coviddeathsm)

pr(2)
(951900, 14)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "population" Int64; "date" Date; "cumdeaths" Int64]
grpvars = [:uid]
gdf = DataFrames.groupby(coviddeathsm, grpvars);

pr(2)
slug = combine(gdf, :cumdeaths => diffsm => :dailydeaths)
metaj(slug)

pr(2)
DataFrame
(951900, 2)
Any["uid" Int64; "dailydeaths" Union{Missing, Int64}]
Column bind.
coviddeathsm = hcat(coviddeathsm[:, 1:14], slug[[:dailydeaths]]);
metaj(coviddeathsm)

pr(2)
(951900, 15)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "population" Int64; "date" Date; "cumdeaths" Int64; "dailydeaths" Union{Missing, Int64}]
The latest cumulative fatalities according to Johns Hopkins COVID Data ...
print(sumna(coviddeathsm.dailydeaths))

pr(2)

230995
Join the cases and deaths dataframes by (uid,date).
start = time()

covidfinal = innerjoin(covidcasesm, coviddeathsm[[:uid, :date, :cumdeaths, :dailydeaths]],
                       on = [:uid => :uid, :date => :date]);
metaj(covidfinal)

@timend

pr(2)
DataFrame
(951900, 16)
Any["uid" Int64; "iso2" String; "iso3" String; "code3" Int64; "fips" Union{Missing, Float64}; "admin2" Union{Missing, String}; "provincestate" String; "countryregion" String; "lat" Float64; "long" Float64; "combinedkey" String; "date" Date; "cumcases" Int64; "dailycases" Union{Missing, Int64}; "cumdeaths" Int64; "dailydeaths" Union{Missing, Int64}]
2.496000051498413
Confirm total cases and fatalities.
println(sumna(covidfinal.dailycases))
print(sumna(covidfinal[:dailydeaths]))

pr(2)

9206974
230995
Define a function to compute rolling or moving averages with dataframe count columns. I’ll consider alternatives later.
function masm_1(a; window=14)
    a = convert(Array{Float64,1}, a)
    lena = length(a);
    b = similar(a)
    for j in lena:-1:window
        summ = 0
        for jj in j:-1:(j-window+1)
            summ += a[jj]
        end
        b[j] = round(summ/window)
    end
    return (cat(fill(missing, window-1), b[window:lena], dims=1))
end

pr(2)
ustotal = by(covidfinal, [:date],
             :dailycases => sumna => :dailycases,
             :dailydeaths => sumna => :dailydeaths);

ustotal["dayofweek"] = Dates.format.(ustotal.date, "e")
ustotal[:cumcases] = cumsum(ustotal.dailycases)
ustotal[:cumdeaths] = cumsum(ustotal.dailydeaths)

ustotal[:macases7] = masm_1(ustotal[:dailycases], window=7);
ustotal[:madeaths7] = masm_1(ustotal[:dailydeaths], window=7);
ustotal[:macases14] = masm_1(ustotal[:dailycases], window=14);
ustotal[:madeaths14] = masm_1(ustotal[:dailydeaths], window=14);

metaj(ustotal)
pr(2)

print(last(ustotal, 14))
pr(2)
(285, 10)
Any["date" Date; "dailycases" Int64; "dailydeaths" Int64; "dayofweek" String; "cumcases" Int64; "cumdeaths" Int64; "macases7" Union{Missing, Float64}; "madeaths7" Union{Missing, Float64}; "macases14" Union{Missing, Float64}; "madeaths14" Union{Missing, Float64}]

14×10 DataFrame
│ Row │ date       │ dailycases │ dailydeaths │ dayofweek │ cumcases │ cumdeaths │ macases7 │ madeaths7 │ macases14 │ madeaths14 │
│     │ Date       │ Int64      │ Int64       │ String    │ Int64    │ Int64     │ Float64? │ Float64?  │ Float64?  │ Float64?   │
├─────┼────────────┼────────────┼─────────────┼───────────┼──────────┼───────────┼──────────┼───────────┼───────────┼────────────┤
│ 1   │ 0020-10-19 │ 58395      │ 445         │ Mon       │ 8214508  │ 220122    │ 58425.0  │ 719.0     │ 53991.0   │ 710.0      │
│ 2   │ 0020-10-20 │ 60355      │ 933         │ Tue       │ 8274863  │ 221055    │ 59554.0  │ 738.0     │ 55305.0   │ 726.0      │
│ 3   │ 0020-10-21 │ 62776      │ 1125        │ Wed       │ 8337639  │ 222180    │ 60017.0  │ 758.0     │ 56194.0   │ 741.0      │
│ 4   │ 0020-10-22 │ 71701      │ 856         │ Thu       │ 8409340  │ 223036    │ 61166.0  │ 763.0     │ 57299.0   │ 734.0      │
│ 5   │ 0020-10-23 │ 83747      │ 943         │ Fri       │ 8493087  │ 223979    │ 63249.0  │ 772.0     │ 59178.0   │ 730.0      │
│ 6   │ 0020-10-24 │ 83730      │ 914         │ Sat       │ 8576817  │ 224893    │ 66990.0  │ 801.0     │ 61255.0   │ 751.0      │
│ 7   │ 0020-10-25 │ 60807      │ 340         │ Sun       │ 8637624  │ 225233    │ 68787.0  │ 794.0     │ 62412.0   │ 747.0      │
│ 8   │ 0020-10-26 │ 66798      │ 477         │ Mon       │ 8704422  │ 225710    │ 69988.0  │ 798.0     │ 64206.0   │ 759.0      │
│ 9   │ 0020-10-27 │ 73632      │ 986         │ Tue       │ 8778054  │ 226696    │ 71884.0  │ 806.0     │ 65719.0   │ 772.0      │
│ 10  │ 0020-10-28 │ 78358      │ 989         │ Wed       │ 8856412  │ 227685    │ 74110.0  │ 786.0     │ 67064.0   │ 772.0      │
│ 11  │ 0020-10-29 │ 88521      │ 971         │ Thu       │ 8944933  │ 228656    │ 76513.0  │ 803.0     │ 68840.0   │ 783.0      │
│ 12  │ 0020-10-30 │ 99321      │ 1030        │ Fri       │ 9044254  │ 229686    │ 78738.0  │ 815.0     │ 70994.0   │ 793.0      │
│ 13  │ 0020-10-31 │ 81227      │ 862         │ Sat       │ 9125481  │ 230548    │ 78381.0  │ 808.0     │ 72685.0   │ 804.0      │
│ 14  │ 0020-11-01 │ 81493      │ 447         │ Sun       │ 9206974  │ 230995    │ 81336.0  │ 823.0     │ 75062.0   │ 808.0      │
usdayofweek = by(ustotal, [:dayofweek],
                 :dailycases => sumna => :dailycases,
                 :dailydeaths => sumna => :dailydeaths);

usdayofweek["pctcases"] = round.(100*usdayofweek.dailycases/sumna(usdayofweek.dailycases); digits=2);
usdayofweek["pctdeaths"] = round.(100*usdayofweek.dailydeaths/sumna(usdayofweek.dailydeaths); digits=2);

print(hcat(frequencies(ustotal, ["dayofweek"])[1:2], usdayofweek[[2, 4, 3, 5]]))

pr(2)
7×6 DataFrame
│ Row │ dayofweek │ count │ dailycases │ pctcases │ dailydeaths │ pctdeaths │
│     │ String    │ Int64 │ Int64      │ Float64  │ Int64       │ Float64   │
├─────┼───────────┼───────┼────────────┼──────────┼─────────────┼───────────┤
│ 1   │ Wed       │ 41    │ 1316918    │ 14.3     │ 41693       │ 18.05     │
│ 2   │ Thu       │ 41    │ 1428144    │ 15.51    │ 38222       │ 16.55     │
│ 3   │ Fri       │ 41    │ 1529602    │ 16.61    │ 37455       │ 16.21     │
│ 4   │ Sat       │ 41    │ 1380229    │ 14.99    │ 31917       │ 13.82     │
│ 5   │ Sun       │ 41    │ 1178217    │ 12.8     │ 19792       │ 8.57      │
│ 6   │ Mon       │ 40    │ 1127424    │ 12.25    │ 22645       │ 9.8       │
│ 7   │ Tue       │ 40    │ 1246440    │ 13.54    │ 39271       │ 17.0      │
Save covidfinal to a portable feather file for subsequent use in Julia, Python, and R.
wdir = "/data/covid19/julia"
cd(wdir)

fname = "covidfinal.feather"
Feather.write(fname, covidfinal)

pr(2)
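As a sanity check (not part of the production flow), the file can be read straight back into Julia before handing it off to R and Python:

check = Feather.read(fname)
metaj(check)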
Perform calculations used in subsequent graphs.
tit = @sprintf "%s%s to %s\nCases: %s; Deaths: %s\n" "Covid 19 in the U.S.\n" first(ustotal).date last(ustotal).date sumna(ustotal.dailycases) sumna(ustotal.dailydeaths)

annodate = last(ustotal).date + Dates.Day(15)
annodatestr = string(last(ustotal).date)
annocases = last(ustotal).macases7
annocasesstr = last(ustotal).macases7
annodeaths = last(ustotal).madeaths14
annodeathsstr = last(ustotal).madeaths14

pr(2)
Produce Gadfly graphs of case and fatality moving averages.
set_default_plot_size(12cm, 12cm)
Gadfly.push_theme(:dark)

pr(2)
g = Gadfly.plot(ustotal, x=:date, y=:macases7,
Gadfly.Geom.line,
Gadfly.Scale.y_continuous(minvalue=0,format=:plain),
Guide.XLabel("Month"),
Guide.YLabel("7 day MA cases"),
Guide.Title(tit))
g
g = Gadfly.plot(ustotal, x=:date, y=:madeaths14,
Gadfly.Geom.line,
Gadfly.Scale.y_continuous(minvalue=0,format=:plain),
Guide.XLabel("Month"),
Guide.YLabel("14 day MA deaths"),
Guide.Title(tit))
g
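To keep copies of the charts outside the notebook, Gadfly plots can be rendered to files with draw. A minimal sketch writing SVG (the filename is arbitrary; PNG output would additionally require the Cairo backend):

draw(SVG("usdeaths14ma.svg", 12cm, 12cm), g)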

Consider an alternative moving average function adapted from the package RollingFunctions. It’s generally preferable to reuse rather than build from scratch.
function masm_2(vec, win)
    return (cat(fill(missing, win-1), RollingFunctions.rollmean(vec, win), dims=1))
end

pr(2)
Check.
print(last(DataFrame(macases14_1 = round.(masm_2(ustotal.dailycases, 14); digits=0),
                     macases14_2 = ustotal.macases14), 14))

pr(2)
14×2 DataFrame
│ Row │ macases14_1 │ macases14_2 │
│     │ Float64?    │ Float64?    │
├─────┼─────────────┼─────────────┤
│ 1   │ 53991.0     │ 53991.0     │
│ 2   │ 55305.0     │ 55305.0     │
│ 3   │ 56194.0     │ 56194.0     │
│ 4   │ 57299.0     │ 57299.0     │
│ 5   │ 59178.0     │ 59178.0     │
│ 6   │ 61255.0     │ 61255.0     │
│ 7   │ 62412.0     │ 62412.0     │
│ 8   │ 64206.0     │ 64206.0     │
│ 9   │ 65719.0     │ 65719.0     │
│ 10  │ 67064.0     │ 67064.0     │
│ 11  │ 68840.0     │ 68840.0     │
│ 12  │ 70994.0     │ 70994.0     │
│ 13  │ 72685.0     │ 72685.0     │
│ 14  │ 75062.0     │ 75062.0     │
That’s it for now. Next time I’ll explore interoperating Julia with Python and R.