Jupyter Notebook: Python or R—Or Both?

I was analytically betwixt and between a few weeks ago. Most of my Jupyter Notebook work is done in either Python or R. Indeed, I like to self-demonstrate the power of each platform by recoding R work in Python and vice versa.

I must have a dozen active notebooks, some developed in Python and some in R, that investigate the performance of the stock market using indexes like the S&P 500, Wilshire 5000, and Russell 3000. These index data can be readily downloaded from the web, using core functionality of either platform, for subsequent analytic processing and display. I’ve learned a lot about both platforms, and about the stock market, developing such notebooks. Alas, for the latest exercise I’d envisioned, I wished to look at Russell 3000 data, available in a Python notebook, through the lens of analytics I’d coded for the Wilshire 5000 in R. The mishmash of notebook kernel and stock market index combinations I possessed didn’t meet my needs.

[Related Article: Snakes in a Package: Combining Python and R with Reticulate]

I figured I had a couple of choices. I could either rewrite the R graphics code in Python using the ever-improving Seaborn library, or I could adapt the R ggplot Wilshire code to interoperate with Russell 3000 data using the Python library rpy2 and its R magic—in effect engaging R within Python. I decided to do the latter.

R and Python are two of the top analytics platforms. Both are open source, both have large user bases, and both have incredibly productive ecosystems. In addition, they interoperate: Python developers can use the rpy2 library to include R code in their scripts, while R developers have access to Python via the reticulate package. There’s also the feather package, available in both, for sharing data across platforms. I fully expect even more seamless collaboration between them in the near future.
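
As a small illustration of that data sharing, here’s a minimal sketch of a feather hand-off from Pandas, assuming the pyarrow package is installed; the file name and dataframe are illustrative only, not drawn from the analysis below.

import pandas as pd

# build a small illustrative dataframe
df = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=5),
                   "level": [100.0, 101.5, 99.8, 102.3, 103.1]})

# write a language-neutral feather file
df.to_feather("shared.feather")

# R can read the same file with feather::read_feather("shared.feather"),
# and Python can pull it back with pd.read_feather("shared.feather").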

For the analysis that follows, I focus on the performance of the FTSE Russell 3000 index using Python. I first download two files, a year-to-date and a history, that together provide final daily Russell 3000 index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data to build the final Pandas dataframe. From there, I build R dataframes showing the growth of $1 invested over time, suitable for ggplot2 charting.

[Related Article: Introduction to R Shiny]

The technology used below is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, R 3.6.0, and rpy2 2.9.4. This notebook’s kernel is Python 3 and uses the rpy2 library to enable R processing. In this instance, the initial data work is done in Python/Pandas, then handed off for graphics to the splendid R ggplot2 library.

Tone down warnings on pandas filters.

In [1]:
import warnings
warnings.filterwarnings("ignore")
print("\n\n")

Add a local directory housing a personal library to the import path.

In [2]:
import sys
functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)
print("\n\n")

Load the personal library. The prarr and blanks functions are used in this article.

In [3]:
from myfuncs import *
blanks(2)
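
The prarr and blanks functions themselves aren’t listed here. Based on how they’re used below (prarr displays the head and tail of a dataframe; blanks emits empty lines), hypothetical stand-ins might look like this; the names are real but the implementations are assumed:

def prarr(df, n=5):
    # hypothetical: print the first and last n rows, as in the output shown below
    print(df.head(n), "\n")
    print(df.tail(n))

def blanks(n=1):
    # hypothetical: print n blank lines to space out cell output
    print("\n" * n)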

Import other relevant Python libraries.

In [4]:
import os
import pandas as pd
import numpy as np
blanks(2)

Set and migrate to the new working directory.

In [5]:
owd = os.getcwd()
nwd = "c:/data/russell/potus"
os.chdir(nwd)
print(os.getcwd())
blanks(2)
c:\data\russell\potus

Establish a list of URLs of the CSV files to be downloaded from the FTSE Russell website. The two files are the year-to-date (ytd) and history (hist) of the daily Russell 3000 index levels.

In [6]:
urllst = [
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valuesytd_US3000.csv",
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valueshist_US3000.csv"
]
blanks(2)

Load the ytd file into a pandas dataframe. Do a little munging.

In [7]:
ytd = pd.read_csv(urllst[0])
# rename columns, tag the portfolio, and coerce date to datetime
ytd.columns = ["portfolio","date","levwodiv","levwdiv"]
ytd['portfolio'] = "r3000"
ytd['date'] = pd.DatetimeIndex(ytd.date)
# sort chronologically, drop the leading record, and renumber the index
ytd = ytd.sort_values(['date']).iloc[1:,]
ytd.reset_index(drop=True,inplace=True)
prarr(ytd)
blanks(2)
  portfolio       date     levwodiv      levwdiv
0     r3000 2019-01-01  2704.369326  7733.105730
1     r3000 2019-01-02  2707.170883  7741.158427
2     r3000 2019-01-03  2643.051723  7559.582582
3     r3000 2019-01-04  2734.493258  7821.127295
4     r3000 2019-01-07  2757.993050  7888.401675 
    portfolio       date     levwodiv      levwdiv
118     r3000 2019-06-14  3117.538895  8996.253148
119     r3000 2019-06-17  3122.143259  9009.636226
120     r3000 2019-06-18  3153.227920  9099.614008
121     r3000 2019-06-19  3163.499294  9129.317529
122     r3000 2019-06-20  3191.645509  9211.282316

Ditto for hist.

In [8]:
hist = pd.read_csv(urllst[1])
hist.columns = ["portfolio","date","levwodiv","levwdiv"]
hist['portfolio'] = "r3000"
hist['date'] = pd.DatetimeIndex(hist.date)
hist.sort_values(['date'],inplace=True)
hist.reset_index(drop=True,inplace=True)
prarr(hist)
blanks(2)
  portfolio       date     levwodiv      levwdiv
0     r3000 2005-01-03  1261.856142  2742.745480
1     r3000 2005-01-04  1245.475031  2707.401151
2     r3000 2005-01-05  1238.467380  2692.398250
3     r3000 2005-01-06  1242.776350  2702.305567
4     r3000 2005-01-07  1240.174458  2696.663529 
     portfolio       date     levwodiv      levwdiv
3645     r3000 2018-12-25  2536.218945  7248.537164
3646     r3000 2018-12-26  2662.170289  7608.685862
3647     r3000 2018-12-27  2682.958961  7668.843455
3648     r3000 2018-12-28  2681.191917  7665.856863
3649     r3000 2018-12-31  2704.369326  7733.105730

Next, vertically “stack” hist and ytd into a “combine” dataframe.

In [9]:
combine = pd.concat([ytd,hist])
combine.sort_values(['date'],inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)
(3773, 4) 

  portfolio       date     levwodiv      levwdiv
0     r3000 2005-01-03  1261.856142  2742.745480
1     r3000 2005-01-04  1245.475031  2707.401151
2     r3000 2005-01-05  1238.467380  2692.398250
3     r3000 2005-01-06  1242.776350  2702.305567
4     r3000 2005-01-07  1240.174458  2696.663529 
    portfolio       date     levwodiv      levwdiv
118     r3000 2019-06-14  3117.538895  8996.253148
119     r3000 2019-06-17  3122.143259  9009.636226
120     r3000 2019-06-18  3153.227920  9099.614008
121     r3000 2019-06-19  3163.499294  9129.317529
122     r3000 2019-06-20  3191.645509  9211.282316

Delete duplicate level records that emerge from market holidays, when the prior trading day’s levels are simply carried forward.

In [10]:
combine.drop_duplicates(subset=['levwodiv','levwdiv'], inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)
(3640, 4) 

  portfolio       date     levwodiv      levwdiv
0     r3000 2005-01-03  1261.856142  2742.745480
1     r3000 2005-01-04  1245.475031  2707.401151
2     r3000 2005-01-05  1238.467380  2692.398250
3     r3000 2005-01-06  1242.776350  2702.305567
4     r3000 2005-01-07  1240.174458  2696.663529 
    portfolio       date     levwodiv      levwdiv
118     r3000 2019-06-14  3117.538895  8996.253148
119     r3000 2019-06-17  3122.143259  9009.636226
120     r3000 2019-06-18  3153.227920  9099.614008
121     r3000 2019-06-19  3163.499294  9129.317529
122     r3000 2019-06-20  3191.645509  9211.282316

Compute percent change attributes for both level with dividends and level without dividends.

In [11]:
combine['pctwodiv'] = combine.levwodiv.pct_change()
combine['pctwdiv'] = combine.levwdiv.pct_change()
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)
(3640, 6) 

  portfolio       date     levwodiv      levwdiv  pctwodiv   pctwdiv
0     r3000 2005-01-03  1261.856142  2742.745480       NaN       NaN
1     r3000 2005-01-04  1245.475031  2707.401151 -0.012982 -0.012886
2     r3000 2005-01-05  1238.467380  2692.398250 -0.005626 -0.005541
3     r3000 2005-01-06  1242.776350  2702.305567  0.003479  0.003680
4     r3000 2005-01-07  1240.174458  2696.663529 -0.002094 -0.002088 
    portfolio       date     levwodiv      levwdiv  pctwodiv   pctwdiv
118     r3000 2019-06-14  3117.538895  8996.253148 -0.002559 -0.002440
119     r3000 2019-06-17  3122.143259  9009.636226  0.001477  0.001488
120     r3000 2019-06-18  3153.227920  9099.614008  0.009956  0.009987
121     r3000 2019-06-19  3163.499294  9129.317529  0.003257  0.003264
122     r3000 2019-06-20  3191.645509  9211.282316  0.008897  0.008978

Write a csv file from the final dataframe. Then compute the growth of $1 from the inception of the data as the product of (1 + daily return), first without and then with dividends reinvested.

In [12]:
out = "r3000pd.csv"
combine.to_csv(out,index=None, sep=',')
print(round((1+combine.pctwodiv).prod(),2))
print(round((1+combine.pctwdiv).prod(),2))
blanks(2)
2.53
3.36
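
Since the pct_change ratios telescope, the product of (1 + daily return) must equal the ratio of the last index level to the first. A quick sanity check on the same dataframe (an illustrative aside, not part of the original notebook):

# the product of (1 + r) telescopes to last level / first level
print(round(combine.levwodiv.iloc[-1]/combine.levwodiv.iloc[0],2))  # 2.53
print(round(combine.levwdiv.iloc[-1]/combine.levwdiv.iloc[0],2))   # 3.36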

Load the rpy2 (R within Python) IPython extension.

In [13]:
%load_ext rpy2.ipython
blanks(2)

Import pertinent rpy2 libraries.

In [14]:
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
import rpy2.robjects.pandas2ri   # explicit import so robjects.pandas2ri resolves
robjects.pandas2ri.activate()
blanks(2)

Create a version of the Pandas combine dataframe suitable for R processing.

In [15]:
r3000 = robjects.pandas2ri.py2ri(combine[['date','pctwdiv']])
blanks(2)

Take a peek at the R data.frame.

In [16]:
%R -i r3000 tail(r3000)
Out[16]:
   date                        pctwdiv
0  2019-06-13 00:00:00-05:00   0.004958
1  2019-06-14 00:00:00-05:00  -0.002440
2  2019-06-17 00:00:00-05:00   0.001488
3  2019-06-18 00:00:00-05:00   0.009987
4  2019-06-19 00:00:00-05:00   0.003264
5  2019-06-20 00:00:00-05:00   0.008978

Load relevant R libraries.

In [17]:
%R require(tidyverse); require(data.table); require(RColorBrewer); require(R.utils); require(lubridate)
Out[17]:
array([1], dtype=int32)

Create a cell of R processing with the %%R magic: the -i flag pushes the r3000 dataframe into R, while -w and -h set the figure width and height. Commence R wrangling and graphics processing, then display the final chart.

In [18]:
%%R  -w700 -h700 -i r3000 
r3000 <- data.table(r3000)
mdte <- max(r3000[['date']])
dte <- substr(mdte,6,11)
tdates <- lubridate::date(paste(c("2019-","2011-","2015-"),dte,sep=""))
fdates <- lubridate::date(c("2017-01-20","2009-01-20","2013-01-21"))
#############################################
# function to build the data.table for ggplot.
#############################################
nmkreturn <- function(to,from,dt)
{
    # use the to/from date args rather than globals to delimit each
    # administration, computing cumulative growth of $1 with dividends.
    rbind(
    data.table(potus='Trump',
           date=dt[date>=from[1] & date<=to[1]]$date,
           returnpct=dt[date>=from[1] & date<=to[1],cumprod(1+pctwdiv)]
          ),
    data.table(potus='Obama 1',
           date=dt[date>=from[2] & date<=to[2]]$date,
           returnpct=dt[date>=from[2] & date<=to[2],cumprod(1+pctwdiv)]
          ),
    data.table(potus='Obama 2',
            date=dt[date>=from[3] & date<=to[3]]$date,
            returnpct=dt[date>=from[3] & date<=to[3],cumprod(1+pctwdiv)]
          )
    )
}
#########################################
# nwork is the final graphics data.table.
#########################################
work <- nmkreturn(tdates,fdates,r3000)
nwork <- data.table(left_join(work,work[,.(legend=paste(potus,date[1], date[.N], "\n",sep="\n")),.(potus)],by=c("potus"="potus")))

ends <- nwork[,.(date=date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]
X <- ends$date
Y <- ends$returnpct
##########################################
# save the data objects to an .RData file.
##########################################
ofile = "r3000.rdata"
save(r3000,X,work,nwork,file=ofile)
 
###########################################
# set parm vars and execute the ggplot code.
###########################################
titstr <- paste("Russell 3000 Returns", " thru ", mdte,sep="")
stitstr <- "Trump vs Obama\n"
xstr <- "\nYear"
ystr <- "Growth %\n"
cstr <- "Administration\n"
pal <- brewer.pal(9,"Blues")
g <- ggplot(nwork,aes(x=date,y=100*(returnpct-1), col=legend)) +
      geom_line(size=.7) +
      theme(legend.position = "right", plot.background = element_rect(fill = pal[2]), 
      panel.background = element_rect(fill = pal[2])) +
      theme(axis.text.x = element_text(size=10, angle = 45)) +
      labs(title=titstr,subtitle=stitstr,x=xstr,y=ystr,col=cstr) + 
      annotate("text", x = X+100*24*60*60, y = Y, label = Y, size=4) 
g

28-month stock market performance is solid for 45. Comparable-period performance is even better for each administration of 44.

Steve Miller

Steve is a data scientist focusing on research methods, data integration, statistics/ML, and computational analytics in R/Python. Follow him on LinkedIn: https://www.linkedin.com/in/steve-miller-58ab881/
