

Jupyter Notebook: Python or R—Or Both?
Posted by Steve Miller, June 13, 2019

I was analytically betwixt and between a few weeks ago. Most of my Jupyter Notebook work is done in either Python or R. Indeed, I like to self-demonstrate the power of each platform by recoding R work in Python and vice-versa.
I must have a dozen active notebooks, some developed in Python and some in R, that investigate the performance of the stock market using indexes like the S&P 500, Wilshire 5000, and Russell 3000. These index data can be readily downloaded from the web using core functionality of either language, for subsequent analytic processing and display. I've learned a lot about both platforms, and about the stock market, developing such notebooks. Alas, for the latest exercise I'd envisioned, I wished to look at Russell 3000 data, available in a Python notebook, through the lens of analytics I'd coded for the Wilshire 5000 in R. The mishmash of notebook kernel and stock market index combinations I possessed didn't meet my needs.
[Related Article: Snakes in a Package: Combining Python and R with Reticulate]
I figured I had a couple of choices. I could either re-write the R graphics code in Python using the ever-improving Seaborn library, or I could adapt the R ggplot Wilshire code to interoperate with Russell 3000 data using the Python library rpy2 and RMagic—in effect engaging R within Python. I decided to do the latter.
R and Python are two of the top analytics platforms. Both are open source, both have large user bases, and both have incredibly productive ecosystems. In addition, they interoperate: Python developers can use the rpy2 library to include R code in their scripts, while R developers have access to Python via the reticulate package. There’s also the feather package, available in both, for sharing data across platforms. I fully expect even more seamless collaboration between them in the near future.
For the analysis that follows, I focus on the performance of the FTSE Russell 3000 index using Python. I first download two files, a year-to-date and a history, that together provide final daily Russell 3000 index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data to build the final Pandas dataframe. From there, I build R dataframes to show the growth of $1 invested over time, suitable for ggplot2 charting.
[Related Article: Introduction to R Shiny]
The technology used below is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, R 3.6.0, and rpy2 2.9.4. This notebook’s kernel is Python 3 and uses the rpy2 library to enable R processing. In this instance, the initial data work is done in Python/Pandas, then handed off for graphics to the splendid R ggplot2 library.
Tone down warnings on pandas filters.
import warnings
warnings.filterwarnings("ignore")
print("\n\n")
Add a local directory housing a personal library to the import path.
import sys
functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)
print("\n\n")
Load the personal library. The prarr and blanks functions are used in this article.
from myfuncs import *
blanks(2)
Import other relevant Python libraries.
import os
import pandas as pd
import numpy as np
blanks(2)
Set and migrate to the new working directory.
owd = os.getcwd()
nwd = "c:/data/russell/potus"
os.chdir(nwd)
print(os.getcwd())
blanks(2)
Establish a list of URLs for the CSV files to be downloaded from the FTSE Russell website. The two files are year-to-date (ytd) and history (hist) of the daily Russell 3000 index levels.
urllst = [
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valuesytd_US3000.csv",
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valueshist_US3000.csv"
]
blanks(2)
Load the ytd file into a pandas dataframe. Do a little munging.
ytd = pd.read_csv(urllst[0])
ytd.columns = ["portfolio","date","levwodiv","levwdiv"]
ytd['portfolio'] = "r3000"
ytd['date'] = pd.DatetimeIndex(ytd.date)
ytd = ytd.sort_values(['date']).iloc[1:,]
ytd.reset_index(drop=True,inplace=True)
prarr(ytd)
blanks(2)
Ditto for hist.
hist = pd.read_csv(urllst[1])
hist.columns = ["portfolio","date","levwodiv","levwdiv"]
hist['portfolio'] = "r3000"
hist['date'] = pd.DatetimeIndex(hist.date)
hist.sort_values(['date'],inplace=True)
hist.reset_index(drop=True,inplace=True)
prarr(hist)
blanks(2)
Next, vertically “stack” hist and ytd into a “combine” dataframe.
combine = pd.concat([ytd,hist])
combine.sort_values(['date'],inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)
Delete duplicate level records that emerge from holidays.
combine.drop_duplicates(subset=['levwodiv','levwdiv'], inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)
Compute percent change attributes for both level with dividends and level without dividends.
combine['pctwodiv'] = combine.levwodiv.pct_change()
combine['pctwdiv'] = combine.levwdiv.pct_change()
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)
Write a csv file from the final dataframe. Calculate the growth of $1 from the inception of the data.
out = "r3000pd.csv"
combine.to_csv(out,index=None, sep=',')
print(round((1+combine.pctwodiv).prod(),2))
print(round((1+combine.pctwdiv).prod(),2))
blanks(2)
Load the rpy2 (R within Python) IPython extension.
%load_ext rpy2.ipython
blanks(2)
Import pertinent rpy2 libraries.
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
import rpy2.robjects.pandas2ri
robjects.pandas2ri.activate()
blanks(2)
Create a version of the Pandas combine dataframe suitable for R processing.
r3000 = robjects.pandas2ri.py2ri(combine[['date','pctwdiv']])
blanks(2)
Take a peek at the R data.frame.
%R -i r3000 tail(r3000)
Load relevant R libraries.
%R require(tidyverse); require(data.table); require(RColorBrewer); require(R.utils); require(lubridate)
Create a cell of R processing. Push in the r3000 dataframe and commence R wrangling and graphics processing. Display the final chart.
%%R -w700 -h700 -i r3000
r3000 <- data.table(r3000)
mdte <- max(r3000[['date']])
dte <- substr(mdte,6,11)
tdates <- lubridate::date(paste(c("2019-","2011-","2015-"),dte,sep=""))
fdates <- lubridate::date(c("2017-01-20","2009-01-20","2013-01-21"))
#############################################
# function to build the data.table for ggplot.
#############################################
nmkreturn <- function(to,from,dt)
{
rbind(
data.table(potus='Trump',
date=dt[date>=from[1] & date<=to[1]]$date,
returnpct=dt[date>=from[1] & date<=to[1],cumprod(1+pctwdiv)]
),
data.table(potus='Obama 1',
date=dt[date>=from[2] & date<=to[2]]$date,
returnpct=dt[date>=from[2] & date<=to[2],cumprod(1+pctwdiv)]
),
data.table(potus='Obama 2',
date=dt[date>=from[3] & date<=to[3]]$date,
returnpct=dt[date>=from[3] & date<=to[3],cumprod(1+pctwdiv)]
)
)
}
#########################################
# nwork is the final graphics data.table.
#########################################
work <- nmkreturn(tdates,fdates,r3000)
nwork <- data.table(left_join(work,work[,.(legend=paste(potus,date[1], date[.N], "\n",sep="\n")),.(potus)],by=c("potus"="potus")))
X <- nwork[,.(date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]$V1
Y <- nwork[,.(date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]$returnpct
#################################
# save data to an RData file.
#################################
ofile = "r3000.RData"
save(r3000,X,work,nwork,file=ofile)
###########################################
# set parm vars and execute the ggplot code.
###########################################
titstr <- paste("Russell 3000 Returns", " thru ", mdte,sep="")
stitstr <- "Trump vs Obama\n"
xstr <- "\nYear"
ystr <- "Growth %\n"
cstr <- "Administration\n"
pal <- brewer.pal(9,"Blues")
g <- ggplot(nwork,aes(x=date,y=100*(returnpct-1), col=legend)) +
geom_line(size=.7) +
theme(legend.position = "right", plot.background = element_rect(fill = pal[2]),
panel.background = element_rect(fill = pal[2])) +
theme(axis.text.x = element_text(size=10, angle = 45)) +
labs(title=titstr,subtitle=stitstr,x=xstr,y=ystr,col=cstr) +
annotate("text", x = X+100*24*60*60, y = Y, label = Y, size=4)
g
Twenty-eight-month stock market performance is solid for 45. Comparable-period performance was even better during each administration of 44.