

Which Conference is Best? — College Hoops, Net Rankings and Python
posted by Steve Miller, January 18, 2019

For college basketball junkies like me, the season is now shifting into high gear as teams begin serious conference play. At the end of the regular season and conference tournaments, 68 D1 teams (32 league champions and 36 at-large selections) will receive invitations to March’s national championship tournament.
A year ago, I posted a blog on evaluating the performance of college athletic conferences in individual sports. Such an assessment is critical to determining which at-large teams should be invited to championship tournaments.
This year, the NCAA has introduced a new rankings system, NET, to inform team rankings as a successor to the RPI I wrote about earlier. NET “relies on game results, strength of schedule, game location, scoring margin, net offensive and defensive efficiency, and the quality of wins and losses.”
The remainder of this blog uses NET rankings to examine current 2018-2019 D1 conference performance in men’s basketball. The technology deployed for the analysis is JupyterLab 0.32.1, Anaconda Python 3.6.5, and Microsoft R Open R-3.4.4. Of special interest is the demonstration of “RMagic,” which lets R and its ggplot package interoperate within a Python notebook. My take is that this R/Python interoperability is a godsend for analytics programmers.
The data are first scraped from the NET website using the Python requests library, then “liberated” from HTML using BeautifulSoup functionality. The resulting lists are subsequently wrangled using core Python, numpy, and Pandas. In the end, Pandas dataframes are passed to R for a ggplot graph.
Include a personal directory in the Python library search path.
import sys
functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)
print("\n\n")
Now import the personal functions.
import myfuncs as my
my.blanks(2)
frequenciesdf and metadf are the important ones; a sketch of what they might look like follows the next cell.
help(my)
my.blanks(2)
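Since myfuncs is a personal module, its source isn't shown in this post. Below is a minimal sketch of the three helpers used here, reconstructed from how they are called; the implementations are my assumptions, not the author's actual code.
# myfuncs.py -- hypothetical reconstruction of the personal helper module
import pandas as pd

def blanks(n):
    # Print n blank lines to space out notebook output.
    print("\n" * n, end="")

def metadf(df):
    # Report basic metadata for a dataframe: dimensions and column dtypes.
    print(df.shape)
    print(df.dtypes)
    return df.head()

def frequenciesdf(df, cols):
    # Tally frequencies of the given column(s) in descending frequency order.
    freqs = df.groupby(cols).size().reset_index(name='frequency')
    return freqs.sort_values('frequency', ascending=False).reset_index(drop=True)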
Next import pertinent Python libraries.
import warnings, os, numpy as np, pandas as pd, time, datetime, re
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
from scipy import stats
warnings.filterwarnings('ignore')
my.blanks(2)
Load the rpy2 “magic” R extension.
%load_ext rpy2.ipython
my.blanks(2)
Finally, import other rpy2 libraries and activate the Pandas to R dataframe copy capability.
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
import rpy2.robjects.pandas2ri
robjects.pandas2ri.activate()
my.blanks(2)
Define functions to handle HTTP get requests. The code below was shamelessly “borrowed”. My gratitude to the author.
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return
    the text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    It is always a good idea to log errors.
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
my.blanks(2)
Read the raw NET HTML and “soupify” it to text.
url = "https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings"
raw_html = simple_get(url)
soup = BeautifulSoup(raw_html, 'html5lib')
my.blanks(2)
Examine the soup text to identify the “tags” that enclose the data of interest. In this case, the lowest-level relevant tags are ‘figure’, ‘th’, and ‘td’.
print(str(soup)[40000:44000])
my.blanks(2)
Assemble the date, the variable names, and the table data from the “souped” query results into “clean” Python lists.
dte = [s.text for s in soup.select('figure')]
dte = [h.replace('\n', '').replace('.',"").lstrip() for h in dte][0]
print(dte)
my.blanks(2)
vars = [s.text for s in soup.select('th')]
vars = [v.replace(' ', '').lower() for v in vars]
lvars = len(vars)
print(vars)
my.blanks(2)
dta = [s.text for s in soup.select('td')]
ldta = len(dta)
print(dta[0:18])
my.blanks(2)
Reshape the scraped table data into a numpy array.
ndta = np.array(dta).reshape(ldta // lvars, lvars)
my.blanks(2)
print(ndta)
my.blanks(2)
Copy that array into a Pandas dataframe and change the data types of two resulting attributes.
df = pd.DataFrame(ndta,columns=vars)
df['rank'] = df['rank'].astype('int16')
df['previous'] = df['previous'].astype('int16')
my.metadf(df)
my.blanks(2)
From the resulting dataframe, tally the frequencies of “conference” in descending frequency order.
my.frequenciesdf(df,['conference'])
Define a function that produces a score to rank conference performance (lower is better), using the Pandas “groupby-apply” split-apply-combine functionality; a toy example follows the definition.
def qfnct(x):
    # The score sums the 0th, 25th, and 50th percentiles of a conference's
    # NET ranks twice and the 75th percentile once, weighting the top of the
    # rank distribution. Lower scores indicate stronger conferences.
    N = len(x)
    q = sum(np.percentile(x, [0, 25, 50])) + sum(np.percentile(x, [0, 25, 50, 75]))
    y = pd.DataFrame({'N': N, 'rankq': q}, index=[0], dtype='int32')
    return y
my.blanks(2)
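To see how the scoring behaves, here is a toy example with made-up conferences and ranks: a conference whose teams cluster near the top of the rankings earns a markedly lower, i.e., better, score.
# Hypothetical data: conference A's teams rank higher than conference B's.
toy = pd.DataFrame({'conference': ['A'] * 4 + ['B'] * 4,
                    'rank': [1, 5, 9, 40, 30, 50, 70, 90]})
print(toy.groupby('conference')['rank'].apply(qfnct))
# A's rankq works out to 40 versus 345 for B, so A ranks first.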
Compute the conference rankings and join them to the team dataframe. Build another dataframe to pass parameters to the R ggplot visualization in the next step. Then make R versions of both dataframes.
dfconf = df.groupby(['conference'])['rank'].apply(qfnct)
dfconf.sort_values(['rankq'], ascending=True, inplace=True)
dfconf.reset_index(inplace=True)
dfconf.drop(['level_1'], axis=1, inplace=True)
print(my.metadf(dfconf))
dffinal = df.merge(dfconf, left_on='conference', right_on='conference', how='inner')
dffinal.sort_values(['rankq','rank'], ascending=True, inplace=True)
dffinal.reset_index(inplace=True)
dffinal.drop(['index'], axis=1, inplace=True)
print(my.metadf(dffinal))
parms = pd.DataFrame({'title':["Men's Division I Basketball NET Rankings"],
'stitle':[dte],'xlab':["\nConference"],'ylab':["NET Rank\n"]})
print(my.metadf(parms))
rhoops = robjects.pandas2ri.py2ri(dffinal)
rparms = robjects.pandas2ri.py2ri(parms)
my.blanks(2)
Use RMagic to load pertinent R packages.
%R require(tidyverse); require(data.table); require(RColorBrewer); require(R.utils)
Pass the R dataframes to ggplot for graphing. At this time, the Big Ten, Big 12, ACC, and SEC are my top conferences.
%%R -w700 -h700 -i rhoops -i rparms

options(scipen=10)
pal <- brewer.pal(9, "Blues")
howmany <- 12

rhoops <- data.table(rhoops)
rhoops[, conference := fct_reorder(conference, rankq)]
lab <- rhoops[, .(rankq = rankq[1]), by = conference][order(rankq)]

g <- ggplot(rhoops[conference %in% levels(rhoops$conference)[1:howmany]],
            aes(x = conference, y = rank, fill = conference)) +
    geom_violin(trim = FALSE, draw_quantiles = c(0.25, 0.5, 0.75)) +
    theme(legend.position = "none",
          plot.background = element_rect(fill = pal[2]),
          panel.background = element_rect(fill = pal[2])) +
    geom_hline(aes(yintercept = median(rank, na.rm = TRUE)),
               color = "black", linetype = "dashed", size = .25) +
    theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1)) +
    theme(strip.text.x = element_text(size = 5)) +
    geom_jitter(height = 1, width = .1, size = 1) +
    scale_x_discrete(labels = with(lab, paste(as.character(conference), round(rankq), sep = " -- ")[1:howmany])) +
    labs(title = rparms$title, subtitle = rparms$stitle,
         y = rparms$ylab, x = rparms$xlab)

print(g)
Finally, list the rankings of schools in the top three conferences.
cols = dffinal.columns[0:5]
print(dffinal.loc[dffinal.conference=="Big Ten",cols])
my.blanks(2)
print(dffinal.loc[dffinal.conference=="Big 12",cols])
my.blanks(2)
print(dffinal.loc[dffinal.conference=="ACC",cols])
my.blanks(2)
That’s it for now. Happy New Year to ODSC readers!