Which Conference is Best? College Hoops, Net Rankings and Python

For college basketball junkies like me, the season is now shifting into high gear as teams begin serious conference play. At the end of the regular season and conference tournaments, 68 D1 teams (32 league champions and 36 at-large selections) will receive invitations to March’s national championship tournament.

A year ago, I posted a blog on evaluating the individual sport performance of college athletic conferences. Such an assessment is critical to help determine which at large teams should be invited to championship tournaments.

This year, the NCAA has introduced a new ranking system called NET to inform team rankings. It replaces the RPI, which I wrote about earlier. NET “relies on game results, strength of schedule, game location, scoring margin, net offensive and defensive efficiency, and the quality of wins and losses.”

The remainder of this blog uses NET rankings to examine current 2018-2019 D1 conference performance in men’s basketball. For the analysis that follows, the technology deployed is JupyterLab 0.32.1, Anaconda Python 3.6.5, and Microsoft R Open R-3.4.4. Of special interest is the demonstration of “RMagic”, which lets R and its ggplot package run within the Python notebook. My take is that R/Python interoperability is a godsend for analytics programmers.

The data are first scraped from the NET website using the Python requests library, then “liberated” from HTML using BeautifulSoup functionality. The resulting lists are subsequently wrangled using core Python, numpy, and Pandas. In the end, Pandas dataframes are passed to R for a ggplot graph.

Include a personal directory in the Python library search path.

In [1]:
import sys
functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)
print("\n\n")

Now import the personal functions.

In [2]:
import myfuncs as my
my.blanks(2)

frequenciesdf and metadf are the important ones.

In [3]:
help(my)
my.blanks(2)
Help on module myfuncs:
NAME
    myfuncs
FUNCTIONS
    blanks lambda n
    
    frequenciesdf(df, fvar)
        (df - pandas dataframe; fvar - list of dataframe columns)
    
    metadf(df)
        (df - pandas dataframe)
    
    tally lambda df
FILE
    c:\data\jupyter\notebooks\functions\myfuncs.py
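The source of myfuncs is not shown in this post, so the following is only a hypothetical sketch of what the helper functions might look like, inferred from the help text above and from the output they produce later on. The function bodies are assumptions, not the actual module.

```python
import pandas as pd

def blanks(n):
    """Print n blank lines (assumed behavior of my.blanks)."""
    print("\n" * n, end="")

def metadf(df):
    """Print the dataframe's structure, then its first five rows (assumed)."""
    df.info()
    print()
    print(df.head())

def frequenciesdf(df, fvar):
    """Count rows for each combination of the columns in fvar, most frequent first."""
    freq = (df.groupby(fvar).size()
              .reset_index(name="frequency")
              .sort_values("frequency", ascending=False)
              .reset_index(drop=True))
    # Express each count as a percentage of all rows.
    freq["percent"] = 100 * freq["frequency"] / freq["frequency"].sum()
    return freq

# Small usage example with made-up rows.
demo = pd.DataFrame({"conference": ["ACC", "ACC", "SEC"]})
demo_freq = frequenciesdf(demo, ["conference"])
print(demo_freq)
```

The order of tied conferences in this sketch may differ from the order the real frequenciesdf produces.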


Next import pertinent Python libraries.

In [4]:
import warnings, os, numpy as np, pandas as pd, time, datetime, re
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
from scipy import stats
warnings.filterwarnings('ignore')
my.blanks(2)

Load the rpy2 IPython extension, which provides RMagic.

In [5]:
%load_ext rpy2.ipython
my.blanks(2)

Finally, import other rpy2 libraries and activate the Pandas to R dataframe copy capability.

In [6]:
import rpy2
import rpy2.robjects.numpy2ri
import rpy2.robjects.pandas2ri   # import the submodule explicitly before activating it
import rpy2.robjects as robjects
robjects.pandas2ri.activate()
my.blanks(2)

Define functions to handle HTTP get requests. The code below was shamelessly “borrowed”. My gratitude to the author.

In [7]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # Use .get() so a missing Content-Type header yields False, not a KeyError.
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)

def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    
my.blanks(2)

Read the NET raw HTML and “soupify” to text.

In [8]:
url = "https://www.ncaa.com/rankings/basketball-men/d1/ncaa-mens-basketball-net-rankings"
raw_html = simple_get(url)
soup = BeautifulSoup(raw_html, 'html5lib')
my.blanks(2)

Examine the soup text to identify “tags” that enclose data of interest. In this case the lowest level relevant tags are ‘figure’, ‘th’, and ‘td’.

In [9]:
print(str(soup)[40000:44000])
my.blanks(2)
uild_id" type="hidden" value="form-fkbx76zd7khbehvRqG6UcrqhfEUIWpKdbBGAlO-3toE"/>
<input data-drupal-selector="edit-rankings-menu-form" name="form_id" type="hidden" value="rankings_menu_form"/>
</form>
<figure class="rankings-last-updated">
    Through Games JAN. 10, 2019
</figure>
<div class="layout--has-sidebar">
    <article class="rankings-content overflowable-table-region layout--content-left">
        <table class="sticky">
            <thead>
                <tr>
                                    <th>Rank</th>
                                    <th>Previous</th>
                                    <th>School</th>
                                    <th>Conference</th>
                                    <th>Record</th>
                                    <th>Road</th>
                                    <th>Neutral</th>
                                    <th>Home</th>
                                    <th>Non Div I</th>
                                </tr>
            </thead>
            <tbody>
                            <tr>
                                    <td>1</td>
                                    <td>1</td>
                                    <td>Virginia</td>
                                    <td>ACC</td>
                                    <td>14-0</td>
                                    <td>3-0</td>
                                    <td>3-0</td>
                                    <td>8-0</td>
                                    <td>0-0</td>
                                </tr>
                            <tr>
                                    <td>2</td>
                                    <td>2</td>
                                    <td>Duke</td>
                                    <td>ACC</td>
                                    <td>13-1</td>
                                    <td>1-0</td>
                                    <td>4-1</td>
                                    <td>8-0</td>
                                    <td>0-0</td>
                                </tr>
                            <tr>
                                    <td>3</td>
                                    <td>3</td>
                                    <td>Michigan</td>
                                    <td>Big Ten</td>
                                    <td>16-0</td>
                                    <td>3-0</td>
                                    <td>2-0</td>
                                    <td>11-0</td>
                                    <td>0-0</td>
                                </tr>
                            <tr>
                                    <td>4</td>
                                    <td>4</td>
                                    <td>Texas Tech</td>
                                    <td>Big 12</td>
                                    <td>14-1</td>
                                    <td>1-0</td>
                                    <td>3-1</td>
                                    <td>10-0</td>
                                    <td>0-0</td>
                                </tr>
                            <tr>
                                    <td>5</td>
                                    <td>5</td>
                                    <td>Tennessee</td>
                                    <td>SEC</td>
                                    <td>13-1</td>
                                    <td>2-0</td>
                                    <td>2-1</td>
                                    <td>8-0</td>
                                    <td>1-0</td>
                                </tr>
                            <tr>
                                    <td>6</td>
                                    <td>6</td>
                                    <td>Gonzaga</td>
                                    <td>WCC</td>
                                    <td>15-2</td>
                                    <td>1-1</td>
                                    <td>3-1</td>
                  

Assemble the date, variable name, and the table data from the “souped” query results into “clean” Python lists.

In [10]:
dte = [s.text for s in soup.select('figure')]
dte = [h.replace('\n', '').replace('.',"").lstrip() for h in dte][0]
print(dte)
my.blanks(2)
vars = [s.text for s in soup.select('th')]
vars = [v.replace(' ', '').lower() for v in vars]
lvars = len(vars)
print(vars)
my.blanks(2)
dta = [s.text for s in soup.select('td')]
ldta = len(dta)
print(dta[0:18])
my.blanks(2)
Through Games JAN 10, 2019

['rank', 'previous', 'school', 'conference', 'record', 'road', 'neutral', 'home', 'nondivi']

['1', '1', 'Virginia', 'ACC', '14-0', '3-0', '3-0', '8-0', '0-0', '2', '2', 'Duke', 'ACC', '13-1', '1-0', '4-1', '8-0', '0-0']
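To make the tag selection concrete, here is a minimal, self-contained sketch that runs the same kind of extraction on a small inline HTML fragment modeled on the page excerpt above. The fragment is abridged to three columns, and the built-in "html.parser" is used instead of html5lib only to avoid an extra dependency. The names html, stamp, headers, cells, and rows are illustrative.

```python
from bs4 import BeautifulSoup

# Abridged fragment modeled on the NET page structure shown earlier.
html = """
<figure class="rankings-last-updated">
    Through Games JAN. 10, 2019
</figure>
<table>
  <thead><tr><th>Rank</th><th>School</th><th>Non Div I</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Virginia</td><td>0-0</td></tr>
    <tr><td>2</td><td>Duke</td><td>0-0</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The 'figure' tag holds the "through games" date stamp.
stamp = soup.select("figure")[0].get_text().replace(".", "").strip()

# 'th' tags become lower-case variable names with spaces removed.
headers = [th.get_text().replace(" ", "").lower() for th in soup.select("th")]

# 'td' tags arrive as one flat list; slice it into rows of len(headers).
cells = [td.get_text() for td in soup.select("td")]
rows = [cells[i:i + len(headers)] for i in range(0, len(cells), len(headers))]

print(stamp)
print(headers)
print(rows)
```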

Reshape the scraped table data into a numpy array.

In [19]:
ndta = np.array(dta).reshape(int((ldta/lvars)),lvars)
my.blanks(2)
print(ndta)
my.blanks(2)
[['1' '1' 'Virginia' ... '3-0' '8-0' '0-0']
 ['2' '2' 'Duke' ... '4-1' '8-0' '0-0']
 ['3' '3' 'Michigan' ... '2-0' '11-0' '0-0']
 ...
 ['351' '351' 'South Carolina St.' ... '0-0' '0-2' '2-0']
 ['352' '352' 'UMES' ... '0-0' '0-3' '2-0']
 ['353' '353' 'UNC Asheville' ... '0-2' '0-3' '2-1']]
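The reshape step only works when the number of scraped cells is an exact multiple of the number of columns. A minimal sketch of that idea on a shortened list follows, with an explicit check added as a precaution. The names cells and ncols are illustrative, and passing -1 lets numpy infer the row count.

```python
import numpy as np

# A flat list of cells, three columns per team (shortened for illustration).
cells = ['1', 'Virginia', 'ACC', '2', 'Duke', 'ACC']
ncols = 3

# Guard against a partial row, which would make reshape fail.
if len(cells) % ncols != 0:
    raise ValueError("cell count is not a multiple of the column count")

# -1 asks numpy to work out the number of rows from the data.
table = np.array(cells).reshape(-1, ncols)
print(table)
```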

Copy that array into a Pandas dataframe and change the data types of two resulting attributes.

In [12]:
df = pd.DataFrame(ndta,columns=vars)
df['rank'] = df['rank'].astype('int16')
df['previous'] = df['previous'].astype('int16')
my.metadf(df)
my.blanks(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 9 columns):
rank          353 non-null int16
previous      353 non-null int16
school        353 non-null object
conference    353 non-null object
record        353 non-null object
road          353 non-null object
neutral       353 non-null object
home          353 non-null object
nondivi       353 non-null object
dtypes: int16(2), object(7)
memory usage: 20.8+ KB
None

   rank  previous      school conference record road neutral  home nondivi
0     1         1    Virginia        ACC   14-0  3-0     3-0   8-0     0-0
1     2         2        Duke        ACC   13-1  1-0     4-1   8-0     0-0
2     3         3    Michigan    Big Ten   16-0  3-0     2-0  11-0     0-0
3     4         4  Texas Tech     Big 12   14-1  1-0     3-1  10-0     0-0
4     5         5   Tennessee        SEC   13-1  2-0     2-1   8-0     1-0

From the resulting dataframe, tally the frequencies of “conference” in descending frequency order.

In [13]:
my.frequenciesdf(df,['conference'])
Out[13]:
       conference  frequency   percent
 0            ACC         15  4.249292
 1            SEC         14  3.966006
 2          C-USA         14  3.966006
 3        Big Ten         14  3.966006
 4    Atlantic 10         14  3.966006
 5      Southland         13  3.682720
 6       Sun Belt         12  3.399433
 7         Pac-12         12  3.399433
 8            OVC         12  3.399433
 9           MEAC         12  3.399433
10            MAC         12  3.399433
11            AAC         12  3.399433
12            MWC         11  3.116147
13           MAAC         11  3.116147
14      Big South         11  3.116147
15        Big Sky         11  3.116147
16            WCC         10  2.832861
17          SoCon         10  2.832861
18           SWAC         10  2.832861
19        Patriot         10  2.832861
20            NEC         10  2.832861
21            MVC         10  2.832861
22        Horizon         10  2.832861
23            CAA         10  2.832861
24       Big East         10  2.832861
25         Big 12         10  2.832861
26            WAC          9  2.549575
27  Summit League          9  2.549575
28       Big West          9  2.549575
29   America East          9  2.549575
30           ASUN          9  2.549575
31     Ivy League          8  2.266289

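Define a function that produces a score to rank conference performance; lower is better. The score counts the minimum, 25th percentile, and median of a conference’s team ranks twice and the 75th percentile once, so a conference’s stronger teams carry more weight than its weaker ones. Use the Pandas “groupby-apply” split-apply-combine functionality.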

In [14]:
def qfnct(x):
    N = len(x)
    q = sum(np.percentile(x,[0,25,50])) + sum(np.percentile(x,[0,25,50,75]))
    y = pd.DataFrame({'N':N,'rankq':q},index=[0],dtype='int32')
    return y
my.blanks(2)
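As a worked check of the score, the sketch below applies the same percentile sum to the fourteen Big Ten ranks listed at the end of this post. The helper name conference_score is illustrative, and the truncation to an integer is done explicitly with int() rather than through the dataframe dtype.

```python
import numpy as np

def conference_score(ranks):
    """Sum percentiles 0/25/50 twice and percentile 75 once (same sum as qfnct)."""
    return (sum(np.percentile(ranks, [0, 25, 50]))
            + sum(np.percentile(ranks, [0, 25, 50, 75])))

# Big Ten NET ranks through January 10, 2019, as listed in this post.
big_ten = [3, 7, 14, 16, 18, 24, 25, 35, 36, 56, 61, 74, 106, 117]

# Percentiles are 3, 16.5, 30 and 59.75, so the score is
# 2 * (3 + 16.5 + 30) + 59.75 = 158.75.
score = conference_score(big_ten)
print(score, int(score))
```

The truncated value, 158, matches the Big Ten rankq in the conference table that follows.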

Compute the conference rankings and then join them to the team dataframe. Build another dataframe to pass parameters to the next-step R ggplot visualizations. Make R versions of these dataframes.

In [15]:
dfconf = df.groupby(['conference'])['rank'].apply(qfnct)
dfconf.sort_values(['rankq'], ascending=True, inplace=True)
dfconf.reset_index(inplace=True)
dfconf.drop(['level_1'], axis=1, inplace=True)
print(my.metadf(dfconf))
dffinal = df.merge(dfconf, left_on='conference', right_on='conference', how='inner')
dffinal.sort_values(['rankq','rank'], ascending=True, inplace=True)
dffinal.reset_index(inplace=True)
dffinal.drop(['index'], axis=1, inplace=True)
print(my.metadf(dffinal))

parms = pd.DataFrame({'title':["Men's Division I Basketball NET Rankings"],
                               'stitle':[dte],'xlab':["\nConference"],'ylab':["NET Rank\n"]})
print(my.metadf(parms))

rhoops = robjects.pandas2ri.py2ri(dffinal)
rparms = robjects.pandas2ri.py2ri(parms)
my.blanks(2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 3 columns):
conference    32 non-null object
N             32 non-null int32
rankq         32 non-null int32
dtypes: int32(2), object(1)
memory usage: 592.0+ bytes
None

  conference   N  rankq
0    Big Ten  14    158
1     Big 12  10    190
2        ACC  15    200
3        SEC  14    221
4   Big East  10    295
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 353 entries, 0 to 352
Data columns (total 11 columns):
rank          353 non-null int16
previous      353 non-null int16
school        353 non-null object
conference    353 non-null object
record        353 non-null object
road          353 non-null object
neutral       353 non-null object
home          353 non-null object
nondivi       353 non-null object
N             353 non-null int32
rankq         353 non-null int32
dtypes: int16(2), int32(2), object(7)
memory usage: 23.5+ KB
None

   rank  previous        school conference record road neutral  home nondivi  \
0     3         3      Michigan    Big Ten   16-0  3-0     2-0  11-0     0-0   
1     7         7  Michigan St.    Big Ten   14-2  3-1     2-1   9-0     0-0   
2    14        13      Nebraska    Big Ten   12-4  1-3     2-1   8-0     1-0   
3    16        15     Wisconsin    Big Ten   11-4  3-2     2-1   6-1     0-0   
4    18        18       Indiana    Big Ten   12-3  1-3     1-0  10-0     0-0   
    N  rankq  
0  14    158  
1  14    158  
2  14    158  
3  14    158  
4  14    158  
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 4 columns):
title     1 non-null object
stitle    1 non-null object
xlab      1 non-null object
ylab      1 non-null object
dtypes: object(4)
memory usage: 112.0+ bytes
None

                                      title                      stitle  \
0  Men's Division I Basketball NET Rankings  Through Games JAN 10, 2019   
           xlab        ylab  
0  \nConference  NET Rank\n  
None
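The split-apply-combine and join steps can also be seen end to end on a toy dataframe. The sketch below uses made-up schools and ranks, and it swaps groupby-apply for named aggregation with agg, which is an alternative and not the approach used in the notebook. The names toy, rank_score, conf, and merged are illustrative.

```python
import numpy as np
import pandas as pd

def rank_score(ranks):
    """Percentiles 0/25/50 counted twice plus percentile 75, truncated to int."""
    q = np.percentile(ranks, [0, 25, 50, 75])
    return int(2 * q[:3].sum() + q[3])

# Made-up teams in two hypothetical conferences.
toy = pd.DataFrame({
    "school": ["A", "B", "C", "D", "E", "F"],
    "conference": ["X", "X", "X", "Y", "Y", "Y"],
    "rank": [1, 4, 9, 2, 3, 20],
})

# Split by conference, compute team count and score, combine into one row each.
conf = (toy.groupby("conference")["rank"]
           .agg(N="size", rankq=rank_score)
           .reset_index()
           .sort_values("rankq")
           .reset_index(drop=True))

# Attach the conference-level columns to every team, best conference first.
merged = (toy.merge(conf, on="conference", how="inner")
             .sort_values(["rankq", "rank"])
             .reset_index(drop=True))

print(conf)
print(merged)
```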

Use RMagic to load pertinent R packages.

In [16]:
%R require(tidyverse); require(data.table); require(RColorBrewer); require(R.utils)
Out[16]:
array([1], dtype=int32)

Pass the R dataframes to ggplot for graphing. At this time, the Big Ten, Big 12, ACC, and SEC are my top conferences.

In [17]:
%%R  -w700 -h700 -i rhoops -i rparms
options(scipen=10)
pal <- brewer.pal(9,"Blues")
howmany <- 12
rhoops <- data.table(rhoops)
rhoops[,conference := fct_reorder(conference, rankq)]
lab <- rhoops[,.(rankq=rankq[1]),by=conference][order(rankq)]

g <- ggplot(rhoops[conference %in% levels(rhoops$conference)[1:howmany]], 
            aes(x=conference,y=rank,fill=conference)) +
    geom_violin(trim = FALSE,draw_quantiles = c(0.25,0.5,0.75)) +
    theme(legend.position = "none", plot.background = element_rect(fill = pal[2]), 
    panel.background = element_rect(fill = pal[2])) +
    geom_hline(aes(yintercept=median(rank, na.rm=T)),   
    color="black", linetype="dashed", size=.25) +
    theme(axis.text.x = element_text(size=7, angle = 45, hjust = 1)) +
    theme(strip.text.x = element_text(size = 5)) +
    geom_jitter(height = 1, width = .1, size=1) +
    scale_x_discrete(labels=with(lab,paste(as.character(conference),round(rankq),sep=" -- ")[1:howmany])) +
    labs(title=rparms$title,
         subtitle=rparms$stitle, y=rparms$ylab, x=rparms$xlab)   
print(g)

Finally, list rankings of schools in the top three conferences.

In [18]:
cols = dffinal.columns[0:5]
print(dffinal.loc[dffinal.conference=="Big Ten",cols])
my.blanks(2)
print(dffinal.loc[dffinal.conference=="Big 12",cols])
my.blanks(2)
print(dffinal.loc[dffinal.conference=="ACC",cols])
my.blanks(2)
    rank  previous        school conference record
0      3         3      Michigan    Big Ten   16-0
1      7         7  Michigan St.    Big Ten   14-2
2     14        13      Nebraska    Big Ten   12-4
3     16        15     Wisconsin    Big Ten   11-4
4     18        18       Indiana    Big Ten   12-3
5     24        25      Maryland    Big Ten   13-3
6     25        24        Purdue    Big Ten    9-6
7     35        34          Iowa    Big Ten   13-3
8     36        35      Ohio St.    Big Ten   12-3
9     56        56     Minnesota    Big Ten   12-3
10    61        60  Northwestern    Big Ten   10-6
11    74        77      Penn St.    Big Ten    7-9
12   106       110       Rutgers    Big Ten    8-6
13   117       124      Illinois    Big Ten   4-12

    rank  previous         school conference record
14     4         4     Texas Tech     Big 12   14-1
15    11        12         Kansas     Big 12   13-2
16    15        16       Oklahoma     Big 12   12-3
17    19        19       Iowa St.     Big 12   12-3
18    34        32            TCU     Big 12   12-2
19    49        47          Texas     Big 12   10-5
20    53        51     Kansas St.     Big 12   11-4
21    72        66   Oklahoma St.     Big 12    7-8
22    73        75         Baylor     Big 12    9-5
23    87        88  West Virginia     Big 12    8-7

    rank  previous          school conference record
24     1         1        Virginia        ACC   14-0
25     2         2            Duke        ACC   13-1
26     8         8  North Carolina        ACC   12-3
27    10        10   Virginia Tech        ACC   14-1
28    17        17        NC State        ACC   13-2
29    21        21     Florida St.        ACC   13-2
30    32        30      Louisville        ACC   10-5
31    46        46        Syracuse        ACC   11-4
32    59        58         Clemson        ACC   10-5
33    69        68      Pittsburgh        ACC   11-4
34    78        84      Notre Dame        ACC   10-5
35    81        86      Miami (FL)        ACC    8-7
36   102       101    Georgia Tech        ACC    9-6
37   148       150  Boston College        ACC    9-5
38   199       195     Wake Forest        ACC    7-7

That’s it for now. Happy New Year to ODSC readers!

Steve Miller

Steve is a data scientist focusing on research methods, data integration, statistics/ML, and computational analytics in R/python. Follow him on LinkedIn: https://www.linkedin.com/in/steve-miller-58ab881/
