

An Introduction to Object Oriented Data Science in Python
Python | Tools & Languages. Posted by megan@odsc.com, August 31, 2016

A lot of focus in the data science community is on reducing the complexity and time involved in data gathering, cleaning, and organization. This article discusses how object oriented design techniques from software engineering can be used to reduce coding overhead and create robust, reusable data acquisition and cleaning systems. I’ll provide an overview of object oriented data science design and walk through an example of using these techniques for getting and cleaning data from a web API in Python. You can find the Jupyter Notebook for this post on GitHub.
[Related article: Implementing a Kernel Principal Component Analysis in Python]
Object Oriented Data Science Design
Much of modern software engineering leverages the principles of Object Oriented Design (OOD), also called object oriented programming (OOP), to create codebases that are easy to scale, test, and maintain. As the name suggests, this programming paradigm is centered on thinking of code in terms of objects. An object encapsulates data, attributes, and methods relating to a specific entity. An object is defined using a class, which can then be instantiated to create multiple objects, referred to as instances of the class.
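As a quick illustration, here is a toy class (the names here are invented for this example, not part of the RIDB code below) with two instances:

class Trail:
    """Encapsulates the data and behavior of a single hiking trail."""

    def __init__(self, name, miles):
        # attributes: data belonging to this particular instance
        self.name = name
        self.miles = miles

    def describe(self):
        # a method: behavior that operates on the instance's data
        return "{} is {} miles long".format(self.name, self.miles)

# instantiating the class twice gives two independent objects
ridge = Trail('Ridge Loop', 4.2)
creek = Trail('Creek Path', 1.8)
print(ridge.describe())  # Ridge Loop is 4.2 miles long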
We’ll be working through some examples of class design using the Recreation Information Database (RIDB) API, a JSON REST API for finding recreation opportunities in the US.
If you are unfamiliar with object oriented design in Python, the Python Tutorial has some great information for getting started. As we walk through the code examples I’ll keep things at a high level; if you are interested in the specifics of Python class design (for example, “what is this ‘self’ parameter?”), check out the tutorial for more information.
Now that we’ve covered the basic idea of an object, let’s consider creating an object to encapsulate a data source, recalling that an object has methods, attributes, and data. To start identifying potential methods of our data source class, consider the typical activities involved in interacting with a data source. We likely want to extract data from it, perhaps via a database or API. Once we acquire data, we probably want to clean and format it to be consumed by other activities: visualization, analysis, or feature development. Looking at this from an object oriented point of view, the methods ‘extract’ and ‘clean’ could cover these activities. Attributes of a data source object would include a construct for holding the data we have extracted and cleaned, plus some identifying information such as the request URL and the parameters we pass to it. The data for our data source object would be the values for the URL, the parameters, and a DataFrame containing the extracted and cleaned data.

Below is a UML diagram of our class, named RidbData. UML diagrams provide a visual representation of class structure and relationships. The top section shows the class name, the middle section contains attribute names, and the bottom section contains method names. Typically methods would be accompanied by their respective parameters, but these are omitted for brevity here.
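Since the diagram may not render here, a text sketch of it looks like this:

+-------------------+
|     RidbData      |   <- class name
+-------------------+
| name              |
| endpoint          |   <- attributes
| url_params        |
| df                |
+-------------------+
| __init__()        |
| extract()         |   <- methods
| clean()           |
+-------------------+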
Let’s take a look at some Python code for creating this data source object. We’ll walk through each part of this code below:
import pandas as pd
import requests
import json
from pandas.io.json import json_normalize  # in pandas >= 1.0 this lives at pd.json_normalize
import config
import numpy as np

class RidbData():
    def __init__(self, name, endpoint, url_params):
        self.df = pd.DataFrame()
        self.endpoint = endpoint
        self.url_params = url_params
        self.name = name

    def clean(self):
        # by replacing '' with np.nan we can use dropna to remove rows missing
        # required data, like lat/longs
        self.df = self.df.replace('', np.nan)
        # normalize column names for lat and long, which arrive as e.g.
        # FacilityLatitude or RecAreaLatitude depending on the endpoint
        # (pandas >= 2.0 needs regex=True passed to str.replace for patterns)
        self.df.columns = self.df.columns.str.replace('.*Latitude', 'Latitude')
        self.df.columns = self.df.columns.str.replace('.*Longitude', 'Longitude')
        self.df = self.df.dropna(subset=['Latitude', 'Longitude'])

    def extract(self):
        response = requests.get(url=self.endpoint, params=self.url_params)
        data = json.loads(response.text)
        self.df = json_normalize(data['RECDATA'])
To create the object we need a constructor method, __init__. The constructor can be used to set attributes of our object and perform any initialization routines. RIDB has a variety of different endpoints, so when creating a RidbData object we have to specify which endpoint we want to query in the ‘endpoint’ parameter, along with any URL parameters we need to set, such as our RIDB API key, in the ‘url_params’ parameter. The name attribute will help us identify this object later on when we are working with multiple RidbData objects.
In our extract method we’ll query the endpoint and load the JSON response into the DataFrame attribute ‘df’. We have a ‘clean’ method to insert NaN in place of empty strings and drop any entries that don’t have latitude / longitude values, since we aren’t interested in facilities that don’t have a location.
Why Objects?
Let’s take a step back for a minute. At this point you may be wondering why we wrote all that code instead of simply writing a function:
def get_ridb_data(endpoint, url_params):
    response = requests.get(url=endpoint, params=url_params)
    data = json.loads(response.text)
    df = json_normalize(data['RECDATA'])
    df = df.replace('', np.nan)
    df.columns = df.columns.str.replace('.*Latitude', 'Latitude')
    df.columns = df.columns.str.replace('.*Longitude', 'Longitude')
    df = df.dropna(subset=['Latitude', 'Longitude'])
    return df
I’m glad you asked!
If all the RIDB endpoints had the same response and endpoint configuration, a function like this would work fine. The time to consider using object oriented data science techniques is when you find yourself writing a lot of specialized functions and ‘if’ statements to make small tweaks to your code for special cases. For example, when we get data from the facilities endpoint we want to drop any that do not have latitude and longitude, whereas when we query the media endpoint we decide to capture only the image data:
def get_ridb_data(endpoint, url_params):
    response = requests.get(url=endpoint, params=url_params)
    data = json.loads(response.text)
    df = json_normalize(data['RECDATA'])
    df = df.replace('', np.nan)
    df.columns = df.columns.str.replace('.*Latitude', 'Latitude')
    df.columns = df.columns.str.replace('.*Longitude', 'Longitude')
    df = df.dropna(subset=['Latitude', 'Longitude'])
    return df

def get_ridb_facility_media(endpoint, url_params):
    # endpoint = https://ridb.recreation.gov/api/v1/facilities/facilityID/media/
    response = requests.get(url=endpoint, params=url_params)
    data = json.loads(response.text)
    df = json_normalize(data['RECDATA'])
    df = df[df['MediaType'] == 'Image']
    return df
Notice that we have some lines that are the same between the two functions:
response = requests.get(url=endpoint, params=url_params)
data = json.loads(response.text)
df = json_normalize(data['RECDATA'])
Another best practice in programming is the DRY principle: Don’t Repeat Yourself. You may be saying “Pfft! It’s just three lines!” But what if RIDB changes its response record name from ‘RECDATA’ to ‘RECDAT’? Then you have to track down every instance of ‘RECDATA’ in your code and replace it. Also, consider that those three lines make up most of the code for the ‘get_ridb_facility_media’ function.
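As a rough sketch of how the class version centralizes this (an illustrative variation on RidbData, not code from the accompanying notebook), the record key could live in a single class attribute so that a rename becomes a one-line change:

class RidbData():
    RECORD_KEY = 'RECDATA'  # if RIDB renames its record field, update only this line

    def extract(self):
        # uses the requests/json/json_normalize imports from the listing above
        response = requests.get(url=self.endpoint, params=self.url_params)
        data = json.loads(response.text)
        self.df = json_normalize(data[self.RECORD_KEY])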
Let’s look at querying these two endpoints using the RidbData object we created above. For the facilities endpoint, we are pretty much ready to go. Just plug in our endpoint and the object methods will take care of the rest:
facilities = RidbData('facilities_name',
                      'https://ridb.recreation.gov/api/v1/facilities',
                      dict(apiKey='MY_RIDB_API_KEY'))
facilities.extract()
facilities.df.head()
After running extract(), the DataFrame holds the raw response records. Notice that some fields are empty strings; these are what clean() will replace with NaNs.
Run the clean() method and inspect the DataFrame:
facilities.clean()
facilities.df.head()
The empty strings are now NaNs (and rows missing coordinates have been dropped), and the FacilityLatitude column is now named Latitude. Depending on the endpoint, RIDB provides a different prefix for Latitude and Longitude. By cleaning this up to simply Latitude and Longitude we are standardizing the datasets for downstream analysis.
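As a quick standalone check of that renaming pattern (with example column names assumed for illustration):

import pandas as pd

cols = pd.Index(['FacilityLatitude', 'FacilityLongitude', 'FacilityName'])
cols = cols.str.replace('.*Latitude', 'Latitude', regex=True)
cols = cols.str.replace('.*Longitude', 'Longitude', regex=True)
print(list(cols))  # ['Latitude', 'Longitude', 'FacilityName']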
We can then create additional instances of this object to get and clean data for any similar endpoint, such as recreation areas:
recareas = RidbData('recareas_name',
                    'https://ridb.recreation.gov/api/v1/recareas',
                    dict(apiKey='MY_RIDB_API_KEY'))
recareas.extract()
recareas.clean()
recareas.df.head()
Taking a look at the recareas data above, we can confirm the column name changes and NaN replacement.
Now, if the RIDB API changes its ‘RECDATA’ record name, we just have to update the code in one place: the RidbData class. Here are the two instances we’ve generated of the RidbData class. In UML, a class instance is denoted with the top section showing “instance name : class name” and the middle section showing the values of the attributes.
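In text form, the two instances look roughly like this (endpoint values abbreviated):

+--------------------------------+   +--------------------------------+
| facilities : RidbData          |   | recareas : RidbData            |
+--------------------------------+   +--------------------------------+
| name = 'facilities_name'       |   | name = 'recareas_name'         |
| endpoint = '.../v1/facilities' |   | endpoint = '.../v1/recareas'   |
| url_params = {'apiKey': ...}   |   | url_params = {'apiKey': ...}   |
| df = <cleaned DataFrame>       |   | df = <cleaned DataFrame>       |
+--------------------------------+   +--------------------------------+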
We’ll address the different ‘clean’ method needed for the media endpoint next.
Extending Classes
One of the principles of OOD is the open/closed principle: classes should be closed for modification but open for extension. This means that once a class is complete, tested, and verified to be working as expected, we want to set it aside as done. Any time you touch a piece of code you create the possibility of new bugs. By following the open/closed principle we reduce the likelihood of bugs, and we can also guarantee that a class is safe for others to extend and use, since it will not change in the future.
So how do we modify the functionality of an existing class? In our example, we have a media endpoint that requires a different clean method than what we have coded in the RidbData class. We can extend the RidbData class and provide a new clean method, while inheriting the functionality of the constructor and extract methods:
class RidbMediaData(RidbData):
    def clean(self):
        self.df = self.df[self.df['MediaType'] == 'Image']
When we create the new class, RidbMediaData, we pass the RidbData class to the class definition. This indicates that RidbMediaData is a derived class of the base class RidbData; RidbData would also be called the superclass of RidbMediaData. Mostly this is to inform you of the language used around this construct; the takeaway is that RidbMediaData inherits the methods and attributes of the RidbData class, so it doesn’t have to implement the __init__ constructor method or the extract method. It will get those implementations from RidbData. The only things you need to implement in a derived class are the methods or attributes that differ from those of the base class.
Taking a look at our diagram, here is how a derived class : base class relationship is drawn. We added a new clean() method for the RidbMediaData derived class, but we didn’t add any new attributes, so the middle section is empty. The open arrowhead indicates inheritance from the base class RidbData: RidbMediaData inherits all the attributes and methods of RidbData, and provides a new clean() method.
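A text sketch of that relationship:

+-----------------+
|    RidbData     |
+-----------------+
| name            |
| endpoint        |
| url_params      |
| df              |
+-----------------+
| __init__()      |
| extract()       |
| clean()         |
+-----------------+
        ^
        |  (inheritance)
+-----------------+
|  RidbMediaData  |
+-----------------+
|                 |   <- no new attributes
+-----------------+
| clean()         |   <- overridden method
+-----------------+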
Using our new RidbMediaData class, we can get the media for FacilityID 200006:
facility200006_media = RidbMediaData('facility10media',
                                     'https://ridb.recreation.gov/api/v1/facilities/200006/media',
                                     dict(apiKey='MY_RIDB_API_KEY'))
facility200006_media.extract()
facility200006_media.clean()
facility200006_media.df
The media endpoint returns an EntityID, which is the same as the FacilityID. We retrieved one image for FacilityID 200006. That’s pretty good, but what if we wanted to get the media for several facilities? Should we create a new object for every facility? That doesn’t seem like a good use of resources. Instead, we can also provide a new implementation of the extract method in the RidbMediaData class to cycle through a list of FacilityIDs:
class RidbMediaData(RidbData):
    def clean(self):
        self.df = self.df[self.df['MediaType'] == 'Image']

    def extract(self):
        for index, param_set in self.url_params.iterrows():
            facility_id = param_set['facilityID']
            req_url = self.endpoint + str(facility_id) + "/media"
            response = requests.get(url=req_url, params=dict(apiKey=param_set['apiKey']))
            data = json.loads(response.text)
            # append new records to self.df if any exist
            if data['RECDATA']:
                new_entry = json_normalize(data['RECDATA'])
                # note: DataFrame.append was removed in pandas 2.0;
                # use pd.concat([self.df, new_entry]) on newer versions
                self.df = self.df.append(new_entry)
To use extract to get media from several facilities at once, we have to make some changes to our constructor parameters ‘endpoint’ and ‘url_params’. For the endpoint, we will pass the facilities endpoint and append the facilityID and “/media” to create the address for each facility. The url_params will become a DataFrame of API key / facilityID pairs for each facility:
media_url = 'https://ridb.recreation.gov/api/v1/facilities/'
media_params = pd.DataFrame({
    'apiKey': config.API_KEY,
    'facilityID': [200001, 200002, 200003, 200004,
                   200005, 200006, 200007, 200008]
})
ridb_media = RidbMediaData('media', media_url, media_params)
ridb_media.extract()
ridb_media.clean()
ridb_media.df.head()
Putting it all together
We now have two objects we can use to extract and clean data from the RIDB API. In addition to the reduction in repeated code through inheritance, we also have a uniform interface for all of these data sources. We can make use of this to create a two-line data extraction pipeline! First, let’s set up our objects and endpoints.
facilities_endpoint = 'https://ridb.recreation.gov/api/v1/facilities/'
recareas_endpoint = 'https://ridb.recreation.gov/api/v1/recareas'
key_dict = dict(apiKey=config.API_KEY)

facilities = RidbData('facilities', facilities_endpoint, key_dict)
recareas = RidbData('recareas', recareas_endpoint, key_dict)
facility_media = RidbMediaData('facilitymedia', facilities_endpoint, media_params)

ridb_data = [facilities, recareas, facility_media]
Now for the really neat part: because our objects all have the same interface, we can extract and clean the data for all of them in two lines:
list(map(lambda x: x.extract(), ridb_data))
list(map(lambda x: x.clean(), ridb_data))
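If the map/lambda idiom feels opaque, an ordinary loop does the same thing; either way, the key point is that every object answers to the same extract()/clean() interface:

for source in ridb_data:
    source.extract()
    source.clean()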
You can now examine the cleaned data for each object in the ridb_data list, for example:
facilities.df.head()
Give it a go for yourself! You can find the Jupyter Notebook for this post on GitHub.
Summary
We’ve seen some ways OOD paradigms can give us scalable, sharable code for data analysis. Some key takeaways:
- Through inheritance, different objects can share the same code, reducing the likelihood that bugs will be introduced through changes in functionality and enabling us to follow the DRY principle.
- By encapsulating the data with the associated methods in an object, we provide an implicit guarantee of the data shape and the manipulations it has undergone. This is especially important for keeping track of data manipulation during feature development.
- Creating uniform interfaces through OOD principles can help us streamline downstream operations.
As you’re writing code, keep an eye out for these “code smells” to identify potential opportunities for OOD to help:
- Repeating yourself with slight tweaks to accommodate differences between data sources
  - Just different URLs, database connection strings, or file names? A well-parameterized function will probably suit you just fine.
  - Finding yourself writing a lot of ‘if’ statements in your extraction code? It’s probably time to refactor and consider using classes.
- Finding similar functions across vertical stacks
  - We looked across the processes used for working with different data sources and identified similar functions for cleaning and acquiring data. If you can identify similarities like this in your code, OOD may help.
[Related article: Web Scraping News Articles in Python]
I hope you’ve found this introduction to object oriented data science helpful for thinking about how you can organize your data analysis code to be more efficient and robust. There are many other ways OOD can be leveraged for data science work, including using abstract base classes for interface definition, writing robust web scrapers through inheritance, and streamlining machine learning prototyping through encapsulating feature development.
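As one example of that first idea, here is a minimal sketch of an abstract base class formalizing the extract/clean interface (my own illustration with an assumed DataSource name, not code from the accompanying notebook):

from abc import ABC, abstractmethod

class DataSource(ABC):
    """Any concrete data source must implement extract() and clean()."""

    @abstractmethod
    def extract(self):
        ...

    @abstractmethod
    def clean(self):
        ...

# RidbData could then be declared as `class RidbData(DataSource):`, and
# Python would refuse to instantiate it until both methods are implemented.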
©ODSC 2016. Feel free to send Sev questions about his post on object oriented data science at sev@thedatascout.com.