Introduction to Data Version Control

Any production-level system requires some kind of versioning: a single source of current truth. Any resource that is continuously updated, especially simultaneously by multiple users, requires some kind of audit trail to keep track of all changes.

In software engineering, the solution to this is Git. If you have written code in your life, then you are probably familiar with the beauty that is Git. Git allows us to commit changes, create different branches from a source, and merge our branches back into the original, to name a few.

Data version control (DVC) applies the same paradigm to datasets. Live data systems continuously ingest newer data points while different users carry out different experiments on the same datasets. This leads to multiple versions of the same dataset, which is definitely not a single source of truth.

Additionally, in a machine learning environment, we would also have several versions of the same ‘model’ trained on different versions of the same dataset (for instance, model re-training to include newer data points). If not properly audited and versioned, this would create a tangled web of datasets and experiments. We definitely do not want that!

DVC is, therefore, a system for tracking our datasets by registering every change made to a particular dataset. There are multiple DVC solutions, both free and paid. I recently discovered Hangar, a fully open-source Python DVC package. Let’s have a look at what it can do, shall we?

Working with Hangar

The hangar package is a pure Python implementation and is available through pip. Its core functionality is also closely modelled on Git, which greatly eases the learning curve. We also have the option to either interact with Hangar via the command line or use its dedicated Python client.

Some of the available functionality includes:

Note: the remote repository is the single source of current truth.

A positive point to note here is that Hangar is not built on top of Git but rather emulates Git’s functionality, which makes it faster than a Git-backed approach.

We can install hangar through pip using:

pip install hangar

After installing Hangar, we can import the package directly in Python.

The first thing that we need to do to work with Hangar is to create a data repository. We can import the Repository class from the Hangar package and use it to define our repository.

If it’s our first time working with a particular repository, we have to also initialise it using the init() function.

from hangar import Repository
import os

repo_name = 'test'

# Create the repository directory if it does not exist yet
if not os.path.isdir(repo_name):
    print(f'{repo_name} directory was not found. Creating an empty directory.')
    os.mkdir(repo_name)

# Connect to the repository at that path
repo = Repository(path=repo_name)
print(f'Connected to {repo_name}')

# Initialise the repository the first time we use it
if not repo.initialized:
    print(f'Initialising {repo_name}')
    repo.init(
        user_name="David Farrugia", user_email="davidfarrugia@gmail.com", remove_old=True
    )
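
As an aside, the directory check above can be collapsed into a single standard-library call. The sketch below is independent of Hangar; ensure_repo_dir is a hypothetical helper name, and the temporary directory is an assumption made only so that the example cleans up after itself.

```python
import os
import tempfile


def ensure_repo_dir(path):
    """Create the directory if it is missing and report whether it was created."""
    existed = os.path.isdir(path)
    # exist_ok=True makes the call safe when the directory is already there
    os.makedirs(path, exist_ok=True)
    return not existed


with tempfile.TemporaryDirectory() as tmp:
    repo_path = os.path.join(tmp, 'test')
    created_first = ensure_repo_dir(repo_path)   # directory is created
    created_second = ensure_repo_dir(repo_path)  # directory already exists
    print(created_first, created_second)
```

The second call is a no-op, so the helper can be run at the top of a script every time without checking first.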

Before we can continue with our data versioning example, let us first discuss the methodology behind Hangar.

Approaching Hangar

The main learning curve behind Hangar is understanding the best way to interact with the package. Hangar involves four main components:

The Repository

We can think about the repository as our project warehouse. The repository is essentially a collection and history of the commits performed.

Ideally, every project has its own repository. For example, if we have two main tasks — predicting handwritten digits and predicting fraud — we also create two repositories respectively.

The Dataset

This one is simple. The dataset is, you guessed it, our dataset. But what is a dataset exactly? Let’s take the Titanic dataset as an example. What makes up the dataset?

Is it the individual samples? Is it the variables monitored? And here is where we can get pretty creative with things. Hangar describes a dataset as a collection of columns. We will get into it next.

The Column

The column can be any data property or attribute we like. It can be an array of features, an array of labels, an array of feature names, or even an array of unique identifiers. Every item in the column should, however, correspond to an individual sample in the dataset. At the moment, the supported Column types are:

For instance, if we have a dataset of 28×28 images, we would opt for an array column (with every sample having a shape of 28×28) to represent the actual numerical data. We can use bytes or string columns to store its label, and a string column to store the image file name.


Of course, the above is merely a guide on how to structure your dataset. The type of data that you are working with, as well as the type of experiments to be done, all impact the structuring strategy for Hangar. For example, one might also opt to have a dedicated column for training data and another for validation data.

A Column should be a collection of data samples. We start off with an empty collection, and on sample addition, the collection index increases.
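
One way to picture this is with plain Python dictionaries standing in for columns. The sketch below is only a mental model, not Hangar's implementation: each hypothetical column maps a sample key to a value, and the same key identifies the same sample across all columns. The names add_sample, images, labels and filenames are made up for illustration.

```python
# Plain dicts stand in for Hangar columns (a mental model only).
images = {}     # would be an array column: one 28x28 grid per sample
labels = {}     # would be a string or bytes column
filenames = {}  # would be a string column


def add_sample(pixels, label, filename):
    """Add one sample to every column under the next free integer key."""
    key = len(images)
    images[key] = pixels
    labels[key] = label
    filenames[key] = filename
    return key


blank = [[0] * 28 for _ in range(28)]  # placeholder 28x28 image
first = add_sample(blank, 'seven', 'img_0001.png')
second = add_sample(blank, 'three', 'img_0002.png')

print(first, second, len(images))
print(labels[1], filenames[1])
```

Each collection starts empty, and the index grows by one with every sample added, which is the behaviour described above.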

The Data

And finally, the data. Once we have figured out which Columns to have, processing the data accordingly becomes a relatively simple task. The data itself is just numbers. It doesn’t have any direct meaning, and it doesn’t have any structure.

With that out of the way, let’s proceed with the rest of our example.

Assuming we have a tabular classification dataset — df — we will simply store the entire dataset in a single column as bytes.

We start off by creating a WriterCheckout. The WriterCheckout object allows us to enable a specific branch (in our case, we only have a single branch: master) with write access (i.e. with the ability to write and commit changes to the active branch). We do this using master = repo.checkout(write=True).

We can then instruct Hangar to create a bytes Column called ‘data’ by calling add_bytes_column. Since this is our first commit, our Column is still empty, so we can store our data at index 0. Since we specified our Column as a bytes Column, we must first convert our data to a bytes object. Finally, we call the commit function to commit and save our changes. Below, we show a code example of what we just discussed.

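import pickle

# Get the WriterCheckout (write access to the master branch)
master = repo.checkout(write=True)

# Add a new bytes Column called 'data'
master.add_bytes_column('data')

# Add the data at index 0, serialised to bytes
master['data'][0] = pickle.dumps(df, protocol=4)

# Commit
master.commit('This is our first commit!')

# Close the WriterCheckout to release the write lock
master.close()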
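
Because the column stores raw bytes, reading a sample back means reversing the serialisation step. The sketch below shows that round trip on its own, without Hangar. A list of dicts stands in for df, which is an assumption made to keep the example free of extra dependencies.

```python
import pickle

# Stand-in for the tabular dataset `df` (assumed here to be a list of rows).
df = [
    {'age': 22, 'fare': 7.25, 'survived': 0},
    {'age': 38, 'fare': 71.28, 'survived': 1},
]

payload = pickle.dumps(df, protocol=4)  # the bytes that would go into the column
restored = pickle.loads(payload)        # what we would do after reading them back

print(type(payload).__name__, restored == df)
```

Keep in mind that pickle.loads should only be called on bytes that come from a source you trust.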

Note: Hangar does not allow more than one WriterCheckout to be in circulation at a time, to avoid conflicts. Thus, when we are done with a WriterCheckout, we should be sure to close it. If a write lock is already in circulation, we would only be allowed to check out in read-only mode.
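
The single-writer rule can be illustrated with a toy model. The classes below are hypothetical stand-ins written for this sketch, not Hangar's code; they only mimic the behaviour described in the note, and the try/finally pattern shows one way to make sure the lock is always released.

```python
class ToyRepo:
    """Toy model of a repository that allows one writer at a time."""

    def __init__(self):
        self.writer_active = False

    def checkout(self, write=False):
        if write:
            if self.writer_active:
                raise PermissionError('a write checkout is already open')
            self.writer_active = True
        return ToyCheckout(self, write)


class ToyCheckout:
    def __init__(self, repo, write):
        self.repo = repo
        self.write = write

    def close(self):
        # Only a writer holds the lock, so only a writer releases it.
        if self.write:
            self.repo.writer_active = False


toy_repo = ToyRepo()
writer = toy_repo.checkout(write=True)
try:
    try:
        toy_repo.checkout(write=True)  # a second writer is refused
        second_writer_allowed = True
    except PermissionError:
        second_writer_allowed = False
    reader = toy_repo.checkout(write=False)  # read-only access still works
finally:
    writer.close()  # always release the lock, even if an error occurred

writer_again = toy_repo.checkout(write=True)  # succeeds now that the lock is free
writer_again.close()
print(second_writer_allowed, toy_repo.writer_active)
```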

If we want to add another commit to the same Column, we follow the same process but instead write to master['data'][1], and so on for future commits. Every commit will also have a hash key bound to it.

Branching in Hangar

Branching becomes particularly useful when we want to get a copy of the data at a specific point to run custom experiments on it without actually changing it. We can branch out and, after we confirm that our processing is correct, merge back into the main branch. The typical branching flow looks something like this:

Create Branch -> Checkout Branch -> Make Changes -> Commit -> Merge
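
To make the flow concrete, here is a toy in-memory model of it. This is a sketch under heavy assumptions, not how Hangar works internally: ToyStore and all of its methods are hypothetical, commit hashes are simply SHA-1 digests of the serialised content, and the merge only handles the fast-forward case.

```python
import hashlib
import json


class ToyStore:
    """Toy commit store: each commit is a snapshot plus a parent pointer."""

    def __init__(self):
        self.commits = {}                 # hash -> (parent hash, snapshot)
        self.branches = {'master': None}  # branch name -> latest commit hash

    def commit(self, branch, snapshot, message):
        parent = self.branches[branch]
        body = json.dumps([parent, snapshot, message], sort_keys=True)
        digest = hashlib.sha1(body.encode('utf-8')).hexdigest()
        self.commits[digest] = (parent, dict(snapshot))
        self.branches[branch] = digest
        return digest

    def create_branch(self, name, base='master'):
        self.branches[name] = self.branches[base]

    def snapshot(self, branch):
        # Return a copy so that edits do not touch the stored commit
        return dict(self.commits[self.branches[branch]][1])

    def is_ancestor(self, old, new):
        while new is not None:
            if new == old:
                return True
            new = self.commits[new][0]
        return old is None

    def merge(self, target, source):
        """Fast-forward only: move target to source if target has not diverged."""
        if not self.is_ancestor(self.branches[target], self.branches[source]):
            raise ValueError('branches diverged; a real merge is needed')
        self.branches[target] = self.branches[source]


store = ToyStore()
store.commit('master', {'0': 'base data'}, 'Base Dataset')
store.create_branch('test')                      # Create Branch
working = store.snapshot('test')                 # Checkout Branch
working['1'] = 'new data points'                 # Make Changes
store.commit('test', working, 'added new data')  # Commit
store.merge('master', 'test')                    # Merge

print(store.snapshot('master'))
```

Until the merge step runs, master still points at the base commit, which is what makes a branch a safe place to experiment.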

We can create branches using repo.create_branch(name='test'), and merge as follows:

master.merge(message='message for merge', dev_branch='test')

Every commit in Hangar is given a hash key. We can use that hash to pinpoint exactly the branching point:

test_branch2 = repo.create_branch(name='test2', base_commit=<SOME_HASH_KEY>)

By calling repo.log(), we can get a log summary of the current branches and their latest commit. An example log would look something like this:

* a=cf94cf8b4c5758c885c6b84d58c4fbe22f379510 (test2): added new test branch
* a=a8fe61916764b873f13c80a14ce4fda610b74df9 (test) (master): Base Dataset
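
If all we have is the printed text, lines in this shape can be pulled apart with a regular expression. The sketch below assumes the exact layout of the two example lines above; the real output may differ between Hangar versions, and parse_log is a hypothetical helper, not part of the package.

```python
import re

log_text = """\
* a=cf94cf8b4c5758c885c6b84d58c4fbe22f379510 (test2): added new test branch
* a=a8fe61916764b873f13c80a14ce4fda610b74df9 (test) (master): Base Dataset
"""

# "* a=<hash> (<branch>) [(<branch>) ...]: <message>"
LINE = re.compile(r'^\*\s+a=([0-9a-f]+)\s+((?:\([^)]*\)\s*)+):\s*(.*)$')


def parse_log(text):
    """Return one dict per log line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip anything that does not look like a commit line
        commit_hash, labels, message = match.groups()
        branches = re.findall(r'\(([^)]*)\)', labels)
        entries.append({'hash': commit_hash, 'branches': branches, 'message': message})
    return entries


entries = parse_log(log_text)
for entry in entries:
    print(entry['hash'][:8], entry['branches'], entry['message'])
```

The second entry carries two branch labels because test and master both point at the same commit.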

We can get the difference and conflicts between branches as follows:

repo.diff('master', 'test2')
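
Conceptually, a diff between two branches reports which samples were added, removed, or changed. The sketch below shows that idea on two plain dictionaries; it is a simplified illustration with made-up data and a hypothetical diff_snapshots helper, and it does not reproduce the structure of the object Hangar returns.

```python
def diff_snapshots(base, other):
    """Compare two snapshots (dicts of sample key -> value)."""
    added = sorted(k for k in other if k not in base)
    removed = sorted(k for k in base if k not in other)
    changed = sorted(k for k in base if k in other and base[k] != other[k])
    return {'added': added, 'removed': removed, 'changed': changed}


# Made-up contents of a 'data' column on two branches
master_data = {0: b'base', 1: b'second batch'}
test2_data = {0: b'base (cleaned)', 2: b'third batch'}

result = diff_snapshots(master_data, test2_data)
print(result)
```

A conflict would arise when both branches changed the same key in different ways since they split, which is why the diff is worth checking before a merge.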

Concluding Remarks

In this post, we went over the Hangar package as an open-source solution for DVC in Python. Is this all that Hangar offers? Definitely not! We introduced the fundamentals and discovered how we can get started with Hangar. As always, I highly encourage you to go over their documentation and practice with your own use-case.

Article originally posted here. Reposted with permission.
