Git-Pandas Caching for Faster Analysis
Git-pandas is a Python library I wrote to make analyzing git data easier when dealing with collections of repositories. It makes a ton of cool stuff easier, like cumulative blame plots, but those analyses can be slow, especially across many large repositories. In the past we’ve worked around that by running analyses offline and by sampling, but most of the work is repeated from run to run.
Enter caching. There are a few places in the codebase where we can cache result sets by revision key and get pretty significant performance boosts when using the library in something like gitnoc. And it turns out it’s pretty straightforward.
Currently in develop, we’ve got a new module with a custom Python decorator that handles caching through different backends:
    def multicache(key_prefix, key_list, skip_if=None):
        def multicache_nest(func):
            def deco(self, *args, **kwargs):
                if self.cache_backend is None:
                    return func(self, *args, **kwargs)
                else:
                    if skip_if is not None:
                        if skip_if(kwargs):
                            return func(self, *args, **kwargs)

                    key = key_prefix + self.repo_name + '_'.join([str(kwargs.get(k)) for k in key_list])
                    try:
                        if isinstance(self.cache_backend, EphemeralCache):
                            ret = self.cache_backend.get(key)
                            return ret
                        elif isinstance(self.cache_backend, RedisDFCache):
                            ret = self.cache_backend.get(key)
                            return ret
                        else:
                            raise ValueError('Unknown cache backend type')
                    except CacheMissException as e:
                        ret = func(self, *args, **kwargs)
                        self.cache_backend.set(key, ret)
                        return ret
            return deco
        return multicache_nest
It looks pretty convoluted, but ends up being pretty useful. It creates a decorator that we can use on any method in the Repository class. The decorator reads the cache_backend set on the repository (currently we have in-memory ephemeral and Redis-based options) and takes a key_prefix, a list of kwarg names to build the cache key from, and optionally a function (typically a lambda) that is applied to the kwargs and returns whether to skip caching.
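To make the key format concrete, here is a small sketch that copies the key expression from the decorator into a standalone function. The helper name build_cache_key and the example prefix, repository name, and kwarg names are made up for illustration and are not part of git-pandas.

```python
def build_cache_key(key_prefix, repo_name, key_list, kwargs):
    """Mirror the key expression used inside multicache (hypothetical helper)."""
    return key_prefix + repo_name + '_'.join([str(kwargs.get(k)) for k in key_list])


# Example values below are assumptions chosen only to show the shape of the key.
key = build_cache_key(
    'blame_',                    # key_prefix
    'git-pandas',                # repo_name
    ['rev', 'ignore_globs'],     # key_list
    {'rev': 'abc123'},           # kwargs actually passed by the caller
)
print(key)  # blame_git-pandasabc123_None
```

Two things fall out of that expression: a kwarg that the caller did not pass shows up in the key as the string None, and positional arguments never enter the key at all, so cached methods are best called with keyword arguments.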
The lambda is particularly useful for cases like rev='HEAD', where we don’t want to cache the results because HEAD can change from moment to moment.
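As a rough illustration of how this plays out, the sketch below applies a simplified copy of the decorator to a toy class. DictCache and FakeRepo are hypothetical stand-ins, not git-pandas classes, and the backend type check from the original decorator is left out to keep the example short.

```python
class CacheMissException(Exception):
    """Raised by a backend when a key is not present."""


class DictCache:
    """Hypothetical stand-in for a cache backend: get/set over a dict."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        if key not in self._store:
            raise CacheMissException(key)
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value


def multicache(key_prefix, key_list, skip_if=None):
    """Simplified copy of the decorator (no backend type check)."""
    def multicache_nest(func):
        def deco(self, *args, **kwargs):
            if self.cache_backend is None:
                return func(self, *args, **kwargs)
            if skip_if is not None and skip_if(kwargs):
                return func(self, *args, **kwargs)
            key = key_prefix + self.repo_name + '_'.join(str(kwargs.get(k)) for k in key_list)
            try:
                return self.cache_backend.get(key)
            except CacheMissException:
                ret = func(self, *args, **kwargs)
                self.cache_backend.set(key, ret)
                return ret
        return deco
    return multicache_nest


class FakeRepo:
    """Hypothetical repository-like class; blame only counts its own calls."""

    def __init__(self, repo_name, cache_backend=None):
        self.repo_name = repo_name
        self.cache_backend = cache_backend
        self.calls = 0

    @multicache(key_prefix='blame_', key_list=['rev'],
                skip_if=lambda kwargs: kwargs.get('rev') == 'HEAD')
    def blame(self, rev='HEAD'):
        self.calls += 1
        return 'blame@%s#%d' % (rev, self.calls)


repo = FakeRepo('demo', cache_backend=DictCache())
first = repo.blame(rev='abc123')    # computed, then stored in the cache
second = repo.blame(rev='abc123')   # served from the cache, blame is not re-run
head_1 = repo.blame(rev='HEAD')     # skip_if matches, so always recomputed
head_2 = repo.blame(rev='HEAD')
print(first == second, repo.calls)  # True 3
```

Note that skip_if only sees the kwargs that were explicitly passed. In this sketch, calling repo.blame() with no arguments would not match the lambda and would be cached under a key ending in None, even though the default revision is HEAD.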
Each of the two caching backends implements basic get/set/purge functionality and lets you set a maximum number of keys, which gives you something like an LRU cache.
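For a sense of what such a backend can look like, here is a minimal in-memory sketch with get/set/purge and a key limit. The class name BoundedCache and the max_keys parameter are assumptions for illustration; this is one common way to get LRU behavior, not necessarily how the git-pandas backends do it.

```python
from collections import OrderedDict


class CacheMissException(Exception):
    """Raised when a key is not in the cache."""


class BoundedCache:
    """Hypothetical in-memory backend with get/set/purge and a key limit."""

    def __init__(self, max_keys=1000):
        self._max_keys = max_keys
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            raise CacheMissException(key)
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def set(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        while len(self._store) > self._max_keys:
            self._store.popitem(last=False)  # evict the least recently used key

    def purge(self):
        self._store.clear()


cache = BoundedCache(max_keys=2)
cache.set('a', 1)
cache.set('b', 2)
cache.get('a')        # 'a' becomes the most recently used key
cache.set('c', 3)     # over the limit, so 'b' is evicted
remaining = list(cache._store)
print(remaining)      # ['a', 'c']
```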
One interesting nugget from the Redis cache: the objects we cache in git-pandas are always pandas DataFrames, so to store them in Redis we can serialize and deserialize them with:
    # self._cache is a connection to redis
    self._cache.set(k, v.to_msgpack(compress='zlib'), ex=self.ttl)
    df = pd.read_msgpack(self._cache.get(k))
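One caveat for readers on newer pandas: to_msgpack and read_msgpack were deprecated in pandas 0.25 and removed in pandas 1.0, so the snippet above only runs on older versions. A possible substitute, sketched here under assumptions, is to pickle the value and compress it with zlib before storing the bytes. FakeRedis is a hypothetical dict-backed stand-in so the example runs without a server, and the list of rows stands in for a DataFrame, which can be pickled the same way.

```python
import pickle
import zlib


class FakeRedis:
    """Hypothetical stand-in for a Redis connection: bytes in, bytes out."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ex=None):
        # ex (expiry in seconds) is accepted but ignored by this stand-in
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def dumps(obj):
    """Pickle and compress an object into bytes suitable for a cache value."""
    return zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def loads(blob):
    """Reverse of dumps. Only unpickle data you wrote yourself:
    loading a pickle from an untrusted source can execute arbitrary code."""
    return pickle.loads(zlib.decompress(blob))


conn = FakeRedis()
# Stand-in for a blame DataFrame; the values are made up for the example.
rows = [{'committer': 'alice', 'loc': 120}, {'committer': 'bob', 'loc': 45}]
conn.set('blame_demo_abc123', dumps(rows), ex=3600)
restored = loads(conn.get('blame_demo_abc123'))
print(restored == rows)  # True
```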
The caching module is still being tested, and is probably one of the last things we will cram in before releasing git-pandas 2.0.0, so check out the repository over on GitHub, try it out, and let me know what you think.
Originally posted at www.willmcginnis.com/