Iterating over hash sets quickly in Java
There are many ways in software to represent a set. The most common approach is to use a hash table. We define a “hash function” that takes as an input our elements and produces as an output an integer that “looks random”. Then your element is stored at the... Read more
While you wait for that to finish, can I interest you in parallel processing?
caret has been able to utilize parallel processing for some time (before it was on CRAN in October 2007) using slightly different versions of the package. Around September of 2011, caret started using the foreach package to “harmonize” the parallel processing technologies, thanks to a super smart guy named Steve Weston. I’ve done a... Read more
Word Vectors with Tidy Data Principles
Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data... Read more
It’s been a couple of weeks since I got accepted in the closed beta testing programme for IBM Data Science Experience (DSX), and it is about time I share my thoughts on this offering. DSX is a new product, which IBM is positioning as a new generation Data Science development and training... Read more
Plotting author statistics for Git repos using Git of Theseus
I spent a few days during the holidays fixing up a bunch of semi-dormant open source projects and I have a couple of blog posts in the pipeline about various updates. First up, I made a number of fixes to Git of Theseus which is a tool (written in Python) that... Read more
Happy, Healthy, Hungry. Mapping San Francisco Restaurant Cleanliness
Somewhat recently, Yelp announced that it is partnering with Code for America and the City of San Francisco to develop LIVES, an open data standard which allows municipalities to publish restaurant inspection data in a standardized format. This is a step towards allowing a much, much more transparent government,... Read more
In a previous article, we discussed the origin story and history of the Python deep learning library TensorFlow. It has experienced a monumental rise like nothing seen before: in just two years since its debut, it has come to hold the title of the most forked repo on GitHub. TensorFlow’s significance doesn’t... Read more
How Docker Can Help You Become A More Effective Data Scientist
For the past 5 years, I have heard lots of buzz about Docker containers. It seemed like all my software engineering friends were using them for developing applications. I wanted to figure out how this technology could make me more effective but I found tutorials online either too detailed:... Read more
On Machine Learning and Programming Languages
This article was co-written by Mike Innes (Julia Computing), David Barber (UCL), Tim Besard (UGent), James Bradbury (Salesforce Research), Valentin Churavy (MIT), Simon Danisch (MIT), Alan Edelman (MIT), Stefan Karpinski (Julia Computing), Jon Malmaud (MIT), Jarrett Revels (MIT), Viral Shah (Julia Computing), Pontus Stenetorp (UCL) and Deniz Yuret (Koç... Read more
Ripyr: Sampled Metrics on Datasets Using Python’s Asyncio
Today I’d like to introduce a little Python library I’ve toyed around with here and there for the past year or so, ripyr. Originally it was written just as an excuse to try out some newer features in modern Python: asyncio and type hinting. The whole package is type... Read more