The Last 5 Years In Deep Learning
Introduction: As we’re nearing the end of 2017 (and coincidentally the first day of NIPS 2017), we’ve come to the 5-year landmark of deep learning really starting to hit the mainstream. For me, I think of AlexNet and the 2012 ImageNet competition as the coming-out party (although researchers have...
Some things I’d like you to know about Data Science
Things I’ve learned mostly by making mistakes. Masses of data + cutting-edge machine learning + cheap compute = Profit. Right? It’s not that simple. Data science isn’t a replacement for asking difficult questions and doing hard work based on the answers. In fact, it’s quite the...
CatBoost: Yandex’s machine learning algorithm is available free of charge
Russia’s Internet giant Yandex has launched CatBoost, an open source machine learning service. The algorithm has already been integrated by the European Organization for Nuclear Research to analyze data from the Large Hadron Collider, the world’s most sophisticated experimental facility. Machine learning helps make decisions by...
A conversation with Thomas Wiecki on the use of probabilistic programming and machine learning in quant finance.
Not surprisingly… hedge funds and especially quant funds are notorious for being secretive about the algorithms they employ to beat the market. A Boston-based startup is taking a different approach. Thomas Wiecki is the Head of Research at Quantopian, which hosts an open source platform...
You weren’t supposed to actually implement it, Google
Last month, I wrote a blog post warning about how, if you follow popular trends in NLP, you can easily accidentally make a classifier that is pretty racist. To demonstrate this, I included the very simple code, as a “cautionary tutorial.” The post got a fair amount...
A decade of using text-mining for citation function classification
Academic work is typically filled with references to previous work. Unfortunately, most of these references have, at best, a tangential relevance. Thus you cannot trust that a paper that cites another actually “builds on it”. A more likely scenario is that the authors of the latest...
Big aggregate queries can still violate privacy
Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies. Data scientists can only query aggregate statistics, such as counts...
Linked Data and Data Science
Understanding Gender Roles in Movies with Text Mining
I have a new visual essay up at The Pudding, using text mining to explore how women are portrayed in film. In April 2016, we broke down film dialogue by gender. The essay presented an imbalance in which men delivered more lines than women across 2,000 screenplays. But...
Natural Language Processing in a Kaggle Competition for Movie Reviews
I decided to try playing around with a Kaggle competition. In this case, I entered the “When bag of words meets bags of popcorn” contest. This contest isn’t for money; it is just a way to learn about various machine learning approaches. The competition was trying to showcase...