How To Create Data Products That Are Magical Using Sequence-to-Sequence Models
A tutorial on how to summarize text and generate features from GitHub Issues using deep learning with Keras and TensorFlow. Teaser: training a model to summarize GitHub Issues. Predictions are in rectangular boxes. The above results are randomly selected elements of a holdout set. Keep reading below; there will be a link to many more... Read more
Watermain Breaks in the City of Toronto
It has been a while since my last post due to the major transition of moving back to Canada. This post will be a bit shorter than my previous ones but hopefully it will give some insight on practically investigating and analyzing open data that are becoming more popular... Read more
Plotting author statistics for Git repos using Git of Theseus
I spent a few days during the holidays fixing up a bunch of semi-dormant open source projects, and I have a couple of blog posts in the pipeline about various updates. First up, I made a number of fixes to Git of Theseus, which is a tool (written in Python) that... Read more
There are so many different aspects of training a neural network that will affect its performance. However, many people spend too much time thinking about learning rates, neuron structures, and epochs before actually using correctly optimized data. Without properly formatting data, your neural network will be useless, regardless of... Read more
Some things I’d like you to know about Data Science
Things I’ve learned, mostly by making mistakes. Masses of data + cutting edge machine learning + cheap compute = Profit. Right? It’s not that simple. Data science isn’t a replacement for asking difficult questions and doing hard work based on the answers. In fact, it’s quite the opposite. Enabled by... Read more
Big aggregate queries can still violate privacy
Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies. Data scientists can only query aggregate statistics, such as counts and averages. These... Read more
Business Analytics: Requirements for Data Transformation
There is a major change happening in the IT industry — the use of big data and analytics to guide how businesses are run. Many companies are embracing analytics as part of their core strategies. Unfortunately, some of those companies think that they can just purchase an analytics solution,... Read more
Why Blockchain Will Improve Your Big Data
The rise of cloud storage has helped companies collect and manage massive amounts of data. Data comes from corporate systems, Internet of Things objects and unstructured sources like online forums. New analytics tools like Hadoop help companies make sense of that data. Yet simply having data and analysis tools... Read more
Custom Level Coding in vtreat
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically preferable to one-hot encoding for... Read more
This post is the first of a two-part series in which we apply NLP techniques to analyze articles about big data, data science, and AI. If you are tired of the hassles of web scraping, then this post might be just for you. I occasionally web scrape news articles from the... Read more