Git Tip: Apply a Patch
I learned a new thing this weekend: we apparently can apply a patch onto a branch/fork using git apply. There are a few things to unpack here. First off, what’s a patchfile? The long story cut short is that a patchfile is nothing more than a plain text file that... Read more
Survey Analysis in SQL and R
Charco Hui, as his Honours project in Statistics, has been writing a package for complex-survey analysis using dplyr and dbplyr. It’s here. At the moment it has only been tested with MonetDB, using the GitHub version (0.5.2) of MonetDBLite, but it should work with many other databases (not SQLite, at the... Read more
Dask Scaling Limits
This work is supported by Anaconda Inc. History: For the first year of Dask’s life it focused exclusively on single-node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when... Read more
IOTA – The Potential to Drive Data Science for IoT
I have a close circle of clued-in, tech-savvy friends whose views I take seriously. For the last few weeks, one of these friends has been sending me emails extolling the merits of something called IOTA, which calls itself the next-generation blockchain. At first, I... Read more
Are data warehouses a thing of the past?
With almost everything around us becoming a source of data, it’s proving to be quite a challenge for traditional data warehouses to support such fast-changing, high-volume data. So are data warehouses a thing of the past already? A huge collection of data... Read more
How To Create Data Products That Are Magical Using Sequence-to-Sequence Models
A tutorial on how to summarize text and generate features from GitHub Issues using deep learning with Keras and TensorFlow. Teaser: training a model to summarize GitHub Issues. Predictions are in rectangular boxes. The above results are randomly selected elements of a holdout set. Keep reading below, there will be a link... Read more
Word Vectors with Tidy Data Principles
Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R... Read more
This is the first post in a series of three articles in which we will discuss tips and guidelines for successful data science implementations. This post goes over the things you should worry about before writing the first line of code. A high-level data... Read more
How Do You Discover R Packages?
Like I mentioned in my last blog post, I am contributing to a session at useR 2017 this coming July that will focus on discovering and learning about R packages. This is an increasingly important issue for R users as we all decide which of the... Read more
Standard software development practices for web, SaaS, and industrial environments tend to focus on maintainability, code quality, robustness, and performance. Scientific programming in data science is more concerned with exploration, experimentation, making demos, collaborating, and sharing results. It is this very need for experiments, explorations, and... Read more