Sheddable Requests: The Intersection of Hackweeks, Book Clubs, and Site Reliability Engineering
One of the things I love about working at Civis is the opportunity we have for continuous learning. Not long ago I had the opportunity to be involved in a book club which read through Google’s Site Reliability Engineering book. One of the essays in this book addressed various methods for handling overload.... Read more
Laminar flow with ggplot2 and gganimate
Preface I’ve realized that all my previous posts were quite substantial in length and took quite a long time to create them. From this point forward I’ll be generating posts of shorter length (partially for my sanity and more for my impulsivity with ideas). A few of these posts won’t be... Read more
Using Excel for Data Entry
This article shows you how to enter data so that you can easily open in statistics packages such as R, SAS, SPSS, or jamovi (code or GUI steps below). Excel has some statistical analysis capabilities, but they often provide incorrect answers. For a comprehensive list of these limitations, see http://www.forecastingprinciples.com/paperpdf/McCullough.pdfand http://www.burns-stat.com/documents/tutorials/spreadsheet-addiction. Simple Data... Read more
When shuffling large arrays, how much time can be attributed to random number generation?
It is well known that contemporary computers don’t like to randomly access data in an unpredictible manner in memory. However, not all forms of random accesses are equally harmful. To randomly shuffle an array, the textbook algorithm, often attributed to Knuth, is simple enough: void swap(int arr, int i,... Read more
Building an Interactive Web “mapp” with Shiny
The purpose of this post is to discuss the key elements in developing an interactive web application that displays data with geographic component. I discuss developing an app using Shiny – a powerful R package. I briefly compare that process to building a similar product in Tableau. Rather than... Read more
Iterating over hash sets quickly in Java
There are many ways in software to represent a set. The most common approach is to use a hash table. We define a “hash function” that takes as an input our elements and produces as an output an integer that “looks random”. Then your element is stored at the... Read more
While you wait for that to finish, can I interest you in parallel processing?
caret has been able to utilize parallel processing for some time (before it was on CRAN in October 2007) using slightly different versions of the package. Around September of 2011, caret started using the foreach package was used to “harmonize” the parallel processing technologies thanks to a super smart guy named Steve Weston. I’ve done a... Read more
Word Vectors with Tidy Data Principles
Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data... Read more
It’s been a couple of weeks since I got accepted in the closed beta testing programme for IBM Data Science Experience (DSX), and it is about time I share my thoughts on this offering.DSX is a new product, which IBM is positioning as a new generation Data Science development and training... Read more
Plotting author statistics for Git repos using Git of Theseus
I spent a few days during the holidays fixing up a bunch of semi-dormant open source projects and I have a couple of blog posts in the pipeline about various updates. First up, I made a number of fixes to Git of Theseus which is a tool (written in Python) that... Read more