Google Dataset Search Launched to Help Analysts Scour Repositories
Google Dataset Search is a new product in the beta phase that you can use to find datasets published online. The single interface allows you to search repositories worldwide. Imagine you start a new analytics project. For example, let’s say you want to explore numbers pertaining to Boston Public Schools. Before... Read more
K-Means Clustering Applied to GIS Data
GIS can be intimidating to data scientists who haven’t tried it before, especially when it comes to analytics. On its face, mapmaking seems like a huge undertaking. Plus, esoteric lingo and strange datafile encodings can create a significant barrier to entry for newbies. There’s a reason why there are experts who... Read more
Understanding the Hoeffding Inequality
If you read my last post on mathematically defining machine learning problems, then you’ll be familiar with the terminology here. Otherwise, I recommend you read that and then circle back here. The Hoeffding Bound is one of the most important results in machine learning theory, so you’d do well... Read more
A Short Summary of Smoothing Algorithms
When data are noisy, it’s our job as data scientists to listen for signals so we can relay them to someone who can decide how to act. To amp up how loudly hidden signals speak over the noise of big and/or volatile data, we can deploy smoothing algorithms, which... Read more
Machine Learning Approaches to Mobile Sensing Data to Make Self-Driving Cars Safer
Key Takeaways: Mobile sensing data from IoT devices have created opportunities for data scientists to better understand how we drive. Accelerometry and GPS data, for example, can be used to determine a vehicle's heading, acceleration, speed, climb, and other aspects of its motion. Machine learning and other data science techniques... Read more
The Art of Data Science in Spark
Apache Spark, or simply “Spark,” is a highly distributed, fault-tolerant, scalable framework that processes massive amounts of data. As it processes data, Spark abstracts the distribution of the data computations across a machine cluster, thus enabling you to create applications using Java, Scala, Python, R, and SQL. Spark has... Read more
Survey Analysis in SQL and R
Charco Hui, as his Honours project in Statistics, has been writing a package for complex-survey analysis using dplyr and dbplyr. It’s here. At the moment it has only been tested with MonetDB, using the GitHub version (0.5.2) of MonetDBLite, but it should work with many other databases (not SQLite, at the moment). I hope... Read more
Perl as Better grep
I like Perl’s pattern matching features more than Perl as a programming language. I’d like to take advantage of the former without having to go any deeper than necessary into the latter. The book Minimal Perl is useful in this regard. It has chapters on Perl as a better grep, a better awk,... Read more
Greater Speed in Memory-Bound Graph Algorithms with Just Straight C Code
Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in memory. In an earlier post, I showed how we could accelerate memory-bound graph algorithms by using software prefetches. We were able to trim a third... Read more
Emojis, Java and Strings
Emojis are funny characters that are becoming increasingly popular. However, they are probably not as simple as you might think when you are a programmer. For a basis of comparison, let me try to use them in Python 3. I define a string that includes emojis, and then I... Read more