K-Means Clustering Applied to GIS Data
GIS can be intimidating to data scientists who haven’t tried it before, especially when it comes to analytics. On its face, mapmaking seems like a huge undertaking. Plus, esoteric lingo and strange datafile encodings can create a significant barrier to entry for newbies. There’s a reason why there are experts who... Read more
Understanding the Hoeffding Inequality
If you read my last post on mathematically defining machine learning problems, then you’ll be familiar with the terminology here. Otherwise, I recommend you read that and then circle back here. The Hoeffding Bound is one of the most important results in machine learning theory, so you’d do well... Read more
A Short Summary of Smoothing Algorithms
When data are noisy, it’s our job as data scientists to listen for signals so we can relay them to someone who can decide how to act. To amp up how loudly hidden signals speak over the noise of big and/or volatile data, we can deploy smoothing algorithms, which... Read more
Machine Learning Approaches to Mobile Sensing Data to Make Self-Driving Cars Safer
Key Takeaways: Mobile sensing data from IoT devices have created opportunities for data scientists to better understand how we drive. Accelerometry and GPS data, for example, can be used to determine vehicular heading, acceleration, speed, climb, and other aspects of its motion. Machine learning and other data science techniques... Read more
The Art of Data Science in Spark
Apache Spark, or simply “Spark,” is a highly distributed, fault-tolerant, scalable framework that processes massive amounts of data. As it processes data, Spark abstracts the distribution of the data computations across a machine cluster, thus enabling you to create applications using Java, Scala, Python, R, and SQL. Spark has... Read more
Survey Analysis in SQL and R
Charco Hui, as his Honours project in Statistics, has been writing a package for complex-survey analysis using dplyr and dbplyr. It’s here. At the moment it has only been tested with MonetDB, using the GitHub version (0.5.2) of MonetDBLite, but it should work with many other databases (not SQLite, at the moment). I hope... Read more
Perl as Better grep
I like Perl’s pattern matching features more than Perl as a programming language. I’d like to take advantage of the former without having to go any deeper than necessary into the latter. The book Minimal Perl is useful in this regard. It has chapters on Perl as a better grep, a better awk,... Read more
Greater Speed in Memory-Bound Graph Algorithms with Just Straight C Code
Graph algorithms are often memory bound. When you visit a node, there is no reason to believe that its neighbours are located nearby in memory. In an earlier post, I showed how we could accelerate memory-bound graph algorithms by using software prefetches. We were able to trim a third... Read more
Emojis, Java and Strings
Emojis are funny characters that are becoming increasingly popular. However, they are probably not as simple as you might think when you are a programmer. For a basis of comparison, let me try to use them in Python 3. I define a string that includes emojis, and then I... Read more
SQL on Hadoop, BigQuery, or Exadata. Please don’t call them MPP.
I often hear people referring to SQL engines running against HDFS or object storage as MPP. Strictly speaking, this is incorrect. Let me first explain what an MPP database is and then explain why engines such as Presto etc. should not be called MPP engines. MPP In an... Read more
Open Data Science - Your News Source for AI, Machine Learning & more