Neo4j on IBM POWER8 – Bigger Graphs and Better Performance

Today’s IT business leaders focused on big data solutions face two big challenges: they need to manage massive volumes of data, and they need to rapidly generate insight from that data. Over time, one key insight has emerged: the existing and new relationships found by analyzing raw data more deeply matter a lot ...

Resolution Tweet Breakdown

It's a new year, everyone, which means New Year's resolutions. Many of us are making the usual promises to ourselves about picking up or dropping certain habits. Thinking about New Year's resolutions prompted us here at opendatascience.com to think about and analyze people's goals. While it's impossible to know everyone's resolution, the next best ...

Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning – Part 1

Introduction: Machine learning has become a central part of our lives – as consumers, customers, and hopefully as researchers and practitioners! Whether we are applying predictive modeling techniques to our research or business problems, I believe we have one thing in common: We want to make “good” predictions! Fitting a model to our training data ...

Even Further Beyond One-hot: Feature Hashing

In the previous post about categorical encoding we explored different methods for converting categorical variables into numeric features. In this post, we will explore another method: feature hashing. Feature hashing, or the hashing trick, is a method for turning arbitrary features into a sparse binary vector. It can be extremely efficient by ...
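
As a rough illustration of the hashing trick described in this excerpt, here is a minimal Python sketch. The function name `hash_features`, the bucket count of 16, and the choice of MD5 as the hash are all assumptions made for this example, not details taken from the post. Each feature string is hashed to a bucket index, and that position in a fixed-length binary vector is set to 1.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map arbitrary string features to a fixed-length binary vector.

    Hypothetical helper for illustration; n_buckets and MD5 are assumed choices.
    """
    vec = [0] * n_buckets
    for tok in tokens:
        # Use a stable hash: Python's built-in hash() is salted per process.
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        idx = int(digest, 16) % n_buckets
        # Two different features may collide in the same bucket.
        vec[idx] = 1
    return vec


row = hash_features(["color=red", "city=Paris"])
print(row)
```

The vector is shown dense here for readability; a practical version would likely store only the indices of the non-zero entries, since most buckets stay empty.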

Why do Decision Trees Work?

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more ...

Sums of Consecutive powers, Bernoulli numbers, Riemann zeta, and Strange Sums

There’s a well-known formula for the sum of the first n positive integers: 1 + 2 + 3 + … + n = n(n + 1) / 2. There’s also a formula for the sum of the first n squares, 1² + 2² + 3² + … + n² = n(n + 1)(2n + 1) / 6, and for the sum of the first n cubes: 1³ + 2³ + 3³ + … + n³ = n²(n + 1)² / 4. It’s natural to ask whether there’s a ...
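
The three closed forms quoted in this excerpt are easy to check numerically. The short Python sketch below compares each formula against a brute-force sum; the function name `check_power_sums` is a hypothetical helper introduced only for this illustration.

```python
def check_power_sums(n):
    """Compare brute-force sums of k, k^2, k^3 for k = 1..n with the closed forms."""
    ks = range(1, n + 1)
    ints_ok = sum(ks) == n * (n + 1) // 2
    squares_ok = sum(k ** 2 for k in ks) == n * (n + 1) * (2 * n + 1) // 6
    cubes_ok = sum(k ** 3 for k in ks) == n ** 2 * (n + 1) ** 2 // 4
    return ints_ok, squares_ok, cubes_ok


# Verify the formulas for a range of n.
results = [check_power_sums(n) for n in range(1, 101)]
print(all(all(r) for r in results))
```

Integer division is safe here because each numerator is always divisible by its denominator.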

Nearest Neighbor Methods and Vector Models – part 1

This is a blog post rewritten from a presentation at NYC Machine Learning. It covers a library called Annoy that I have built that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. I will be splitting it into several parts. This first part talks about vector models, how to measure similarity, and why nearest neighbor ...

How to “Get Good at R”

Editor's note: post modified from original. How can I get good at R? This has come up enough times for me to outline my thoughts on the subject. That way I can simply forward people to this post the next time the question comes up. My advice is geared towards people who want to build an online portfolio that improves their career. This ...

Installing Jupyter with the PySpark and R kernels for Spark development

This is a quick tutorial on installing Jupyter and setting up the PySpark and R (IRkernel) kernels for Spark development. The prerequisite for following this tutorial is a deployed Hadoop/Spark cluster with the relevant services up and running (e.g. HDFS, YARN, Hive, Spark, etc.). In this tutorial I am using IBM's Hadoop distribution ...