Streaming Video Analysis in Python

This was originally posted on the Silicon Valley Data Science blog by Matthew Rubashkin, Data Engineer at SVDS, and Colin Higgins, Data Scientist at Vevo. At SVDS we have analyzed Caltrain delays in an effort to use real-time, publicly available data to improve Caltrain arrival predictions. However, the station-arrival time data from Caltrain was not […]

Image Processing in Python

This was originally posted on the Silicon Valley Data Science blog.  The first step in developing our Caltrain project was creating a proof of concept for the image processing component of the device we used to detect passing trains. We’re big fans of Jupyter Notebooks at SVDS, and so we’ve created a notebook to walk […]

Introduction to Trainspotting

This was originally posted on the Silicon Valley Data Science blog. At Silicon Valley Data Science, we have a slight obsession with the Caltrain. Our interest stems from the fact that half of our employees rely on the Caltrain to get to work each day. We also want to give back to the community, and […]

Installing Jupyter with the PySpark and R kernels for Spark development

This is a quick tutorial on installing Jupyter and setting up the PySpark and R (IRkernel) kernels for Spark development. The prerequisite for following this tutorial is a deployed Hadoop/Spark cluster with the relevant services up and running (e.g. HDFS, YARN, Hive, Spark, etc.). In this tutorial I am using IBM’s Hadoop […]

Introducing Dask distributed #1

tl;dr: We analyze JSON data on a cluster using pure Python projects. Dask, a Python library for parallel computing, now works on clusters. During the past few months, others and I have extended dask with a new distributed memory scheduler. This enables dask’s existing parallel algorithms to scale across tens to hundreds of nodes, and extends a subset […]

Probability is hard: part 4

This is the fourth part of a series of posts about conditional probability and Bayesian statistics. In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems. In the second article, I presented the problem of interpreting medical tests when there is uncertainty about […]

Probability is hard: part three

This is the third part of a series of posts about conditional probability and Bayesian statistics. In the first article, I presented the Red Dice problem, which is a warm-up problem that might help us make sense of the other problems. In the second article, I presented the problem of interpreting medical tests when there is uncertainty […]

Probability is hard, part two

If you read the previous post, you know that my colleague Sanjoy Mahajan and I have been working on a series of problems related to conditional probability and Bayesian statistics. In the previous article, I presented the Red Dice problem, which is relatively simple. I posted it here because it presents four different versions of the […]

Probability is hard

For more than a month, my colleague Sanjoy Mahajan and I have been banging our heads on a series of problems related to conditional probability and Bayesian statistics. We knew when we started that this material is tricky, as demonstrated by veridical paradoxes like the Monty Hall problem, the Girl Named Florida, and so on. […]