Teaching pivot / un-pivot

Co-written by John Mount and Nina Zumel. In teaching thinking in terms of coordinatized data, we find the hardest operations to teach are joins and pivot. One thing we have commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting”, or “gathering”) is easy […]
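
The authors work primarily in R, but the idea translates directly; as a rough illustration (not the post's own code), here is a minimal pandas sketch of un-pivoting a wide table into thin entity/attribute/value rows and pivoting it back. The table and column names are invented for the example.

```python
import pandas as pd

# A small "wide" table: one row per student, one column per subject.
wide = pd.DataFrame({
    "student": ["alice", "bob"],
    "math": [90, 75],
    "reading": [85, 95],
})

# Un-pivot ("melt"/"gather"): move column values into rows,
# yielding a thin entity/attribute/value table.
thin = wide.melt(id_vars="student", var_name="subject", value_name="score")

# Pivot ("spread"): move the attribute values back into columns.
wide_again = thin.pivot(index="student", columns="subject", values="score").reset_index()

print(thin)
print(wide_again)
```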

5 Reasons Apache Spark is the Swiss Army Knife of Big Data Analytics

We are living in exponential times, especially when it comes to data. The world is moving fast, and more data is generated every day. With all that data coming your way, you need the right tools to handle it. If you want to get any insights from all that […]

Scala, the Language for Data Science

Let’s be honest: there are two reasons why it’s worth learning a new programming language. The first is that you will need it for your daily job; the second is that it’s fun. Scala is a language you may well want to learn by the end of this post, if you […]

Prediction Machine Designed with Spark, Kudu, and Impala

This was originally posted on the Silicon Valley Data Science blog. Why should your infrastructure maintain a linear growth pattern when your business scales up and down during the day based on natural human cycles? There is an obvious need to maintain a steady baseline infrastructure to keep the lights on for your business, but […]

NLP and Effective Topic Modeling with Spark MLLib

Abstract: The world communicates in text. Our work lives have us treading waist-deep in email, our hobbies often have blogs, our complaints go to Yelp, and even our personal lives are lived out via tweets, Facebook updates, and texts. There is a massive amount of information that can be gleaned from text — as long […]
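
To make the topic-modeling piece concrete, here is a minimal sketch using Spark's DataFrame-based MLlib API (the talk's own pipeline may differ); the two-document corpus and the choice of k=2 are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("topics").getOrCreate()

# Toy corpus standing in for emails, tweets, reviews, etc.
docs = spark.createDataFrame([
    (0, "spark makes big data processing simple"),
    (1, "topic models discover themes in text"),
], ["id", "text"])

# Tokenize, drop stop words, and build term-count vectors.
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
cleaned = StopWordsRemover(inputCol="words", outputCol="filtered").transform(words)
vectorized = CountVectorizer(inputCol="filtered", outputCol="features").fit(cleaned).transform(cleaned)

# Fit a 2-topic LDA model and inspect the top words per topic.
model = LDA(k=2, maxIter=10).fit(vectorized)
model.describeTopics(3).show(truncate=False)
```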

Testing with Pyspark

Apache Spark, and PySpark in particular, is a fantastically powerful framework for large-scale data processing and analytics. In the past I’ve written about Flink’s Python API a couple of times, but my day-to-day work is in PySpark, not Flink. With any data processing pipeline, thorough testing is critical to ensuring the veracity of the end result, so […]
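
As a flavor of what testing a PySpark pipeline can look like (a sketch, not the post's own code), here is a pytest fixture that shares one local SparkSession across the test session, exercising a hypothetical double_values transformation:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared by the whole test session.
    session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    yield session
    session.stop()

def double_values(df):
    # Hypothetical transformation under test.
    return df.withColumn("value", df["value"] * 2)

def test_double_values(spark):
    df = spark.createDataFrame([(1,), (2,)], ["value"])
    assert [r["value"] for r in double_values(df).collect()] == [2, 4]
```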

Versatile Spark – Streaming

Distributed computing is the fuel for large-scale processing in modern data pipelines, and Hadoop and its open-source competitors built out this ecosystem. In recent years, Apache Spark has gained favor due to its versatility. As preference for Spark grows, the software diversifies and its applications increase. Apache Spark is a product of the University of […]
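
For a taste of the streaming side, here is a minimal Structured Streaming word count (the talk may well use the older DStream API instead); it assumes a text source on localhost:9999, e.g. one started with `nc -lk 9999`.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Read an unbounded stream of lines from a local socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Maintain a running word count over the stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the full updated counts to the console on every trigger.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```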

Text modeling with R, Python, and Spark

Text analysis with R, Python, and Spark, using State of the Union Addresses and Congressional Hearings. Frank D. Evans, data scientist at Exaptive, provides a conceptual and technical look at text analysis on big data with open source tools. His fodder is 70 years of State of the Union Addresses, from Truman to Obama, […]
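
Evans's actual pipeline is not reproduced here, but a typical first step in this kind of text modeling is weighting terms by TF-IDF; the following PySpark sketch (with invented two-line stand-ins for the speeches) shows the shape of it:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("tfidf").getOrCreate()

# Invented stand-ins for full speech transcripts.
speeches = spark.createDataFrame([
    (1946, "peace and prosperity for the nation"),
    (2016, "the state of our union is strong"),
], ["year", "text"])

# Tokenize, hash terms into a fixed-size count vector, then reweight by IDF.
words = Tokenizer(inputCol="text", outputCol="words").transform(speeches)
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 12).transform(words)
tfidf = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)
tfidf.select("year", "features").show(truncate=False)
```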

Introduction to Spark

Spark is a general-purpose cluster computing framework that provides efficient in-memory computation over large data sets by distributing work across multiple machines. In this explicit walkthrough, Benjamin Bengfort guides the reader through installing Spark on a local machine or on EC2, then follows with a clear explanation of the nature and inner workings of […]
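
As a small taste of that distributed, in-memory model (a sketch, not the walkthrough's own code; older tutorials use SparkContext directly where newer ones start from SparkSession), this parallelizes a range across local workers and aggregates it:

```python
from pyspark.sql import SparkSession

# Local mode; on a cluster or EC2, only the master URL changes.
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

# Distribute the data across workers and compute on it in memory.
rdd = spark.sparkContext.parallelize(range(1, 1001))
print(rdd.map(lambda x: x * x).sum())  # 333833500

spark.stop()
```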