Dimensional Modeling and Kimball Data Marts in the Age of Big Data and Hadoop
Is dimensional modeling dead? Before I answer this question, let’s take a step back and look at what we mean by dimensional data modeling. Why do we need to model our data? Contrary to a common misunderstanding, it is not the only... Read more
Blockchain Technology is Hot Right Now, But Tread Carefully.
In 1968, Dr. Spencer Silver, a chemist at 3M Company, was tasked with creating a super-strong adhesive. Despite his efforts, his compound achieved only moderate stickiness, leaving him with a solution in search of a problem. That was until his colleague, Art Fry, noticed he could use the glue... Read more
Data Ingestion with Spark and Kafka
This was originally posted on the Silicon Valley Data Science blog. Important architectural components of any data platform are the pieces that manage data ingestion. In many of today’s “big data” environments, the data involved is at such scale in terms of throughput (think of the Twitter “firehose”) or volume (e.g., the... Read more
Full Stack Data Science at ODSC
Register now for ODSC West and save 60% with code KD60 until September 1st. Data Science is built on a rapidly expanding stack. Let me explain by using a software analogy. To build functioning web apps you need a data store, model (or business) layer, some kind of message layer,... Read more
The IoT and AI Are Breaking Down Old Application Categories
For the many years that I have been researching IT, there has always been a clear distinction between certain types of applications. One, for example, distinguished “B to B” (business to business) applications from those for “B to C” (business to consumer). B to B applications were for business... Read more
Intro to Caret, Model Training and Tuning
Contents: Model Training and Parameter Tuning; An Example; Basic Parameter Tuning; Notes on Reproducibility; Customizing the Tuning Process; Pre-Processing Options; Alternate Tuning Grids; Plotting the Resampling Profile; The trainControl Function; Alternate Performance Metrics; Choosing the Final Model; Extracting Predictions and Class Probabilities; Exploring and Comparing Resampling Distributions; Within-Model; Between-Models; Fitting Models... Read more
Intro to Caret: Data Splitting
Contents: Simple Splitting Based on the Outcome; Splitting Based on the Predictors; Data Splitting for Time Series; Data Splitting with Important Groups. 4.1 Simple Splitting Based on the Outcome: The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random... Read more
Graft.jl – General purpose graph analytics for Julia
This blog post describes my work on Graft.jl, a general purpose graph analysis package for Julia. For those unfamiliar with graph algorithms, a quick introduction might help. Proposal My proposal, titled ParallelGraphs, was to develop a parallelized/distributed graph algorithms library. However, in the first month or so, we decided to work towards a... Read more
Git-Pandas caching for Faster Analysis
Git-pandas is a Python library I wrote to help make analysis of git data easier when dealing with collections of repositories. It makes a ton of cool stuff easier, like cumulative blame plots, but they can be kind of slow, especially with many large repositories. In the past we’ve made that... Read more
Cognitive Computing
What is Cognitive Computing? Most likely, anyone who is even remotely aware of the contemporary Data Science landscape will recognize the truth of the following two statements: (a) Data Wrangling is necessary with almost every new project, and (b) Data Wrangling is difficult and tedious. Following all the investment and enthusiasm... Read more