Why Blockchain Will Improve Your Big Data
The rise of cloud storage has helped companies collect and manage massive amounts of data. Data comes from corporate systems, Internet of Things objects and unstructured sources like online forums. New analytics tools like Hadoop help companies make sense of that data. Yet simply having data... Read more
Custom Level Coding in vtreat
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically preferable to... Read more
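As a rough illustration of the idea in the teaser above, here is a minimal sketch of level (impact) coding in plain Python. It assumes the simplest possible definition, each level's mean outcome minus the overall mean, and is not vtreat's actual implementation; the function name `impact_code` and the toy data are hypothetical.

```python
from collections import defaultdict


def impact_code(levels, outcomes):
    """Map each categorical level to (mean outcome for that level) - (overall mean).

    Simplified illustration only: no smoothing and no cross-validation,
    both of which a production implementation would likely need.
    """
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1
    return {level: sums[level] / counts[level] - grand_mean for level in sums}


# Hypothetical toy data: one categorical variable and a numeric outcome.
levels = ["a", "a", "b", "b", "c"]
outcomes = [1.0, 3.0, 10.0, 12.0, 4.0]

codes = impact_code(levels, outcomes)
# Replace each level with its single numeric code instead of indicator columns.
encoded = [codes[level] for level in levels]
```

The categorical column becomes one numeric column (`encoded`) rather than one indicator column per level, which is the contrast with one-hot encoding that the teaser describes.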
This post is the first of a two-part series in which we apply NLP techniques to analyze articles about big data, data science, and AI. If you are tired of the hassles of web scraping, then this post might be just for you. I occasionally web scrape news... Read more
Firing on All Cylinders: The 2017 Big Data Landscape, part 2
A walk through the 2017 Data Ecosystem Landscape. INFRASTRUCTURE: A lot of themes from last year have continued to play out, such as the ever-increasing importance of streaming, where Spark reigns supreme for now and interesting contenders such as Flink are emerging. In addition, a few interesting themes have kept... Read more
Datasets for Building a Data Analysis Portfolio
I recently had the pleasure of attending the 2017 Association of Public Data Users (APDU) Conference. My favorite part of the conference was talking to people who work with federal data on a daily basis. Overall I found people to be passionate about their work and eager... Read more
Beyond Computational Reproducibility, let us Aim for Reusability
Scientific progress calls for reproducing results. Due to limited resources, this is difficult even in computational sciences. Yet, reproducibility is only a means to an end. It is not enough by itself to enable new scientific results. Rather, new discoveries must build on reuse and modification... Read more
It feels good to be a data geek in 2017. Last year, we asked “Is Big Data Still a Thing?”, observing that since Big Data is largely “plumbing”, it has been subject to enterprise adoption cycles that are much slower than the hype cycle. As a result,... Read more
Web Scraping Indeed for Key Data Science Job Skills
Editor’s Note: Check out our 2017 State of Data Science Jobs Report to compare stats, sentiments, and POVs. *Available in Spanish. As many of you probably know, being a data scientist requires a large skill set... Read more
Scraping OpenStreetMap and exploring POI in Cloudant and Jupyter Notebooks
When working with data, the format of the raw data is not always user-friendly. For instance, the format could be one large binary file, or the data could be spread across hundreds of text files. An easy... Read more
How hard can it be to compute conversion rate? Take the total number of users that converted and divide it by the total number of users. Done. Except… it’s a lot more complicated when you have any sort of significant time lag. Prelude: a story. Fresh out... Read more
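The naive calculation described in the teaser above can be sketched as follows. The cohort numbers are hypothetical and only illustrate why a time lag matters: users who signed up recently have had less time to convert, so pooling them with older users pulls the naive rate down.

```python
def naive_conversion_rate(converted_users, total_users):
    """Converted users divided by total users, with no adjustment for time lag."""
    if total_users == 0:
        raise ValueError("total_users must be positive")
    return converted_users / total_users


# Hypothetical cohorts: (total users, users converted so far).
old_cohort = (100, 20)   # signed up long ago, most eventual conversions observed
new_cohort = (100, 2)    # signed up recently, few conversions observed yet

pooled_rate = naive_conversion_rate(
    old_cohort[1] + new_cohort[1],
    old_cohort[0] + new_cohort[0],
)
old_rate = naive_conversion_rate(old_cohort[1], old_cohort[0])

print(f"pooled: {pooled_rate:.2f}, old cohort only: {old_rate:.2f}")
```

With these made-up numbers, the pooled rate is well below the rate of the cohort that has had time to convert, which is the kind of distortion a significant time lag introduces.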