Getting Started with Pandas
Pandas is a popular data analysis library built on top of the Python programming language. It assists with common manipulations for data cleaning, joining, sorting, filtering, deduping, and more. First released in 2009, pandas now sits as the epicenter of Python’s vast data science ecosystem and is an essential... Read more
Removing Items From a Set – remove(), pop(), and difference
Python has a rich collection of built-in data structures. These data structures are sometimes called “containers” or “collections” because they contain a collection of individual items. These structures cover a wide variety of common programming situations. In this recipe, we’ll look at how we can update a set by removing or replacing... Read more
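As a rough illustration of the three operations named in the title, here is a minimal sketch using only Python's built-in set type. The set contents and variable names are hypothetical, chosen just for this example, and are not taken from the recipe itself.

```python
# Hypothetical example data; names and values are illustrative only.
colors = {"red", "green", "blue", "yellow"}

# remove() deletes a specific item in place and raises KeyError if it is absent.
colors.remove("red")

# difference() returns a new set and leaves the original set unchanged.
without_green = colors.difference({"green"})

# pop() removes and returns an arbitrary item, since sets are unordered.
popped = colors.pop()

print(sorted(without_green))
print(popped in colors)
```

Because pop() gives no control over which item is removed, remove() is usually the better fit when a specific item must go.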
Retrieving Webpages Through Python Programming
The internet, and the World Wide Web (WWW) in particular, is probably the most prominent source of information today. Most of that information is retrievable through HTTP. HTTP was originally invented to share pages of hypertext (hence the name Hypertext Transfer Protocol), which eventually gave rise to the WWW. This process occurs every time we request a web page through our devices. The... Read more
Smoothing Data in SQL
A problem found throughout the world of data is how to distinguish signal from noise. When dealing with data that comes in a sequence, such as time-series data (the most familiar example but by no means the only example), a frequent method of dealing with the problem is to... Read more
An Introduction to AWS Networking – Virtual Private Cloud
Cloud computing is one of the major trends in computing today and has been for many years. Public cloud providers have transformed the start-up industry and what it means to launch a service from scratch. We no longer need to build our own infrastructure; we can pay public cloud... Read more
A Data Pattern with an R data.table Solution
Summary: This blog examines a loading pattern seen often with government-generated, web-accessible data. The data comprise millions of records across multiple text or csv files, generally demarcated by time. The files may present different, but overlapping, attributes, while much of the data has a character representation, posing the challenge... Read more
Data Manipulation in R
Not all datasets are as clean and tidy as you would expect. Therefore, after importing your dataset into RStudio, most of the time you will need to prepare it before performing any statistical analyses. Data manipulation can even sometimes take longer than the actual analyses when the quality of the... Read more
Creating if/elseif/else Variables in Python/Pandas
Frequencies and Chaining in Python-Pandas
A few years ago, in a Q&A session following a presentation I gave on data analysis (DA) to a group of college recruits for my then consulting company, I was asked to name what I considered the most important analytic technique. Though a surprise to the audience, my answer,... Read more
The goal of a data analysis pipeline in Python is to allow you to transform data from one state to another through a set of repeatable, and ideally scalable, steps. Problems for which I have used data analysis pipelines in Python include: Processing financial / stock market data, including... Read more