An Introduction to Orchestrating Data Assets with Dagster
Editor’s note: Sandy Ryza is a speaker for ODSC West this November 1st-3rd. Be sure to check out his talk, “Orchestrating Data Assets instead of Tasks, with Dagster,” there! Dagster is an open-source data orchestrator: a framework for building and running data pipelines, similar to how... Read more
From Pandas to Features to Models to Predictions – A Deep Dive Into the Hopsworks APIs
When it comes to feature stores, there are two main approaches to feature engineering. One approach is to build a domain-specific language (DSL) that covers all the possible feature engineering steps (e.g., aggregations, dimensionality reduction, and transformations) that a data scientist might need. The second approach... Read more
5 Preferred Programming Languages for Web Scraping
Web scraping, or web harvesting, requires a good tool to be done efficiently. It involves data crawling, content fetching, searching, and parsing, as well as reformatting the collected data so it is ready for analysis and presentation. It is important to use the right software and languages... Read more
3 Ways to Protect Your Code from Software Supply Chain Attacks
Supply chain attacks are intended to benefit from the trust that has grown between a business and a select number of outside partners. Considering that businesses use a wide variety of third-party software for communication, meetings, and the deployment of websites, among other things, it is... Read more
Hopsworks 3.0: The Python-Centric Feature Store
Feature stores began in the world of Big Data, with Spark being the feature engineering platform for Michelangelo (the first feature store) and Hopsworks (the first open-source feature store). Nowadays, the modern data stack has assumed the role of Spark for feature stores – feature engineering... Read more
Don’t Sleep on SQL – 5 Reasons Why it’s a Must-Have Skill in 2022
While we mostly hear about Python, R, and Julia with regard to coding for data science, SQL (Structured Query Language) still has its place as a fundamental skill that supplements more popular languages. Given its ease of use and ability to quickly get started, its versatile... Read more
Embedding Interactive Python Plots on the Web
One of the most important steps in the Data Science pipeline is Data Visualization. In fact, thanks to Data Visualization, Data Scientists can quickly gather insights about the data they have available and spot any possible anomalies. Traditionally, Data Visualization consisted of creating static... Read more
Top 9 Most Essential Python Libraries For Beginners
People worldwide know Python as the most used programming language to date. Major tech companies like Google, Amazon, Meta, Instagram, and Uber use Python for various applications. From web development to machine learning projects, Python is an essential tool in a data scientist’s kit. Many understand... Read more
What to Consider When Building Data Pipelines
In 2021 we watched Fivetran raise $565 million, Airbyte raise $150 million, Matillion raise $100 million, and Rivery raise $16 million, while Informatica went public. All of these companies have some piece of their business connected to data pipelines, also sometimes referred to as ETL, ELT, E(t)LT, and CDC. For... Read more
Testing Within the Shift-Left Philosophy
Traditionally, application testing was carried out during the last stages of the software development life cycle, that is, after the application had been completed and handed to the security teams. If an application did not satisfy quality standards, did not function properly, or otherwise failed... Read more