If you review the schedule or contents of any data science learning platform, such as the Ai+ Training Platform, you’ll notice that data engineering is a primary focus. That’s no surprise, as data engineer is currently one of the fastest-growing roles. The job site DICE’s recent 2020 Tech Job Report listed it as the fastest-growing job, with 50% growth in 2019. While the “rise of the data engineer” is a popular phrase, demand for these engineers has long outstripped supply. If DICE’s data is correct, it will continue to do so.
Fig 1. Source: dice.com
What is a data engineer?
Data engineers are responsible for the overall data infrastructure that supports business products and services. Companies are gathering ever-increasing amounts of data that drive everything from business intelligence dashboards to platform workflows. Data engineers design the infrastructure that stores, moves, and integrates this data from many different sources. They create the environments (data lakes and warehouses) in which data scientists analyze data, enable ETL at scale, and optimize the ecosystem to ensure continuous insights.
Experts estimate that it takes two to three data engineers per data scientist to maintain that pipeline, which drives the high demand for these engineers.
To better understand who employers are seeking for data engineering jobs, we examined our own jobs portal as well as other sources. In total, we reviewed over 1,350 job postings for this role from Q2 and Q3 of 2020.
Fig 2. Source: odsc.com
Definitions of a data engineer role vary, but generally it is someone who:
- Understands data platforms, infrastructure frameworks, and data pipeline workflows.
- Knows how to source, transform, and analyze data at scale.
- Employs programming skills and advanced techniques like machine learning to create the “glue” code that binds the data workflow lifecycle.
Fig 2 shows a representative set of skills sourced from our job portal for this role. Unsurprisingly, data skills such as databases, SQL, NoSQL, and ETL (extract, transform, load) feature prominently. From the postings, it is evident that this role’s domain is the cloud and that experience with AWS or Azure is a top requirement.
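For readers new to the term, the ETL pattern named above can be illustrated with a minimal sketch. It uses only the Python standard library, and everything in it is an assumption made for illustration: the sample rows, the `postings` table, and the function names are hypothetical and are not drawn from the job postings we reviewed. Production pipelines would typically rely on the cloud and big data tools discussed in this article rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a real source system.
RAW_CSV = """title,company,skill
Data Engineer,Acme,sql
Data Engineer,Acme, Python 
Data Engineer,Globex,
ML Engineer,Initech,SPARK
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize skill names and drop rows with no skill."""
    cleaned = []
    for row in rows:
        skill = (row["skill"] or "").strip().lower()
        if skill:
            cleaned.append((row["title"].strip(), row["company"].strip(), skill))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings (title TEXT, company TEXT, skill TEXT)"
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text=RAW_CSV):
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    query = "SELECT skill, COUNT(*) FROM postings GROUP BY skill ORDER BY skill"
    for skill, count in conn.execute(query):
        print(skill, count)
```

Keeping the three stages as separate functions is a design choice for readability: each stage can be tested or replaced on its own, for example by swapping the CSV source for an API call.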
Additionally, the job postings indicated that the ability to engineer data at scale, using a mixture of pipelines and other infrastructure tools, is a defining feature of this role. Knowledge of areas such as distributed systems, APIs, big data, and data lakes, as well as of platforms such as Hadoop, Apache Spark, Kafka, and Airflow, is also important.
Programming skills also feature prominently in job postings, with Python leading the pack. Java also made the cut, which is unsurprising since it has been the main infrastructure language for more than a decade. Scala, the language of Apache Spark, is also in high demand. Languages that many data scientists favor, such as R, did not feature prominently. However, machine learning, an expansive topic in itself, was one of the top requirements for a data engineering position.
According to this report, there is an overlap in skills between a data engineer and a machine learning engineer. This is not surprising, as data engineers need machine learning and analysis skills to build environments that facilitate large-scale AI-driven projects.
Who’s Hiring Now?
Fig 3. Who’s Hiring Now
To determine who’s hiring data engineers in the USA, we looked at a sample set from various job sites in August and September. Big tech companies, including Amazon, Facebook, and Apple, are actively recruiting. The finance industry is weathering the pandemic better than many, and that’s reflected in strong hiring by Capital One, JPMorgan Chase, and USAA, among others. Healthcare companies like United Healthcare and CVS Health are also hiring teams of data engineers. Interestingly, the consulting companies Capgemini and Accenture are also hiring, as many industries seek guidance on navigating the crisis.
Paths to Data Engineering & Upskilling
The paths to a data engineering role vary. With the Ai+ Training Platform, you gain access to our massive library of data science training courses, workshops, keynotes, and talks. These are ideal for those looking to break into the field or to acquire the latest skills needed to get ahead. Some highlighted courses include:
SQL for Data Science: Mona Khalil | Senior Data Scientist | Greenhouse
Data Science in the Industry: Continuous Delivery for Machine Learning with Open-Source Tools: Team from ThoughtWorks, Inc.
How to do Data Science with Missing Data: Matt Brems | Managing Partner, Distinguished Faculty | BetaVector, General Assembly
Continuously Deployed Machine Learning: Max Humber | Lead Instructor | General Assembly