If you review the schedule of events at data science conferences, such as ODSC West, you’ll notice that data engineering is a primary focus. That’s no surprise, as data engineer is currently one of the fastest-growing roles. The 2020 Tech Job Report from the job site DICE listed it as the fastest-growing job, with 50% growth in 2019. While the “rise of the data engineer” is a popular phrase, the demand for these engineers has long outstripped the supply. If DICE’s data is correct, demand will continue to outstrip supply.
Fig 1. Source: dice.com
What Is a Data Engineer?
Data engineers are responsible for the overall data infrastructure that supports business products and services. Companies are gathering ever-increasing amounts of data that drive everything from business intelligence dashboards to platform workflows. Data engineers design the infrastructure that stores, moves, and integrates this data from many different sources. They create environments in which data scientists can analyze data (lakes and warehouses), enable ETL at scale, and optimize the ecosystem to ensure continuous insights.
Experts estimate that it takes two to three data engineers per data scientist to maintain that pipeline, which drives the high demand for these engineers.
What Are the Core Competency Skills?
To better understand who employers are seeking for data engineering jobs, we examined our own jobs portal as well as other sources. In total, we reviewed over 1,350 job postings for this role from Q2 and Q3 of 2020.
Fig 2. Source: odsc.com
Definitions of the data engineer role vary, but generally a data engineer is someone who:
- Understands data platforms, infrastructure frameworks, and data pipeline workflows.
- Knows how to source, transform, and analyze data at scale.
- Employs programming skills and advanced techniques like machine learning to create the “glue” code that binds the data workflow lifecycle.
Fig 2 shows a representative set of skills sourced from our job portal for this role. Unsurprisingly, data skills, such as databases, SQL, NoSQL, and ETL (extract, transform, load), feature prominently. From the postings, it is evident that this role’s domain is the cloud and that experience with AWS or Azure is a top requirement.
Additionally, the postings indicate that the ability to engineer data at scale, using a mixture of pipelines and other infrastructure tools, is a defining feature of this role. Knowledge of areas such as distributed systems, APIs, big data, and data lakes, as well as of platforms such as Hadoop, Apache Spark, Kafka, and Airflow, is also important.
Programming skills also feature prominently in job postings, with Python leading the pack. Java also made the cut, which is unsurprising since it has been the main infrastructure language for more than a decade. Scala, the language of Apache Spark, is also in high demand. Languages that many data scientists favor, such as R, did not feature prominently. However, machine learning, an expansive topic in itself, was one of the top requirements for a data engineering position.
According to this report, there is an overlap in skills between a data engineer and a machine learning engineer. This is not surprising, as data engineers need machine learning and analysis skills to build environments that facilitate large-scale AI-driven projects.
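For readers new to the field, the ETL (extract, transform, load) skill that appears throughout these postings can be illustrated with a very small sketch in Python, the language that led the postings. Everything here is an assumption made for illustration: the sample rows are invented, the function and table names are hypothetical, and an in-memory SQLite database stands in for a real warehouse. Production pipelines would typically rely on tools such as Spark or Airflow rather than a single script.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: a tiny CSV of job postings (invented sample data).
RAW_CSV = """title,skill,salary
Data Engineer, sql ,120000
Data Engineer,Python,
ML Engineer,python,135000
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, lowercase skills, drop rows with no salary."""
    cleaned = []
    for row in rows:
        salary = row["salary"].strip()
        if not salary:
            continue  # skip incomplete records
        cleaned.append(
            (row["title"].strip(), row["skill"].strip().lower(), int(salary))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings "
        "(title TEXT, skill TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline():
    """Run extract -> transform -> load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(RAW_CSV)), conn)
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    query = "SELECT title, skill, salary FROM postings ORDER BY salary"
    for row in conn.execute(query):
        print(row)
```

Keeping the three stages as separate functions mirrors how larger pipelines are usually organized, so each stage can be tested or replaced on its own.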
Who’s Hiring Now?
Fig 3. Who’s Hiring Now
To determine who’s hiring data engineers in the USA, we looked at a sample set from various job sites in August and September. Big tech companies, including Amazon, Facebook, and Apple, are actively recruiting. The finance industry is weathering the pandemic better than many, and that’s reflected in strong hiring by Capital One, JPMorgan Chase, and USAA, among others. Healthcare companies like United Healthcare and CVS Health are also hiring teams of data engineers. Interestingly, consulting companies Capgemini and Accenture are also hiring, as many industries seek guidance on navigating the crisis.
Paths to Data Engineering & Upskilling
The paths to a data engineering role vary. Upcoming virtual training conferences, such as ODSC West, are ideal for individuals looking to break into the field or acquire the latest skills. ODSC West features a full track devoted to Data Engineering and MLOps, including introductory-level topics for beginners and intermediate- and advanced-level topics for experienced engineers, such as:
- End-to-end AI Application Development with Programmatic Supervision
- Data Science: How Do We Achieve the Most Good and Least Harm?
- MLOps in DL Model Development
- Model Governance: A Checklist for Getting AI Safely to Production
- Rapid Data Exploration and Analysis with Apache Drill – Beginner-Intermediate
- Automated Model Management with ML Works – Demo Talk (All Levels)
- How to Increase ML Server Utilization With MLOps Visualization Dashboards
- Lessons from KPI Monitoring and Diagnosis at Scale
- Prioritize ML Operations at Any Maturity Level