In 2015, AlphaGo – powered by machine learning (ML) – became the first computer program to beat a professional player at Go, considered one of the most complex strategy games in existence. In 2017, AlphaGo Master, the next generation of the program, beat the No. 1 ranked player in the world at that time. Both AlphaGo and AlphaGo Master demonstrated that ML could surpass human performance.
Many people get excited and attach ML/artificial intelligence (AI) to every aspect of human activity, claiming these technologies will replace the majority of human jobs – even professional ones.
However, there are many challenges when applying ML to real-world applications. Imagine if AlphaGo could only see part of the Go board, or if there were hidden rules not defined upfront. In the real world, applications are not usually defined as clearly as the black and white stones on the Go board. And in the industrial space, the environment is far more complicated: human behavior and machine operations are tangled with physical, chemical, and biological processes on mechanical, electrical, and electronic equipment. This complexity introduces specific challenges to industrial AI applications from both algorithm and data perspectives.
An analogy: AI/ML is to the digital revolution what the steam or gas engine was to the industrial revolution. Imagine each AI/ML algorithm is an engine. You need different kinds of engines for different applications; there is no one-size-fits-all. For example, an engine designed for a Ferrari is not necessarily the best fit for a tractor used on a farm.
There are some specific requirements for the AI engine in industrial applications:
- High fidelity requirement: The most critical requirement is the high expectation around sensitivity and specificity. Take, for example, an online shopping recommendation system or a movie recommendation system. If you find a couple of things you like in the listed recommendations, you may think this AI-driven functionality is amazing. But in a machine-failure prediction system, missing one failure will likely cause you to question the reliability of the system, even if it catches the other 99% of failures. Why? One false prediction in an industrial environment may cause production loss, labor cost, and project delay, or even a catastrophic failure, which could cause millions of dollars in damages or lost production, environmental impacts, or severe injuries.
- Explainable/actionable result requirement: Due to the high stakes on the table, engineers and technicians who have been working in the field for many years may not trust black-box recommendations if no one can explain how the predictions were made, because there are real-world consequences for each action taken (or not taken). In order to build trust, AI output must be explainable and actionable.
- Domain boundary requirement: AI/ML has to provide useful information within the boundary of domain knowledge. AI/ML relies on data, and data are collected from physical systems that follow physical laws. Often I hear domain experts say, “Do not tell me something I already know, or something that does not make sense in my domain or violates certain ethics/standards/guidelines.”
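To make the high fidelity requirement above concrete, here is a minimal sketch of how sensitivity and specificity are computed from labeled outcomes. The counts are hypothetical and chosen only for illustration (they do not come from any real system), and the function name `sensitivity_specificity` is an assumption of this sketch.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = failure."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # failures missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, no alarm
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical data: 100 real failures and 9,900 normal operating periods.
# The detector catches 99 failures, misses 1, and raises 99 false alarms.
y_true = [1] * 100 + [0] * 9900
y_pred = [1] * 99 + [0] * 1 + [1] * 99 + [0] * 9801

sens, spec = sensitivity_specificity(y_true, y_pred)
missed_failures = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
false_alarms = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
print(f"missed failures={missed_failures}, false alarms={false_alarms}")
```

Under these assumed counts, both rates are 99%, yet the system still misses one real failure and sends crews to investigate 99 alarms that turn out to be nothing. Both numbers matter in an industrial setting.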
Data challenges specific to Industrial AI
If we think of AI/ML as a gas engine, then data is the oil to power the AI/ML algorithms. Owning data is more valuable and crucial than owning algorithms – but there are many specific challenges associated with data in the industrial space.
Engines cannot consume crude oil, so an oil refinery is necessary to transform crude oil into clean gasoline. Industrial data has to go through a similar refining or cleaning process before it can be consumed by ML algorithms. During this process, domain knowledge is the key, and it’s that knowledge that decides how data is processed.
- Dirty data: First and foremost is the “dirty data” problem – every data scientist’s headache. This is not unique to industrial applications, but here it is more complex than missing or redundant data. The data is collected with lots of noise (it may be corrupted or distorted, have a low signal-to-noise ratio, or contain otherwise meaningless information) and varies from source to source. Cleaning it is one of the most challenging parts of the process due to environmental issues, budget constraints, human factors, and other limitations.
- Class imbalance: For most ML algorithms to work, they need to be taught with examples; this is called training data. Ideally, training data includes all possible patterns with clearly labeled outcomes. For failure detection in industrial applications, there are no gold standards, and normal/faulty patterns are usually context-dependent. The reasons for machine failure continue to evolve, and there is no black-and-white boundary that lets an algorithm draw clear distinctions. Furthermore, failures are rare in the industrial environment due to all of the safety designs and features. This has two consequences: (1) there are not enough failure patterns in your training data set, and (2) not all failures have data.
- Data labeling: Even if you have enough raw data, building a labeled training data set for ML algorithms is still challenging. For ML algorithms to learn, the dataset needs to be categorized into good/bad classes or multiple classes. However, few industrial experts are available to find the failure patterns in data, and they are expensive. Not every company has the luxury, as GE does, of a team of domain experts with decades of experience in the industry.
- Tacit knowledge: Another big challenge is context and situational understanding. Not everything is recorded in a standardized data format. There is context information, tacit knowledge, and domain-specific information. For example, in maintenance logs in a computerized maintenance management system (CMMS), engineers may use jargon and abbreviations to record failure modes, symptoms, and repair actions. Without proper domain knowledge, one may not fully understand the information in a maintenance log.
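The class imbalance item in the list above can be illustrated with a small sketch. The label counts are hypothetical, and `balanced_class_weights` is a name made up for this example; it uses the common inverse-frequency heuristic, n_samples / (n_classes * class_count), which is one possible mitigation among several and not necessarily the approach used in any particular product.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the training labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training set: failures are rare compared with normal operation.
labels = ["normal"] * 995 + ["failure"] * 5

# A useless model that always predicts "normal" still looks very accurate,
# even though it detects none of the failures.
baseline_accuracy = sum(1 for y in labels if y == "normal") / len(labels)

weights = balanced_class_weights(labels)
print(f"always-normal accuracy: {baseline_accuracy:.3f}")
print(f"class weights: {weights}")
```

With these assumed counts, each failure example is weighted about 200 times more heavily than a normal one. Reweighting cannot fix the second consequence noted above, though: failure modes that were never recorded have no examples to weight.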
In this blog, we discussed how industrial data sets and industrial requirements pose challenges for AI/ML. To understand more about these specific challenges and the best practices for addressing them in industrial AI application design, consider attending our upcoming talk at the ODSC East 2020 Virtual Conference this April 14-17, “Challenges and Best Practices in Industrial AI Applications.”
Xiaohui (Mark) Hu is currently director, data & analytics at GE Digital, located in Foxboro, MA. He leads a team of data scientists and analytic engineers to design, develop, and support data science and analytic solutions for various industrial applications and software products. Mark received his doctoral degree in electrical and computer engineering from Purdue University, Indiana, and his bachelor’s degree from Tsinghua University, China. His main research interests are machine learning/computational intelligence, data modeling, prognostics and health management, and industrial AI applications.