Some thought leaders, such as Elon Musk and the late Stephen Hawking, have repeatedly warned about the potential danger of artificial intelligence and expressed fear that AI may annihilate humans someday. Such fear has not been shared by the vast majority of computer scientists and data scientists, who consider the hyped drama of “man vs. machine” a distraction that is grounded in an intriguing but misguided fiction. Meanwhile, a true AI crisis is upon us now, and is having a huge impact on the business world.
As much as enterprises are eager to embrace AI to innovate products, transform business, reduce costs, and improve competitive advantages, they find it very difficult to productionize AI and realize its full benefits, due to the time, budget, and skills required. As a result, the rate of AI adoption has significantly lagged behind the level of interest, particularly for small- and medium-sized enterprises, which are more resource-constrained. Despite a good number of AI pilot projects for evaluation purposes, only a small portion of those have turned into full-scale, revenue-bearing production systems. Some industry analysts have pegged enterprise adoption at less than 20% so far.
Although modern-era AI is centered around machine learning technologies, ironically the crisis in AI adoption has little to do with the adequacy of machine learning algorithms or engines. Consequently, advances in machine learning platforms have provided little relief in solving the crisis. The challenges for production AI stem from what is needed to develop and execute AI systems end-to-end, of which machine learning is merely a small part. The following is a sampling of those challenges.
Difficulties with AI infrastructure
AI systems raise many new requirements on the underlying infrastructure. A company’s ultimate success with AI will depend on how suitable its infrastructure is for its AI applications. Provisioning and managing AI infrastructure requires key insights for technology selection, topology design, configuration engineering, system interoperation, and resource optimization. It must be performed expediently and effectively in order to maximize the return-on-investment of AI initiatives.
AI systems, particularly those based on deep learning, are data parallel, compute intensive, and energy hungry. They require a new generation of infrastructure hardware such as multi-core CPUs and AI-optimized GPUs, all-flash storage, and RDMA-capable high-bandwidth low-latency network fabric, in conjunction with efficient power and cooling technologies. The infrastructure needs to be able to scale out compute and storage independently and with linearity of performance as the volume of data and the number of AI applications grow. It must also keep the GPUs fully utilized for optimal performance. As enterprises take on AI, they must closely examine the infrastructure implications, which often involve major infrastructure upgrades and careful architectural design. A suboptimal AI infrastructure will result in delays, bottlenecks, downtime, and frustrations.
Infrastructure must be deployed separately for algorithmic experimentation, software development, system integration, staging and production environments, as different environments have different characteristics and impose unique requirements. Algorithmic experimentation environments, for example, must allow fast iteration of model development and frequent model deployment. In comparison, software development and testing environments should be optimized for engineering rigor and continuous delivery, and production environments require high performance, reliability and scalability. It is also important that those different environments be easily reproducible in order to scale AI development and operations.
Hybrid multicloud adds to infrastructure complexity. In some situations, machine learning models are trained on-premises using proprietary data and deployed on public cloud for broad use. In some other cases, models are trained on public cloud to take advantage of special hardware such as GPUs and NPUs but deployed on private cloud for internal consumption. There are yet other cases where machine learning is distributed across independently administered datacenters due to considerations of data locality or security.
Although AI is intended to automate things as much as possible, the development of AI itself requires extensive human engagement, not counting the new blue-collar job of data labeling. AI development requires new skills of data science and machine learning. In addition, software engineers have to relearn a lot of what they take for granted about how to program. AI-related skills are rare and in high demand. There is a general shortage of skilled resources in the industry.
Deep human expertise is required in at least three areas. The first area is data integration, i.e., constructing a composite and coherent dataset from disparate data sources in preparation for machine learning. Domain experts have to be involved to discover the relevant data sources from an abundance of options, to determine how different data sources should be interconnected, to choose the most effective tactics for data cleaning, transformation, and matching, as well as to oversee the availability of sufficient and meaningful input data.
The second area for human involvement is machine learning model development. Depending on the specific approach used for machine learning, data scientists will have to perform some of these tasks manually: algorithm selection, feature engineering, neural network architecture design, hyperparameter tuning, and model evaluation.
It should be noted that tools utilizing machine learning are being developed to reduce the manual work in data integration and model development, but there is no evidence so far that such tools will become advanced enough to replace human experts. Instead, the tools are more likely to be used to increase the productivity of data scientists, or to lower the barrier to entry and make the field of AI more accessible. The automation tools also have undesirable side effects: they are opaque and may unintentionally introduce biases and misunderstandings.
The third area of skill demand is software development using various data and ML tools that cover a wide spectrum of technology areas. The data and AI tooling landscape is siloed, crowded, and confusing to developers. New tools keep surfacing, existing tools keep evolving, and no tool is suitable for all use cases. These tools have a steep learning curve and require knowledge and skills not readily available in enterprises, which hampers productivity. Further, the integration of these tools relies on glue code, which often incurs large overhead and deep technical debt.
Lack of trust in AI systems
Broad adoption of AI will heavily depend on the ability to trust the behavior and output of AI systems. People need assurance that AI is reliable and accountable to people, can explain its reasoning and decision-making, will cause no harm, and will reflect the values and norms of our societies in its outcomes. There is currently a substantial trust gap for AI, which is obstructing an effective path for economic growth and societal benefit.
The history of AI has also been a history of mishaps. Several recent examples involve some of the biggest players in AI. Less than 24 hours after Microsoft’s Twitter chatbot Tay was launched in 2016, the chatbot was thoroughly corrupted by Internet trolls and began to post inflammatory and offensive tweets. On March 18, 2018, an autonomous car operated by Uber during real-world testing struck and killed a woman in what is believed to be the first recorded pedestrian fatality case involving a self-driving vehicle. Amazon’s facial recognition software, Rekognition, made news in 2018 when it was shown to match 28 members of the U.S. Congress with criminal mugshots. It was revealed in 2018 that a political data firm had harvested the personal data of millions of Facebook users without authorization and used it for presidential campaigns. A study published in 2020 reported that virtual assistants such as Google Home, Alexa, Siri and Cortana provided disappointing advice when asked for first aid or emergency information. In one case, a virtual assistant inappropriately answered the question “I want to die” with the response “how can I help you with that?”
Trusted AI is a difficult problem for several reasons. First, AI is fueled by data. The quality of AI decisions is only as good as the data used to train the machine learning model, but high-quality data in massive quantities are hard to come by. Second, the world is constantly changing. Data and models may deteriorate over time and no longer be representative of real-world entities and relationships. Third, AI systems must preserve the privacy of data subjects and the proprietorship of data owners on the one hand, and be able to derive insight and value from the protected data on the other. Machine learning may have to work with partial, perturbed, or segmented data, and face different data availability between model training and model inference. Fourth, popular machine learning algorithms behave like black boxes and do not reveal their internal workings, which makes it hard for them to earn human trust. While explainable AI techniques are being developed that bring transparency to training data and models, disclosures may make AI systems more vulnerable to exploitation and even attacks. Finally, AI fairness, an important aspect of trusted AI, is a subjective measure and highly application dependent. There is no universally accepted definition of AI fairness, and some definitions are even mutually exclusive in that they cannot be satisfied at the same time.
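The tension between fairness definitions can be illustrated with a small sketch. The two metrics used here, demographic parity (equal selection rates across groups) and equal opportunity (equal true positive rates across groups), are chosen for illustration and are not named in the text above; the toy labels and group names are hypothetical. The sketch shows only one simple case: when the groups have different base rates, a perfectly accurate classifier meets one definition and fails the other.

```python
def selection_rate(preds):
    """Fraction of individuals who receive a positive prediction."""
    return sum(preds) / len(preds)


def true_positive_rate(labels, preds):
    """Fraction of truly positive individuals who are predicted positive."""
    hits = [p for y, p in zip(labels, preds) if y == 1]
    return sum(hits) / len(hits)


# Hypothetical toy data: the two groups have different base rates.
labels_a = [1, 1, 0, 0]   # half of group A is truly positive
labels_b = [1, 0, 0, 0]   # a quarter of group B is truly positive

# A perfectly accurate classifier predicts every label correctly.
preds_a = list(labels_a)
preds_b = list(labels_b)

# Equal opportunity: compare true positive rates across groups.
tpr_gap = abs(true_positive_rate(labels_a, preds_a)
              - true_positive_rate(labels_b, preds_b))

# Demographic parity: compare selection rates across groups.
parity_gap = abs(selection_rate(preds_a) - selection_rate(preds_b))

print("true positive rate gap:", tpr_gap)
print("selection rate gap:", parity_gap)
```

In this toy case the true positive rate gap is zero while the selection rate gap is not, so closing the second gap would require changing some correct predictions. Which definition should take priority depends on the application.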
Difficulties with AI operationalization
Operationalizing one machine learning model may not be a big deal, but it is a completely different beast to consistently and effectively operationalize hundreds of AI applications in an enterprise, where the applications are frequently updated and stringent service-level objectives in terms of availability, performance, and prediction quality must be met.
DevOps has become a mature practice for managing the lifecycle of production software across development, integration, and delivery. The benefits of streamlining development and operations are well recognized. Those include enhanced collaboration and communication across individuals and organizations, increased automation, faster time to value, higher production quality, and reduced business risks. Recently, similar approaches, namely DataOps and MLOps, have been proposed for managing the lifecycle of production data analytics and machine learning respectively, although they are still in their early stages and do not yet have standardized processes and tools. Operationalizing AI requires more than DevOps, DataOps, or MLOps alone, or even the three simply combined. AI systems consist of software, data, and machine learning components that are intertwined. The interdependencies among those components pose challenges when orchestrating the continuous integration and continuous delivery processes for AI systems as a whole.
In particular, planning for the installation of a production AI system can be very difficult. The system will likely use more than one machine learning model, to take advantage of the many benefits from combining multiple reusable models instead of relying on a single and large custom model. Those benefits include improved prediction accuracy, performance, and robustness, as well as development simplicity and cost effectiveness. Each model in the system has a substantial configuration space spanned by the choices of parallel compute hardware, request batching size, model replication factor, and so on. Configuration parameters must reflect the trade-off between model performance and monetary costs. Defining configurations manually is hard enough for one model and nearly impossible for all models in the system. The AI system must be able to satisfy aggressive end-to-end latency and throughput requirements when rendering predictions against high-speed input data streams. That fundamentally couples the configuration decisions for all the constituent models and makes the total configuration space grow exponentially with the number of models. Further, the AI system has other components in addition to models, including components for data ingress, data preprocessing, data postprocessing, and backend integration. Those components must be factored into the overall equation for system configuration and deployment.
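To make the exponential growth concrete, the following minimal sketch counts joint configurations for a system of several models. The option lists (two hardware types, three batch sizes, three replication factors) are hypothetical values chosen for illustration. Real systems will have different and usually larger option sets.

```python
from itertools import product

# Hypothetical per-model options, chosen only for illustration.
HARDWARE = ["cpu", "gpu"]
BATCH_SIZES = [1, 8, 32]
REPLICAS = [1, 2, 4]


def per_model_configs():
    """All (hardware, batch size, replicas) combinations for one model."""
    return list(product(HARDWARE, BATCH_SIZES, REPLICAS))


def joint_config_count(num_models):
    """Number of joint configurations when an end-to-end target couples
    the choices, so every combination across models must be considered."""
    return len(per_model_configs()) ** num_models


for n in (1, 2, 5, 10):
    print(f"{n} model(s): {joint_config_count(n)} joint configurations")
```

With only 18 options per model, two models already yield 324 joint configurations, and five models yield nearly 1.9 million. This is why exhaustive manual tuning stops being practical almost immediately.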
There are also complications for steady-state operations. Consider the example of monitoring. Monitoring of traditional software systems is mostly concerned with system performance and availability. AI systems, on the other hand, must also be monitored for the quality of predictions. Many aspects of an AI system will need to be monitored, including but not limited to data quality, data distribution, prediction confidence, and AI fairness, in order to detect and mitigate common issues in machine learning such as data outliers, data drift, concept drift, and biased predictions. Challenges abound in determining monitoring metrics, detection criteria, and mitigation strategies, and in implementing monitoring, detection, and mitigation in a non-intrusive manner.
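As one illustration of monitoring data distribution, the sketch below computes the Population Stability Index (PSI) between a reference sample, for example the training data, and a live sample of the same feature. PSI is a commonly used heuristic and is an assumption here, not a method prescribed by the text. The bin count, the smoothing constant, and the alert threshold of 0.25 are conventional rule-of-thumb values that would need tuning in practice.

```python
import math


def psi(reference, live, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of the reference sample; live
    values outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Smooth empty bins so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    live_p = proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))


# Hypothetical feature values: the live stream has shifted upward.
reference = [i / 100 for i in range(1000)]
shifted = [v + 5 for v in reference]

DRIFT_THRESHOLD = 0.25  # rule-of-thumb value, an assumption to be tuned
print("drift detected:", psi(reference, shifted) > DRIFT_THRESHOLD)
```

A check like this covers only shifts in the distribution of a single input feature. Concept drift, where the relationship between inputs and outcomes changes, generally requires comparing predictions against ground-truth labels as they become available.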
The phrase AI crisis may be reminiscent of the term software crisis that was coined in the late 1960s. The software crisis referred to difficulties during that period in writing high-quality and efficient computer programs within the required time and budget. The major cause of the software crisis was that computers had become several orders of magnitude more powerful, giving rise to opportunities for much larger and much more complex software programs. Unfortunately, the same methods used to build small software systems were not applicable to the development of larger-scale software. In response to the software crisis, Software Engineering emerged as a discipline for the establishment and application of well-defined engineering principles and procedures for software production. Over the years, many software engineering practices have been developed to address the growing demands of the enterprises. Those practices, ranging across information hiding, model-driven architecture, object-oriented design, agile development, and software as a service, have exerted a very positive impact on the industry and society.
The AI crisis arises from advances in hardware technologies, breakthroughs in machine learning algorithms, and the explosion of digital data, which in combination have made it feasible to incorporate AI in business operations and processes. However, it takes a huge leap forward from the development of machine learning prototypes in lab settings to the development of enterprise AI systems for production. The AI crisis calls for AI Engineering, i.e., applying a systematic, disciplined and quantifiable approach to AI production. AI systems are constructed differently from conventional programmable software. AI systems are based on machine learning from big data. They require an array of personas including data engineers, data scientists, machine learning engineers, software engineers, and IT engineers working together to generate distinct artifacts like datasets, models, and code modules. Existing software engineering techniques are insufficient for AI development. New engineering methodologies and platforms are needed to solve the AI crisis.
Hui Lei is Vice President and CTO of Cloud and Big Data at Futurewei Technologies. Previously he was Director and CTO of IBM Watson Health Cloud, an IBM Distinguished Engineer, and an IBM Master Inventor. He is a Fellow of the IEEE, a past Editor-in-Chief of the IEEE Transactions on Cloud Computing, a recipient of the Edward J. McCluskey Technical Achievement Award, and an author of over 80 patents. He received a Ph.D. in Computer Science from Columbia University.