Databases for the Era of Artificial Intelligence
Posted by ODSC Community, April 17, 2023
Article originally posted here by Misha Obolonskyi. Reposted with permission.
ChatGPT can carry on plausible conversations, write poetry in the style of Dylan Thomas or Shakespeare, and might do as good a job writing this blog post as I will! But without vector search, generative AI can’t reach its full potential.
Human-like outputs call for a human-like memory. Hence our interest in vector databases, which store data in a manner that is functionally analogous to the way we humans do. However, while generative AI needs vector search to fulfill its promise, vector search does not need generative AI to create value; it already does so in countless other ways.
When you hear the phrase “data infrastructure,” it’s likely that what comes to mind is some kind of relational database. Depending on your age, you may envision databases of the type pioneered by Microsoft and Oracle around 1995, or ones made more robust thanks to Apache Hadoop’s open-source approach to distributed computing. Or you may envision data residing in the cloud, courtesy of companies like Snowflake and Databricks. Collectively, the companies in this space represent about a trillion dollars in market value.
But this post is about the data that can’t be practically stored in relational databases – which is to say, most of it.
The challenge is the opportunity: 80% of the world’s data is unstructured.
Companies, institutions, governments, and organizations of all types have accumulated something on the order of 100 zettabytes of unstructured data. This data comes in the form of text in every language, images and video, spoken language and music, and even genetic or molecular information.
It’s not just that there is more unstructured than structured data. The amount of unstructured data is growing at about five times the rate of structured data. And, although it’s unstructured, it often has more contextualized meaning and value than can be gleaned from some row on a spreadsheet.
That brings me back to the “$100 billion company” claim I cited earlier. Since the dot-com crash, roughly $1 trillion in value has been created on the strength of structured data by Oracle, Microsoft, Snowflake, and others. How much value can we unlock from the vastly larger, informationally richer, and faster-growing base of unstructured data?
One reason so much of the value in unstructured data has remained inaccessible is that systems for storing and searching it were complex and expensive to run at scale. Add a shortage of ML engineers, and only the largest tech companies could afford to develop the technology. eBay built a visual search system in 2017; Google announced embeddings-based models in 2014 and open-sourced BERT in 2018; the personalized search and recommendation engines at Netflix and TikTok are other examples.
But the 2020s will be the decade of unstructured data, and there’s a huge opportunity for innovators, founders, and investors because several factors are coming together to democratize the ability of enterprises at any scale to extract value from their unstructured data:
- A widely shared understanding of the value of unstructured data.
- The availability of affordable and accessible compute to process larger volumes of data.
- The wide adoption of transformer and image models developed by OpenAI, Stability AI, and Hugging Face, which has significantly reduced the cost of converting unstructured data, such as an image, into embeddings or vectors.
- The rise of generative AI, and particularly the understanding that vector databases are becoming a fundamental part of an intelligence stack.
Vector databases: The basics
Computers process data numerically, so to extract contextual meaning from an image, we first pass it through an AI model trained on many similar images. The model converts the image into a numerical vector, also called an embedding, through operations such as convolutions and pooling.
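To make "convolutions and pooling" concrete, here is a minimal sketch in plain Python. It is an illustration under loud assumptions, not how any production model works: the two edge-detecting kernels are hand-picked stand-ins for filters a real model would learn from data, and the function names (`convolve2d`, `max_pool`, `image_to_vector`) are hypothetical.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers use."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; leftover edge rows/columns are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def image_to_vector(image, kernels):
    """Apply each kernel, pool the result, and flatten into one vector."""
    vector = []
    for kernel in kernels:
        pooled = max_pool(convolve2d(image, kernel))
        vector.extend(value for row in pooled for value in row)
    return vector


# Assumption: hand-picked edge detectors stand in for learned filters.
VERTICAL_EDGE = [[1, -1], [1, -1]]
HORIZONTAL_EDGE = [[1, 1], [-1, -1]]

# Toy 5x5 grayscale image: bright left columns, dark right columns.
image = [[9, 9, 0, 0, 0] for _ in range(5)]

vector = image_to_vector(image, [VERTICAL_EDGE, HORIZONTAL_EDGE])
print(vector)  # the vertical-edge filter responds; the horizontal one does not
```

The resulting list of numbers is the kind of object a vector database stores. Real embeddings typically have hundreds or thousands of dimensions rather than eight.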
By storing vectors in a database and indexing them, for example with clustering algorithms, we can group them by similarity. We can then query the database with a new vector and retrieve the stored images, text, or video whose vectors lie closest to it, which is to say the semantically similar ones.
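The query step can be sketched in a few lines of Python. This is a toy, assuming hand-written three-dimensional vectors and a hypothetical `TinyVectorStore` class that compares the query against every stored vector by cosine similarity. Real vector databases avoid this brute-force scan by using approximate nearest-neighbor indexes, some of which are cluster-based.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store that scans every vector on each query."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, vector))

    def search(self, query, k=3):
        """Return the k most similar items as (item_id, score) pairs."""
        scored = [
            (cosine_similarity(query, vector), item_id)
            for item_id, vector in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


# Assumption: made-up embeddings; a real system would get them from a model.
store = TinyVectorStore()
store.add("car", [0.9, 0.1, 0.0])
store.add("truck", [0.8, 0.2, 0.1])
store.add("dog", [0.0, 0.1, 0.9])

results = store.search([1.0, 0.1, 0.0], k=2)
print([item_id for item_id, _ in results])  # the two vehicle entries rank first
```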
Why vector databases matter to businesses
Vector databases allow companies to access and analyze massive amounts of unstructured data, giving them a better understanding of their customers. Companies can build sophisticated and personalized digital experiences that yield higher conversion rates and improved business outcomes. And now, companies can accomplish this with minimal data science experience – limiting the cost, recruiting, and retention challenges associated with hiring for that skill.
One advantage of search engines that retrieve results based on meaning and similarity is that they help avoid zero-results retrieval problems. For example, Disney’s popular “Mandalorian” series is not available on either HBO Max or Apple TV. Searching for Mandalorian on HBO Max or Apple TV yields a dead-end message with no results.
But searching for that title on Netflix does not draw a blank screen. Thanks to similarity search, it can suggest results that are conceptually similar.
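The difference between the two behaviors can be sketched as a keyword search that falls back to similarity when it finds nothing. Everything here is assumed for illustration: the catalog titles are invented, the three vector dimensions (sci-fi, western, comedy) are made up, and `embed_query` is a hard-coded stand-in for an embedding model. It does not describe how any streaming service actually implements search.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical catalog: title -> (sci-fi, western, comedy) scores.
CATALOG = {
    "Space Bounty Hunter": [0.9, 0.6, 0.1],
    "Frontier Gunslinger": [0.1, 0.9, 0.1],
    "Office Sitcom": [0.0, 0.0, 1.0],
}


def embed_query(text):
    """Stand-in for an embedding model; it knows one hard-coded query only."""
    known = {"mandalorian": [0.8, 0.7, 0.1]}
    return known.get(text.lower())


def keyword_search(text):
    """Exact substring match on titles: returns [] when nothing matches."""
    return [title for title in CATALOG if text.lower() in title.lower()]


def search(text, k=2):
    """Try keyword match first, then fall back to the k most similar titles."""
    exact = keyword_search(text)
    if exact:
        return exact
    query = embed_query(text)
    if query is None:
        return []
    ranked = sorted(CATALOG, key=lambda t: cosine(query, CATALOG[t]), reverse=True)
    return ranked[:k]


print(keyword_search("Mandalorian"))  # empty: the dead end
print(search("Mandalorian"))          # conceptually similar titles instead
```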
Besides similarity search, vector databases enable a number of other use cases:
- Search engines
- Automated Q&A
- Recommendation engines
- Classification or near-classification solutions (e.g. anomaly detection)
- Deep personalization
- Low-latency edge AI applications
- And, once again, generative AI tools like ChatGPT
Vector databases will become essential elements in the Foundational ML/LLM stack
ChatGPT, LangChain, Midjourney, and Stable Diffusion dominate today’s discourse. And while large language models show a remarkable ability to mimic human reasoning and creative output, they are still not reliable. Even the most advanced models, such as GPT-4, which are greatly improved by RLHF techniques, still hallucinate, stray from the facts, or omit them.
Besides trust issues, LLMs face other shortcomings:
- They are limited in how much information, knowledge, concepts, and abstractions they can store.
- Their knowledge is frozen at training time. ChatGPT’s knowledge cutoff is September 2021, and only giants like Microsoft or Google, who have access to massive compute, can retrain a model to encode new information.
A vector database addresses these gaps by acting as an external memory: relevant, up-to-date records are retrieved by similarity and passed to the model as context at query time. The use of a vector database will also facilitate tracking data lineage. This matters because if users are to trust model output, they will need to be able to track the movement of data over time from the source system through different forms of persistence and transformations. These are some of the reasons we’re certain that vector databases will become essential elements in any practical and reliable AI stack.
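Here is a minimal sketch of how retrieval and lineage can fit together, under stated assumptions. The `Record` fields, the file names, and the dates are hypothetical, and the `embed` function is a toy word-count embedding standing in for a real embedding model. No LLM is called; the code only assembles the prompt a model would receive.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    source: str   # lineage: which system or file the text came from
    updated: str  # lineage: when the text was last refreshed


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, records, k=1):
    """Return the k records most similar to the question."""
    query = embed(question)
    ranked = sorted(records, key=lambda r: cosine(query, embed(r.text)), reverse=True)
    return ranked[:k]


def build_prompt(question, records, k=1):
    """Prefix the question with retrieved context, tagged with its lineage."""
    hits = retrieve(question, records, k)
    context = "\n".join(f"[{r.source}, {r.updated}] {r.text}" for r in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical records, dated after the model's knowledge cutoff.
RECORDS = [
    Record("refund policy allows returns within 30 days",
           "policies/refunds.md", "2023-03-01"),
    Record("shipping takes five business days",
           "policies/shipping.md", "2023-02-10"),
]

print(build_prompt("what is the refund policy", RECORDS))
```

Because every retrieved passage carries its source and refresh date into the prompt, an answer can be traced back to the record that informed it, and new information becomes available without retraining the model.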
In conclusion, our cognitive processes involve perceiving, storing, and structuring information in a specific manner. Subconsciously, we are able to group concepts into semantic categories, such as recognizing that cars and trucks belong to the same group, while cars and dogs are fundamentally different. Throughout our lives, we continuously gather and organize this information, forming the basis of our knowledge.
So far, vector database technology has emerged as the most promising approach to mimicking the way human memory functions. By employing vector databases in conjunction with large language models, we can develop computer-based systems that closely emulate the complex processes of human cognition. This synergy not only deepens our understanding of the relationship between information and knowledge but also paves the way for truly human-like artificial intelligence.