An Intro to Building Knowledge Graphs

Editor’s note: Sumit Pal is a speaker for ODSC East this April 23-25. Be sure to check out his talk, “Building Knowledge Graphs,” there!

Graphs and Knowledge Graphs (KGs) are all around us. We use them every day without realizing it. GPS leverages graph data structures and databases to plot routes from point to point. Social media is modeled with graphs. Cell phone technology leverages graphs to figure out phone towers to route the call with a triangulation algorithm as one moves from one place to another.



KGs are built on top of graph databases and are omnipresent too. The moment you use a search engine like Google, Bing, or Baidu, KGs jump into action to provide semantic and contextual search – search based not on "strings" and keywords, but on "things" and concepts.

Emerging data management products – data catalogs, data fabric, and the like – leverage KGs as their core linking and semantic engine. eBay, LinkedIn, BBC, Thomson Reuters, JPMC, NASA, and other Fortune 500 companies routinely leverage KGs.

What is a Knowledge Graph?

Before we discuss KGs, let us take a small detour to understand graph models. There are two types of graph models – the Labeled Property Graph (LPG) and the Resource Description Framework (RDF).


Labeled Property Graph (LPG) 

LPGs use labels on nodes and edges to characterize entities and relationships. Nodes are linked uni- or bi-directionally to other nodes through edges. Both nodes and edges have associated properties, modeled as single-valued key-value pairs with primitive data types. LPGs support "index-free adjacency," which makes them ideal for graph traversals and for implementing graph algorithms such as shortest path between nodes, clustering, and centrality.
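As a rough illustration (plain Python, not a real graph database – the node names and properties are invented), an LPG can be sketched as labeled nodes and edges carrying key-value properties, where index-free adjacency corresponds to each node holding direct references to its neighbors, making traversals like breadth-first shortest path cheap:

```python
from collections import deque

# Nodes with a label and key-value properties (hypothetical data).
nodes = {
    "alice": {"label": "Person", "props": {"name": "Alice"}},
    "bob":   {"label": "Person", "props": {"name": "Bob"}},
    "carol": {"label": "Person", "props": {"name": "Carol"}},
}

# Index-free adjacency: each node lists (edge label, neighbor) directly,
# so traversal needs no global index lookup.
adjacency = {
    "alice": [("KNOWS", "bob")],
    "bob":   [("KNOWS", "carol")],
    "carol": [],
}

def shortest_path(start, goal):
    """Breadth-first search - the kind of traversal LPGs are optimized for."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _edge_label, neighbor in adjacency[path[-1]]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path("alice", "carol"))  # ['alice', 'bob', 'carol']
```

A real LPG database (Neo4j, for example) adds persistence, indexing, and a query language on top of this basic shape.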

Resource Description Framework (RDF) Model 

RDF encodes semantic relationships between data items by breaking them down into a triple structure composed of Subject, Predicate, and Object. The Predicate is the graph edge connecting the Subject and Object endpoints. RDF uses Uniform Resource Identifiers (URIs) to identify the logical or physical resources in a triple.
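As a minimal sketch (the `http://example.org/` namespace and the triples themselves are hypothetical), RDF data can be represented as (subject, predicate, object) tuples identified by URIs, with a pattern matcher that treats `None` as a wildcard – roughly how a basic graph pattern behaves in SPARQL:

```python
# Hypothetical namespace; real RDF data would use dereferenceable URIs.
EX = "http://example.org/"

triples = {
    (EX + "alice", EX + "authored", EX + "kg-article"),
    (EX + "alice", EX + "worksOn", EX + "knowledge-graphs"),
    (EX + "kg-article", EX + "type", EX + "Article"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    similar to a variable in a SPARQL basic graph pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Everything the 'alice' resource relates to:
for _, pred, obj in match(s=EX + "alice"):
    print(pred, obj)
```

A real triple store would add SPARQL, named graphs, and standards-based serialization (Turtle, JSON-LD), but the triple-plus-pattern idea is the core.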

The value of RDF lies in making statements and connecting concepts with relationships. It contextualizes data with ontologies, taxonomies, and vocabularies. RDF is used for data publishing and data interchange and is based on W3C standards. It supports schema evolution, and its formalism allows semantics to emerge. Adherence to standards promotes alignment of meaning, unambiguous interpretation, interoperability, and semantic data integration.

Knowledge Graphs (KGs)

Think of a KG as a graph database with a knowledge toolkit. A KG models the knowledge of a domain as a graph: a network of entities and the relationships connecting them. It models the facts of a domain and includes domain rules.

The knowledge model is a collection of interlinked descriptions of concepts, entities, relationships, and events. Concepts describe the data, while connections provide the context that makes it comprehensible. KGs put data in context via linking and semantic metadata, and provide a framework for data integration, unification, analytics, and sharing.

A KG modeled with RDF supports inferencing and reasoning – that is, deriving new facts from existing ones. This enables entity resolution and relation extraction from structured and unstructured data.
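A minimal sketch of this kind of inferencing is a naive forward-chaining rule over a transitive predicate (the place names and the `locatedIn` predicate are made-up examples): if the graph states (a, p, b) and (b, p, c), the reasoner derives the new fact (a, p, c), repeating until no new facts appear:

```python
# Asserted facts as (subject, predicate, object) triples (hypothetical data).
facts = {
    ("Paris", "locatedIn", "France"),
    ("France", "locatedIn", "Europe"),
}

def infer_transitive(facts, predicate):
    """Naive forward chaining: if (a, p, b) and (b, p, c), derive (a, p, c).
    Repeats until a fixed point - no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = {(a, predicate, c)
               for (a, p1, b) in derived if p1 == predicate
               for (b2, p2, c) in derived if p2 == predicate and b2 == b}
        if not new <= derived:
            derived |= new
            changed = True
    return derived

closed = infer_transitive(facts, "locatedIn")
print(("Paris", "locatedIn", "Europe") in closed)  # True
```

Production RDF reasoners implement far richer rule sets (RDFS entailment, OWL profiles), but they follow this same derive-until-fixed-point pattern.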

Not every graph is a KG. The figure below shows how overlaying an ontology (a shoe ontology, in this case) enhances and enriches the original graph. The enriched graph supports automated reasoning (shown on the right-hand side).

A KG is a representation of an organization's knowledge, domain, and artifacts that is understood by both humans and machines. KGs help organizations create a knowledge model representing the business and the entities in the domain. This semantic network of facts is used for data integration, knowledge discovery, and analysis.

Why Knowledge Graphs – Use Cases

KGs can be used in multiple ways: as a database that can be queried, as a graph that can be analyzed as a network, and as a knowledge base from which new facts can be inferred. KGs can discover previously unknown connections and enable inferencing and rule-based reasoning, automating the generation of new knowledge through the discovery and exploration of data relationships.

Uses and applications of knowledge graphs include data- and information-heavy services such as contextually aware content recommendation, drug discovery, semantic search, investment market intelligence, information discovery in regulatory documents, advanced drug safety analytics, and much more.

The mind maps below show the range of capabilities of KGs.

How to Build Knowledge Graphs

A KG is not a one-off engineering project. Building a KG requires collaboration between functional domain experts, data engineers, data modelers, and key sponsors. It requires an ontology, a taxonomy, a vocabulary, graph databases, semantic mapping tools, a data mapping framework, and data extraction capabilities for heterogeneous sources.

A taxonomy is a classification scheme – a knowledge map, the information model that describes and structures information in a hierarchy. It is effective for organizing content and data, capturing context and meaning, and making data easy to find and understand. Examples include the Dewey Decimal System for books and the biological classification of living things (Kingdom, Phylum, Class, Order, Family, Genus). A taxonomy provides consistent metadata and tagging, helping to improve precision and recall, and is the foundation for building smart search and discovery applications.
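To make the hierarchy idea concrete, here is a small sketch (the shoe-related terms are invented for illustration): a taxonomy can be stored as a child-to-parent map, and walking up that map yields the broader terms used for consistent tagging and for broadening search queries:

```python
# Hypothetical product taxonomy: each term maps to its broader parent term.
parent = {
    "running shoes": "athletic shoes",
    "athletic shoes": "shoes",
    "dress shoes": "shoes",
    "shoes": "footwear",
}

def broader_terms(term):
    """Walk up the hierarchy - the basis for query expansion and
    consistent metadata tagging (improving recall in search)."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

print(broader_terms("running shoes"))  # ['athletic shoes', 'shoes', 'footwear']
```

A search application can index a product under all its broader terms, so a query for "athletic shoes" also finds items tagged only "running shoes."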



An ontology is the schema for graph data; it identifies and distinguishes concepts and relationships. It is a shared vocabulary that describes the semantics of domain data. A lack of ontology creates ambiguity.
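A toy sketch of how an ontology removes ambiguity (the `authored` predicate, class names, and instances are all hypothetical): each predicate declares which classes it connects (its domain and range), and triples can then be validated against that schema instead of being accepted blindly:

```python
# Toy ontology: predicate -> (expected subject class, expected object class).
schema = {
    "authored": ("Person", "Book"),
}

# Class membership for each instance (hypothetical data).
types = {
    "sumit": "Person",
    "sql_engines_book": "Book",
    "everest": "Mountain",
}

def conforms(s, p, o):
    """Validate a triple against the ontology's domain/range constraints."""
    if p not in schema:
        return False  # unknown predicate - the ontology defines the vocabulary
    domain, range_ = schema[p]
    return types.get(s) == domain and types.get(o) == range_

print(conforms("sumit", "authored", "sql_engines_book"))  # True
print(conforms("sumit", "authored", "everest"))           # False
```

Real ontology languages (RDFS, OWL, SHACL) express these constraints declaratively, but the idea is the same: the schema says what kinds of things a relationship may connect.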

Many well-known ontologies and taxonomies are publicly available and can be reused and adapted for domain-specific applications.

The figure below shows the 10 steps to building KGs.

KGs and LLMs

LLMs and KGs cross-pollinate, and their convergence enables synergistic solutions.

LLMs can enrich KGs through relation and event extraction from text. They can aid KG construction through ontology prompting and can generate text descriptions for entities. LLMs can classify entities in a KG and help with knowledge retrieval by generating graph search queries, summarizing graph query results, and explaining complex queries and schemas. Using the RAG pattern, LLMs can be enriched by KGs to leverage proprietary documents and metadata. LLMs also accelerate KG development by bootstrapping from a given ontology or taxonomy.

KGs can improve accuracy and reduce hallucinations in LLMs by providing a factual foundation to anchor and validate responses, allowing LLM output to be supported by reasoning. Their structured domain representation enhances generative AI performance by providing context that furthers understanding. KGs facilitate knowledge retrieval and integration – enriching and integrating diverse structured and unstructured data, and incorporating relevant information into LLM responses. KGs also provide explainability, transparency, and provenance, making it possible to understand and validate LLM responses.

What’s next?

Please join the session "Building Knowledge Graphs" on Day 2, 04/24/2024, at 3:40 PM to learn more about knowledge graphs and how to build them.

About the Author:

Sumit Pal is an ex-Gartner VP Analyst in the Data Management & Analytics space, where he advised CTOs, CDOs, CDAOs, enterprise architects, and data architects on data strategy, data architectures, and data engineering for building data platforms. Sumit spans the spectrum – from formulating data strategy with CDO/CTO teams to architecting, designing, and building data platforms and solutions, to writing, deploying, and debugging code. With more than 25 years of experience in data and software industry roles at companies ranging from startups to enterprise organizations, he has built, managed, and guided teams delivering scalable software systems across the stack – middle tier, data layer, analytics, and ML – covering data engineering, DataOps, data architectures, data lakes and lakehouses, NoSQL, database internals, data warehousing, dimensional modeling, data science, and Java/J2EE. He is the published author of a book on SQL engines and has developed a MOOC course on Big Data. He hiked to Mt. Everest Base Camp in Oct 2016 and blogs at https://sumitpal.wordpress.com.


ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.