Using GRAKN.AI to reason over an R dataset
Posted by Jo Stichbury, June 13, 2017
In this article I will introduce an open-source knowledge graph platform called GRAKN.AI. I’m going to use it to load a simple dataset, and show how to calculate basic statistics such as maximum and mean values. A good question at this point would be: as a data scientist, surely there are easier ways for me to make such simple calculations? The answer: yes, there are! But I’ve chosen this familiar example to introduce the knowledge graph paradigm, the strength of which comes into play for large amounts of highly interconnected data. To keep the example simple, I’ve removed accidental complexity by using a familiar dataset in a new way.
This article will be useful for data scientists interested in using a new approach to modeling complex and/or big datasets. You don’t need any experience with GRAKN.AI to understand it, because I’ll explain the key concepts as I go along. So let’s get started!
What is GRAKN.AI?
GRAKN.AI is an open-source distributed knowledge base with a reasoning query language called Graql (not to be confused with GraphQL) that enables you to query for explicitly stored data and implicitly derived information. It is built on graph computing (Apache TinkerPop), which allows you to traverse links to discover how remote parts of a domain relate to each other. Graph-computing techniques and algorithms, such as shortest-path computations or network analysis, can be applied to add further intelligence over the stored data.
Some of the potential applications include: semantic search, automated fraud detection, intelligent chatbots, advanced drug discovery, dynamic risk analysis, content-based recommendation engines, and knowledge management systems.
This example uses a dataset that will be familiar to students of R: the mtcars (Motor Trend Car Road Tests) data. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 other aspects of automobile design and performance for 32 automobiles (1973–74 models). I took a CSV file of the mtcars data and added two new columns to indicate the car maker’s name and the region that the car was made in (Europe, Japan, or North America). This file, and everything else for this example, can be found on GitHub, and is also included in the examples folder of the GRAKN.AI distribution.
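As a rough illustration of that preparation step, the sketch below adds maker and region columns to a few mtcars rows in Python. The column names `maker` and `region`, the rule of taking the first word of the model name as the maker, and the `REGION_BY_MAKER` lookup are all assumptions made for illustration; the file in the repository may be laid out differently.

```python
import csv
import io

# Four rows of mtcars (model and mpg only), inlined to keep the sketch self-contained.
SAMPLE = """model,mpg
Mazda RX4,21.0
Toyota Corolla,33.9
Fiat 128,32.4
Cadillac Fleetwood,10.4
"""

# Hypothetical lookup from car maker to region of manufacture.
REGION_BY_MAKER = {
    "Mazda": "Japan",
    "Toyota": "Japan",
    "Fiat": "Europe",
    "Cadillac": "North America",
}


def add_maker_and_region(csv_text):
    """Return the rows of csv_text with 'maker' and 'region' columns added."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: the maker is the first word of the model name.
        maker = row["model"].split()[0]
        row["maker"] = maker
        row["region"] = REGION_BY_MAKER[maker]
        rows.append(row)
    return rows


rows = add_maker_and_region(SAMPLE)
for row in rows:
    print(row["model"], row["maker"], row["region"], sep=" | ")
```

The first-word rule happens to work for these four rows, but it would need checking against all 32 model names before being relied on.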
When working with GRAKN.AI, a key step is the definition of an ontology, which allows you to model the data. We’ve published a number of articles about the GRAKN.AI ontology, including a recent blog post, but, to keep it simple, I’d suggest you think of it rather like a class definition in C++ or Java. An ontology specifies the relevant concepts and their meaningful associations. The ontology has four types of concepts to model the domain: entity, relation, role, and resource.
entity: Objects or things in the domain. For example, car, carmaker.
relation: Relationships between different domain instances. For example, manufactured, which is typically a relationship between two instances of entity types (car and carmaker), playing roles of made and maker, respectively.
role: Roles involved in specific relationships. For example, made, maker.
resource: Attributes associated with domain instances. For example, model. Resources consist of primitive types and values.
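To make the class-definition analogy concrete, the four concept types can be mimicked with plain Python classes. This is an analogy only: the class names below are hypothetical and are not part of GRAKN.AI or Graql, and the `name` attribute on the carmaker is an assumed resource chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Carmaker:
    """Entity: a thing in the domain."""
    name: str  # assumed resource: a primitive-valued attribute


@dataclass
class Car:
    """Entity: a thing in the domain."""
    model: str  # resource: a primitive-valued attribute


@dataclass
class Manufactured:
    """Relation: links two entity instances, each playing a named role."""
    made: Car        # role played by the car
    maker: Carmaker  # role played by the carmaker


toyota = Carmaker(name="Toyota")
corolla = Car(model="Toyota Corolla")
link = Manufactured(made=corolla, maker=toyota)

print(link.maker.name, "manufactured", link.made.model)
```

The field names on `Manufactured` stand in for the roles made and maker, which is the closest a plain class gets to naming the part each instance plays in a relation.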
With GRAKN.AI’s declarative query language, Graql, I have represented the mtcars dataset using the following ontology, stored in the ontology.gql file, although many other variations are possible:
vehicle sub entity;
car sub vehicle;
automatic-car sub car;
manual-car sub car;