
Using GRAKN.AI to reason over an R dataset

Introduction

In this article I will introduce an open-source knowledge graph platform called GRAKN.AI. I’m going to use it to load a simple dataset and show how to calculate basic statistics such as maximum and mean values. A good question at this point would be: as a data scientist, surely there are easier ways for me to make such simple calculations? The answer: yes, there are! But I’ve chosen this familiar example to introduce the knowledge graph paradigm, the strength of which comes into play for large amounts of highly interconnected data. To keep the example simple, I’ve removed accidental complexity by using a familiar dataset in a new way.

This article will be useful for data scientists interested in using a new approach to modeling complex and/or big datasets. You don’t need any experience with GRAKN.AI to understand it, because I’ll explain the key concepts as I go along. So let’s get started!

What is GRAKN.AI?

GRAKN.AI is an open-source distributed knowledge base with a reasoning query language called Graql (not to be confused with GraphQL) that enables you to query for explicitly stored data and implicitly derived information. It is built using graph computing (Apache TinkerPop), which allows you to traverse links to discover how remote parts of a domain relate to each other. Various graph-computing techniques and algorithms can be applied, such as shortest path computations or network analysis, which add further intelligence over the stored data.

Some of the potential applications include: semantic search, automated fraud detection, intelligent chatbots, advanced drug discovery, dynamic risk analysis, content-based recommendation engines, and knowledge management systems.

The Data

This example uses a dataset that will be familiar to students of R: the mtcars (Motor Trend Car Road Tests) data. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 other aspects of automobile design and performance for 32 automobiles (1973–74 models). I took a CSV file of the mtcars data and added two new columns to indicate the car maker’s name and the region that the car was made in (Europe, Japan, or North America). This file, and everything else for this example, can be found on GitHub, and is also included in the examples folder of the GRAKN.AI distribution.

When working with GRAKN.AI, a key step is the definition of an ontology, which allows you to model the data. We’ve published a number of articles about the GRAKN.AI ontology, including a recent blog post, but, to keep it simple, I’d suggest you think of it rather like a class definition in C++ or Java. An ontology specifies the relevant concepts and their meaningful associations. The ontology has four types of concepts to model the domain: entity, relation, role, and resource.

entity: Objects or things in the domain. For example, car, carmaker.

relation: Relationships between different domain instances. For example, manufactured, which is typically a relationship between two instances of entity types (car and carmaker), playing roles of made and maker, respectively.

role: Roles involved in specific relationships. For example, made, maker.

resource: Attributes associated with domain instances. For example, model. Resources consist of primitive types and values.

With GRAKN.AI’s declarative query language, Graql, I have represented the mtcars dataset using the following ontology, stored in the ontology.gql file, although many other variations are possible:

insert

# Entities

vehicle sub entity
is-abstract;

car sub vehicle
is-abstract
has model
has mpg
has cyl
has disp
has hp
has wt
has gear
has carb
plays made;

automatic-car sub car;
manual-car sub car;

carmaker sub entity
is-abstract
has maker-name
plays maker;

japanese-maker sub carmaker;
american-maker sub carmaker;
european-maker sub carmaker;

# Resources

model sub resource datatype string;
maker-name sub resource datatype string;
mpg sub resource datatype double;
cyl sub resource datatype long;
disp sub resource datatype double;
hp sub resource datatype long;
wt sub resource datatype double;
gear sub resource datatype long;
carb sub resource datatype long;

# Roles and Relations

manufactured sub relation
relates maker
relates made;

maker sub role;
made sub role;

There are two main entities, vehicle and carmaker, but I’ve used inheritance (that’s the sub keyword) to set up a hierarchy. A manual-car (or automatic-car) is a subtype of car, which is a subtype of vehicle. Likewise, japanese-maker is a subtype of carmaker, as are american-maker and european-maker. The entities have some resources, such as numerical values to represent fuel consumption, horsepower, etc. for the cars, and string values to represent the name of each carmaker. There is a single relation (manufactured) within the data, between the car and carmaker entities, where the car plays the made role and the carmaker plays the maker role.
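Since the ontology reads rather like a class definition, the hierarchy described above can be sketched as ordinary classes. This is only an analogy in Python, not something Grakn generates or consumes: the class and field names are my own rendering of the ontology's type names, only three of the car resources are shown, and Python does not enforce the is-abstract flag.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    """Stands in for `vehicle sub entity` (abstract in the ontology)."""


@dataclass
class Car(Vehicle):
    """Stands in for `car sub vehicle`; only a few resources are listed."""
    model: str   # resource: model (string)
    mpg: float   # resource: mpg (double)
    hp: int      # resource: hp (long)


class AutomaticCar(Car):
    """automatic-car sub car"""


class ManualCar(Car):
    """manual-car sub car"""


@dataclass
class Carmaker:
    """Stands in for `carmaker sub entity`."""
    maker_name: str  # resource: maker-name (string)


class JapaneseMaker(Carmaker):
    """japanese-maker sub carmaker"""


class AmericanMaker(Carmaker):
    """american-maker sub carmaker"""


class EuropeanMaker(Carmaker):
    """european-maker sub carmaker"""


@dataclass
class Manufactured:
    """The `manufactured` relation: each field is one role."""
    maker: Carmaker  # role: maker
    made: Car        # role: made


# Illustrative instances; the maker name is an assumption, not read from data.gql.
dino = ManualCar(model="Ferrari Dino", mpg=19.7, hp=175)
ferrari = EuropeanMaker(maker_name="Ferrari")
link = Manufactured(maker=ferrari, made=dino)
```

The analogy breaks down in one useful way: in Grakn the relation and its roles are first-class types that can be queried, whereas here Manufactured is just another record.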

The first step in running the example is to load this ontology into a graph. Having installed GRAKN.AI, you start the engine and load ontology.gql by typing the following into a terminal window:


<relative-path-to-Grakn>/bin/grakn.sh start
<relative-path-to-Grakn>/bin/graql.sh -f ./ontology.gql

The next step is to load the mtcars data into the graph. I have munged it into a single data file (data.gql) for easy loading. However, GRAKN.AI does allow you to import CSV (as well as TSV, SQL, JSON and OWL data), so it is perfectly possible to pull it in directly from the CSV file. The readme file in the GitHub repository gives further information.

To load the mtcars data:

<relative-path-to-Grakn>/bin/graql.sh -b ./data.gql

Now you can take a look at the dataset in the Grakn visualiser by pointing your browser to http://localhost:4567/. You can submit queries to check the data, or explore it using the Types dropdown menu.

[Screenshot: the mtcars graph in the Grakn visualiser, colour-coded as follows]

Blue – European manufacturers

Red – Japanese manufacturers

Purple – American manufacturers

Green – Automatic cars

Yellow – Manual cars

Here are some sample queries. Don’t type the lines starting with #, as these are just comments:

# Cars where the model name contains "Merc" (7 cars)
match $x has model contains "Merc";

[Screenshot: visualiser results for the "Merc" query]

# Cars with more than 4 gears (should be 5 cars)
match $x has gear > 4;

# Japanese-made cars that are manual (should be 5 cars)
match $x isa manual-car; $y isa japanese-maker; (made: $x, maker:$y);

# European cars that are automatic (should all be Mercedes)
match $x isa automatic-car; $y isa european-maker; (made: $x, maker:$y);

At this point, you are ready to start investigating statistics within the data using Graql aggregate and compute queries.

aggregate

The Graql aggregate keyword is the workhorse for statistics. To submit some queries, use the left-hand navigation pane to switch views from Graph to Console. Here are some example aggregate queries to try:

# Count of all cars (32)
match $x isa car; aggregate count;

# Count American car makers (6)
match $x isa american-maker; aggregate count;

# Maximum MPG for an automatic car (24.4)
match $x isa automatic-car, has mpg $a; aggregate max $a;

# Minimum HP for all cars (52)
match $x isa car, has hp $hp; aggregate min $hp;

# Mean MPG for manual and automatic cars (24.39, 17.15)
match $x isa manual-car has mpg $mpg; aggregate mean $mpg;
match $x isa automatic-car has mpg $mpg; aggregate mean $mpg;

# Median number of cylinders (all Mercedes cars) (6)
match $x has model contains "Merc", has cyl $c; aggregate median $c;

# Maximum number of carburetors (all Chrysler cars) (4)
match $x has model contains "Chry", has carb $c; aggregate max $c;

# Minimum number of gears (all cars) (3)
match $x isa car, has gear $g; aggregate min $g;
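As a cross-check outside Grakn, the filtered aggregates above can be reproduced in a few lines of Python. This is a minimal sketch: the rows are a hand-copied subset of R's mtcars (not read from data.gql), and the helper name matching is hypothetical.

```python
from statistics import median

# (model, mpg, cyl, carb) rows copied from R's mtcars: a subset, not the full table.
ROWS = [
    ("Merc 240D", 24.4, 4, 2),
    ("Merc 230", 22.8, 4, 2),
    ("Merc 280", 19.2, 6, 4),
    ("Merc 280C", 17.8, 6, 4),
    ("Merc 450SE", 16.4, 8, 3),
    ("Merc 450SL", 17.3, 8, 3),
    ("Merc 450SLC", 15.2, 8, 3),
    ("Chrysler Imperial", 14.7, 8, 4),
    ("Honda Civic", 30.4, 4, 2),
]


def matching(rows, fragment):
    """Rows whose model name contains `fragment`, like `has model contains`."""
    return [row for row in rows if fragment in row[0]]


mercs = matching(ROWS, "Merc")

merc_count = len(mercs)                                # aggregate count
merc_median_cyl = median(row[2] for row in mercs)      # aggregate median $c
chrysler_max_carb = max(row[3] for row in matching(ROWS, "Chry"))  # aggregate max $c

print(merc_count, merc_median_cyl, chrysler_max_carb)
```

The pattern is the same as in Graql: first match a set of results, then reduce it to a single value.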

compute

Graql also provides compute queries that can be used to determine values such as mean, minimum, and maximum. These can be submitted using the Graph view on the Visualiser. For example, type each of the following into the form and submit:

# Number of automatic (19) and manual cars (13)
compute count in automatic-car;
compute count in manual-car;

# Number of Japanese car makers (4)
compute count in japanese-maker;

# Median number of cylinders (all cars) (6)
compute median of cyl;

# Minimum number of gears (all cars) (3)
compute min of gear;

# Maximum number of carburetors (all cars) (8)
compute max of carb;

# Mean MPG for an automatic car (17.15)
compute mean of mpg in automatic-car;

# Mean MPG for a manual car (24.39)
compute mean of mpg in manual-car;


When to Use aggregate and When to Use compute?

Graql’s aggregate queries are computationally light and run single-threaded on a single machine. They are also more flexible than the equivalent compute query (for example, you can use an aggregate query to filter results by resource).

match $x isa car has model contains "Merc"; aggregate count; # 7

There are times when compute queries are more powerful. They are computationally intensive and can run in parallel on a cluster, so are good for big data and can be used to calculate results very fast. However, you can’t filter the results by resource in the same way as you can for an aggregate query.
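To give a feel for why a statistic such as a mean lends itself to parallel execution, here is a conceptual sketch in Python. It says nothing about how Grakn implements compute internally: the function names are hypothetical, threads stand in for cluster workers, and the values are the mpg figures of the 13 manual cars in R's mtcars.

```python
from concurrent.futures import ThreadPoolExecutor

# mpg of the 13 manual cars in R's mtcars.
MANUAL_MPG = [21.0, 21.0, 22.8, 32.4, 30.4, 33.9, 27.3,
              26.0, 30.4, 15.8, 19.7, 15.0, 21.4]


def partial_sum_and_count(chunk):
    """Work done independently on one partition of the data."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_chunks=4):
    """Mean computed by combining per-partition (sum, count) pairs."""
    if not values:
        raise ValueError("cannot take the mean of an empty list")
    size = -(-len(values) // n_chunks)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


print(round(parallel_mean(MANUAL_MPG), 2))
```

Each partition only needs to report a sum and a count, so the partitions can be processed independently and combined at the end. Filtering by an arbitrary resource value first, as aggregate does, is a different kind of work.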

You can do much more with compute than I have illustrated here: for example, you can calculate the shortest path between two nodes in the graph, and look at clusters within the data. However, mtcars isn’t a great example for those features, since there aren’t many connections within such a simple dataset. The bonus of using a knowledge graph is that it has a flexible structure: the ontology can be extended and revised as new data is added. So if we found additional data, e.g. about dealers offering these cars for sale, links to fan websites, photos or reviews, we could add it and make compute queries to uncover new information.

Reasoning using Graql

Speaking of new information, it’s time to talk about inference, which can be used to find implicit information from the data. For example, given the following statements:

(If) grass is not an animal.
(If) vegetarians only eat things which are not animals.
(If) sheep only eat grass.

It is possible to infer the following:

(Then) sheep are vegetarians.

The initial statements can be seen as a set of premises. If all the premises are met, we can infer a new fact (that sheep are vegetarians). If we hypothesise that sheep are vegetarians, then the whole example can be expressed with a particular two-block structure: IF some premises are met, THEN a given hypothesis is true.

This is how reasoning in Graql works. It checks whether a set of Graql statements can be verified and, if they can, makes inferences from a second block of statements. The first set of statements (the IF part or, if you prefer, the antecedent) is called the left hand side (LHS). The second part (also known as the consequent), not surprisingly, is the right hand side (RHS). Using Graql, both sides of the rule are enclosed in curly braces and preceded by, respectively, the keywords lhs and rhs.

At the bottom of the ontology.gql file, you’ll see the Graql for reasoning over the dataset. The car entity has two extra resources (strings to represent whether it is powerful and economical, which are set to either TRUE or FALSE by Grakn’s reasoner). There are four rules, which test whether a car is economical (by checking whether its mpg is greater than or equal to 19.0) and whether it is powerful (when its horsepower is equal to or above 147).

# Reasoning

car
has economical
has powerful;

economical sub resource datatype string;
powerful sub resource datatype string;

$car-economy-true isa inference-rule
lhs
{$c isa car has mpg >= 19.0;}
rhs
{$c has economical "TRUE";};

$car-economy-false isa inference-rule
lhs
{$c isa car has mpg < 19.0;}
rhs
{$c has economical "FALSE";};

$car-powerful-true isa inference-rule
lhs
{$c isa car has hp >= 147;}
rhs
{$c has powerful "TRUE";};

$car-powerful-false isa inference-rule
lhs
{$c isa car has hp < 147;}
rhs
{$c has powerful "FALSE";};

By default, inference is switched off, and the only information you can query Grakn about is what was directly loaded from the data. Open the Graql shell by typing the following in the terminal window:

../../bin/graql.sh -n

The -n flag turns on inference, so you can make the following query in the shell:

>>> match $x has model $y, has powerful "TRUE", has economical "TRUE";
$x id "106584" isa manual-car; $y val "Ferrari Dino" isa model;
$x id "254120" isa automatic-car; $y val "Pontiac Firebird" isa model;

I found the results quite surprising. If I want to buy an economical yet powerful car, it seems I will need to save up for a Ferrari Dino or Pontiac Firebird! Or maybe I should add some extra data to the graph that takes purchase price into account, and use reasoning to find the sweet spot between purchase price, running costs and power.
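To see why only these two cars qualify, the four rules can be mimicked in plain Python. This is a sketch of the IF/THEN idea only, not of Grakn's reasoner: the rows are a hand-copied subset of R's mtcars, and the names RULES and infer are hypothetical.

```python
# (model, mpg, hp) rows copied from R's mtcars: a subset chosen for illustration.
CARS = [
    ("Ferrari Dino", 19.7, 175),
    ("Pontiac Firebird", 19.2, 175),
    ("Hornet Sportabout", 18.7, 175),
    ("Maserati Bora", 15.0, 335),
    ("Honda Civic", 30.4, 52),
    ("Merc 240D", 24.4, 62),
]

# Each rule pairs a condition (the lhs) with a derived resource and value (the rhs).
RULES = [
    (lambda mpg, hp: mpg >= 19.0, "economical", "TRUE"),
    (lambda mpg, hp: mpg < 19.0, "economical", "FALSE"),
    (lambda mpg, hp: hp >= 147, "powerful", "TRUE"),
    (lambda mpg, hp: hp < 147, "powerful", "FALSE"),
]


def infer(cars, rules):
    """Apply every rule to every car and collect the derived resources."""
    derived = {}
    for model, mpg, hp in cars:
        facts = {}
        for condition, resource, value in rules:
            if condition(mpg, hp):
                facts[resource] = value
        derived[model] = facts
    return derived


facts = infer(CARS, RULES)

# Equivalent of: has powerful "TRUE", has economical "TRUE"
both = sorted(
    model for model, derived in facts.items()
    if derived == {"economical": "TRUE", "powerful": "TRUE"}
)
print(both)
```

The Hornet Sportabout shows how narrow the margin is: it has the same 175 hp as the two winners, but at 18.7 mpg it falls just below the 19.0 threshold.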

This is a rather trivial illustration of reasoning, again somewhat hampered by the simplistic nature of the example. There is a complete example in our documentation, which examines implicit family relationships.

Where Next?

If you haven’t already, I recommend that you review the documentation about aggregate queries and compute queries, since there is more to compute than just statistical analysis. There is also an example of using Graql analytics on the genealogy dataset available here.

This example was based on CSV data migrated into Grakn. Having read it, you may want to further study our documentation about CSV migration and Graql templating.

If you want to read another guide to getting started with GRAKN.AI, we have one on our website to get you up and running with Graql, and a tutorial to get started with Java. We do also have bindings for R, Python and Haskell, although these are currently incomplete.

The code for GRAKN.AI is available on GitHub, and there is a thriving developer community that offers support over Slack and discussion forums. Check out GRAKN.AI for more information.

©ODSC2017

Feature Image credit: “#Ferrari #dino” by Mycatkins is licensed under CC BY 2.0