Apache Drill’s Data Science Potential
With the rising demand for big data analysis, several tools have been launched in the past couple of years. My unwavering love for SQL spurred me to look for analysis tools which are easier to learn/use with proficiency in SQL. I have come across Drill very recently and I am quite impressed with the flexibility as well the wide variety of options offered by this tool. Drill is an Apache open source query engine for big data analysis. Studies show that data scientists spend 50-90 percent of their time gathering and preparing data. Apache Drill is a relatively new tool that can help solve this difficult problem by allowing analysts and data scientists to query disparate data sets in-place using standard ANSI SQL. It allows the users to analyze without having to define complex schema, or having to rebuild their entire data infrastructure. Drill is a boon in disguise for anyone that relies on SQL to make meaningful inferences from data sets.
Drill is inspired by Google’s Dremel and is designed to scale to several thousands of nodes and query petabytes of data at incredible speed that BI/Analytics environments require. Drill includes a distributed execution environment, which is built for large scale data processing. Though Drill works in a Hadoop cluster environment, Drill is not tied to Hadoop and can run in any distributed cluster environment.
When data comes across a Data Scientist, it often comes in unstructured and in multiple formats from a diversity of sources. Compared to other open source data analysis tools, Drill has clear advantages. Drill allows users to explore data in any kind of format including structured, semi structured as well as unstructured as opposed to HIVE which can only be used to explore structured data. Drill also supports a variety of NoSQL databases and file systems, including
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
- local files.
Drill does not require schema or type specification for data in order to start the query execution process. Drill starts data processing in record-batches and discovers the schema during processing. Self-describing data formats such as Parquet, JSON, AVRO, and NoSQL databases have schema specified as part of the data itself, which Drill leverages dynamically at query time. Another exclusive feature of Apache Drill is Drill Explorer. Drill Explorer is a user interface for browsing Drill data sources, previewing the results of a SQL query, and creating a view and querying the view as if it were a table. Drill explorer helps you to examine and understand the metadata in any format before querying or designing views, which are used to visualize data in BI/Analytics tools like tableau. It allows the user to explore structure, size, content of data in any format.
The Apache Drill Explorer window has two tabs: browser tab and SQL tab. The Browse tab lets the user view any existing metadata for a schema that you access with Drill. SQL tab allows the user to preview the results of custom queries and save the results as a view.
Drill is extensible and can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions thus mitigating ETL(Extraction-transformation-loading) process to combine data from multiple sources to perform the required analysis. We need to add required storage plugins based on the available data sets in order to be able to explore these disparate datasets. Here’s a query that combines data from a Hive table, an HBase table (view) and a JSON file: Here’s a query that combines data from a Hive table, an HBase table (view) and a JSON file:
Select – custview.membership, sum(orders.order_total) AS sales
From – hive.orders, custview, dfs.`clicks/clicks.json` c
Where – orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id
Group by – custview.membership
Order by – 2;
Below is a short tutorial of how to experiment with Yelp data set using Apache Drill. The data set can be found at yelp. The data is in JSON format. The term dfs. <path-to-yelp-data set> used in each of the queries below refers to the path where the data set is saved on a machine and it needs to be modified to include the path to yelp data set on your machine when trying this example.
The first step is to explore the data present in those JSON files using Drill. We can use SQL SELECT to view the contents of the JSON file. We can restrict the number of rows to be returned by using LIMIT in the SELECT statement. We need to provide the location of the JSON file in the SELECT statement in place of table name in a regular SQL query. We can directly query self-describing files such as JSON, Parquet, and text files.
We can explore the review data set further by examining specific columns corresponding to JSON file. We can also use aggregation functions like sum in the SQL statements. Below are a few instances of data exploration for the Yelp data set.
Total Number of reviews in the data set
Top states and their cities with highest number of reviews
Top Hotels with count of reviews greater than 1000
We can view the attributes in Yelp business data set by turning on the text mode in Drill. We need to turn off the text mode when trying to perform arithmetic operations on the data set. We can use an alter statement to set the text mode in Drill
alter system set `store.json.all_text_mode` = false;
alter system set `store.json.all_text_mode` = true;
We can combine Yelp business data set with the Yelp review data set which holds additional details corresponding to each review to perform the required analysis. Below is the query to retrieve list of business(hotels) which have received more than 2000 votes as a good restaurant/Hotel.
We can create views in Drill which can be used visualize/query in BI/analytics tools such as Tableau or we can query on the views in Drill to retrieve the required data. Below is the SQL statement to create views which uses both Yelp business and review data sets
We can query on the view created above. For instance, we can retrieve the total numbers records corresponding to the view using a select statement on the view
Business users, analysts and data scientists use standard BI/analytics tools such as Tableau, Qlik View, Micro Strategy, SAS and Excel to interact with non-relational data stores by leveraging Drill’s JDBC and ODBC drivers.
Drill’s symmetrical architecture and simple installation make it easy to deploy and operate very large clusters. Drill is the world’s first and only distributed SQL engine that doesn’t require schema. All of these features make Apache Drill most desirable tool for Data Analysis.