Whether data analysts and data scientists should learn SQL is a perennial debate, especially now that there are great packages and libraries to ingest data directly in R or Python. But in today's day and age, with the rise of data lakes, serverless architecture, and cloud data warehouses, much of your data is likely already stored in databases in a structured format. So why not use the power of these incredible cloud databases, with their auto-scaling functionality, to do Exploratory Data Analysis (EDA) and prepare the data required to run machine learning models? Knowing SQL will also help you get intimate with your data and learn how relational databases work. And if you are a product analyst or data scientist working on a SaaS platform, it definitely helps you understand how the data is captured and how it flows within the platform.
In our example, we are assuming we are a Brazilian eCommerce company and would love to perform customer segmentation so we can set up a marketing strategy for each customer segment. The data for this blog post comes from https://www.kaggle.com/olistbr/brazilian-ecommerce.
This dataset consists of details of the customers who bought items on the site, the products they bought, the sellers they bought from, and, of course, the orders themselves, including the quantity and price of the different items in every order. All the sourced data is available in the data lake (built on a PostgreSQL database) created by our lovely hypothetical data engineers.
Now, it is our job as data analysts/data scientists to look through that data, understand it, and model the data for customer segmentation.
After some research, we decided to go with the LRFM clustering model to segment our customers. Hence, we need the data modeled so that the granularity of the dataset is at the customer level. The attributes of the table will be length (customer tenure), recency (time since they last bought on the website), frequency (how often they buy and/or how many items they have bought on the site), and monetary (the value of the goods they buy).
Let’s understand how these tables are connected to each other.
So, let's start by creating a view of all the required data points.
create or replace view odsc_workshop.full_dataset as (
    select
        c.customer_id
        ,c.customer_city
        ,c.customer_state
        ,oi.order_id
        ,oi.order_item_id
        ,oi.price
        ,oi.freight_value
        ,o.order_status
        ,o.order_delivered_customer_date
        ,p.product_id
        ,p.product_category_name
        ,s.seller_id
        ,s.seller_city
        ,s.seller_state
        ,op.payment_sequential
        ,op.payment_type
        ,op.payment_installments
        ,op.payment_value
        ,o.order_purchase_timestamp
    from odsc_workshop.customers c
    left join odsc_workshop.orders o
        on c.customer_id = o.customer_id
    left join odsc_workshop.order_payments op
        on o.order_id = op.order_id
    left join odsc_workshop.order_items oi
        on o.order_id = oi.order_id
    left join odsc_workshop.products p
        on oi.product_id = p.product_id
    left join odsc_workshop.sellers s
        on oi.seller_id = s.seller_id
)
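If you are more comfortable in pandas than SQL, here is a rough analogue of how those left joins chain together. The tiny inline dataframes below are made-up stand-ins for the real Olist tables, not the actual data, and only a few columns are shown.

```python
import pandas as pd

# Hypothetical miniature versions of three of the tables.
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_city": ["sao paulo", "rio de janeiro"],
})
orders = pd.DataFrame({
    "order_id": ["o1"],
    "customer_id": ["c1"],
    "order_purchase_timestamp": pd.to_datetime(["2018-01-05"]),
})
order_items = pd.DataFrame({
    "order_id": ["o1"],
    "order_item_id": [1],
    "price": [59.90],
})

# how="left" mirrors the SQL left joins: every customer is kept,
# even c2, who has no orders (its order columns come back as NaN).
full_dataset = (
    customers
    .merge(orders, on="customer_id", how="left")
    .merge(order_items, on="order_id", how="left")
)
print(full_dataset)
```

The same pattern extends to the payments, products, and sellers tables; each additional `merge(..., how="left")` corresponds to one more left join in the view.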
Now let's start massaging our data to calculate the length, recency, frequency, and monetary values for every customer who has purchased on the site.
select
    customer_id
    ,total_items_bought
    ,monetary_value
    ,date_part('day', last_purchase_date - first_purchase_date) as customer_tenure_in_days
    ,date_part('day', current_date - last_purchase_date) as days_since_last_purchase
from (
    select
        customer_id
        ,min(order_purchase_timestamp) as first_purchase_date
        ,max(order_purchase_timestamp) as last_purchase_date
        ,count(distinct order_item_id) as total_items_bought
        ,sum(price) as monetary_value
    from odsc_workshop.full_dataset
    group by 1
) lrfm
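As a sanity check on that SQL, the same LRFM rollup can be expressed in pandas. Here `df` is a hypothetical slice of the full_dataset view with just the columns the aggregation needs.

```python
import pandas as pd

# Made-up sample rows standing in for odsc_workshop.full_dataset.
df = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "order_item_id": [1, 2, 1],
    "price": [10.0, 20.0, 5.0],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-03-01", "2018-02-15"]
    ),
})

# group by customer, then min/max/nunique/sum mirror the SQL aggregates.
lrfm = df.groupby("customer_id").agg(
    first_purchase_date=("order_purchase_timestamp", "min"),
    last_purchase_date=("order_purchase_timestamp", "max"),
    total_items_bought=("order_item_id", "nunique"),
    monetary_value=("price", "sum"),
).reset_index()

# Length: days between first and last purchase (the SQL date_part('day', ...)).
lrfm["customer_tenure_in_days"] = (
    lrfm["last_purchase_date"] - lrfm["first_purchase_date"]
).dt.days
# Recency: days since the last purchase.
lrfm["days_since_last_purchase"] = (
    pd.Timestamp.today().normalize() - lrfm["last_purchase_date"]
).dt.days
print(lrfm)
```

For c1 this yields a tenure of 59 days (Jan 1 to Mar 1, 2018) and a monetary value of 30.0, which is exactly what the SQL query would return for the same rows.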
Now that you have seen the data and it looks good, let's create a view that can then be used in Python to carry out our customer segmentation task. This way, others who want to use this dataset do not need to massage the data again, and other data scientists can collaborate with you on this dataset to add more value to the task at hand.
create or replace view odsc_workshop.lrfm_data as (
    select
        customer_id
        ,total_items_bought
        ,monetary_value
        ,date_part('day', last_purchase_date - first_purchase_date) as customer_tenure_in_days
        ,date_part('day', current_date - last_purchase_date) as days_since_last_purchase
    from (
        select
            customer_id
            ,min(order_purchase_timestamp) as first_purchase_date
            ,max(order_purchase_timestamp) as last_purchase_date
            ,count(distinct order_item_id) as total_items_bought
            ,sum(price) as monetary_value
        from odsc_workshop.full_dataset
        group by 1
    ) lrfm
)
Now that the data is ready as a view, you can pull it directly into Python to do the modeling. The code below walks you through the steps.
# Import libraries
import psycopg2
import pandas as pd

# Connect to the database
conn = psycopg2.connect(user="postgres",
                        password="XXXXXXXXX",
                        host="127.0.0.1",
                        port="5432",
                        database="odsc_sql")

# Select your dataset
query = "select * from odsc_workshop.lrfm_data"

# pandas executes the query and loads the result into a dataframe,
# so a separate cursor is not needed
df_purchases = pd.read_sql_query(query, conn)

# Check your data
df_purchases.head()

# Close the connection
conn.close()
Now you can work on your clustering algorithm using the dataframe df_purchases.
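To give a flavour of that next step, here is a minimal sketch using scikit-learn's KMeans on the four LRFM columns from the view. The inline dataframe is made-up stand-in data rather than the real query result, and the choice of three clusters is arbitrary; in practice you would pick the cluster count with something like the elbow method or silhouette scores.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for df_purchases pulled from odsc_workshop.lrfm_data.
df_purchases = pd.DataFrame({
    "customer_id": [f"c{i}" for i in range(8)],
    "customer_tenure_in_days": [10, 12, 300, 310, 150, 160, 5, 305],
    "days_since_last_purchase": [5, 7, 200, 210, 90, 95, 3, 205],
    "total_items_bought": [1, 2, 20, 22, 8, 9, 1, 21],
    "monetary_value": [20.0, 35.0, 900.0, 950.0, 300.0, 320.0, 15.0, 920.0],
})

features = ["customer_tenure_in_days", "days_since_last_purchase",
            "total_items_bought", "monetary_value"]

# Scale first: KMeans is distance-based, so the unscaled monetary values
# would otherwise dominate the other three dimensions.
X = StandardScaler().fit_transform(df_purchases[features])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df_purchases["segment"] = kmeans.fit_predict(X)
print(df_purchases[["customer_id", "segment"]])
```

Each customer now carries a segment label that marketing can act on, for example targeting the high-monetary, high-frequency cluster differently from the lapsed low-spend one.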
So, if you want to know more about designing your data model and learn how to write fast, efficient SQL queries like the ones in this blog post, come over to my upcoming ODSC APAC workshop "SQL for Data Science" on December 9th. You will learn all the basics as well as more advanced analytical functions that will help you in your day-to-day job, and no data engineer will ever get annoyed at you for writing terrible queries. Feel free to go through the prerequisites video to set the stage for the workshop, so that we can dive straight into writing queries.
About the author/ODSC APAC speaker: Duhita Khadepau
Graduating with a degree in Information Technology, my love of working with data landed me my first job. Spanning three continents, my projects saw me architect Data Warehouses and Data Analytics platforms across banking, retail, and startups. Currently, I am building the data platform and strategy at a construction tech B2B company called Assignar. Data, being a relatively new field, is still shackled by the archaic ways of the old technology days, with women still few and far between. Being a woman in this emerging area brings with it a sense of pride, but also a sense of responsibility. Empowering women to challenge the status quo and consider a career in data is also on my agenda. I also enjoy keeping myself involved in academia by teaching in the Master of Data Science and Innovation (MDSI) program as a lecturer for Data Science Practices, where I get to teach and mentor budding data enthusiasts. One of my areas of interest is deep learning (neural networks and AI). I am fascinated by the possibilities of its application across industries and am keen to realize its potential in mainstream businesses. Hence, I aim to continue my quest for knowledge and discovery with an industry PhD in the field of neural networks and project management.