fbpx
SQL Equivalents in R SQL Equivalents in R
Whenever I’m teaching introductory courses in data science using the R language, I often encounter students who use a different language... SQL Equivalents in R

Whenever I’m teaching introductory courses in data science using the R language, I often encounter students who use a different language like Python or Julia, and still others who are transitioning into data science from other fields and don’t know any data science language at all. The common thread is they need to get up to speed with R quickly to be productive, so I like to address their needs in a creative way. One technique I use is to find some commonality in terms of background and experience. I find that most newbie data scientists have some familiarity with SQL, so that’s a good starting point.

In this article, we’ll review a number of ways R has to access data sets (once stored in an R data frame) that are similar to the functionality found in SQL. If you already know SQL, it is useful to see the equivalent R way of doing things for manipulating data. To illustrate these techniques, we’ll use the CO2 data set that’s part of the base R system. Here is the structure and a short extract of the data set’s content:

> data(CO2)

> head(CO2)

 Plant   Type Treatment conc uptake

1   Qn1 Quebec nonchilled   95 16.0

2   Qn1 Quebec nonchilled  175 30.4

3   Qn1 Quebec nonchilled  250 34.8

4   Qn1 Quebec nonchilled  350 37.2

5   Qn1 Quebec nonchilled  500 35.3

6   Qn1 Quebec nonchilled  675 39.2

The small dataset has 84 observations along with 5 variables: Plant, Type, Treatment, conc, and uptake. Here is the SQL SELECT statement to retrieve rows from the data frame based on a filter condition:

SELECT * FROM CO2 WHERE conc>400 AND uptake>40

The R equivalent uses the following simple syntax:

CO2_subset <- CO2[CO2$conc>400 & CO2$uptake>40,]

head(CO2_subset)

  Plant   Type Treatment conc uptake

12   Qn2 Quebec nonchilled  500 40.6

13   Qn2 Quebec nonchilled  675 41.4

14   Qn2 Quebec nonchilled 1000   44.3

19   Qn3 Quebec nonchilled  500 42.94

20   Qn3 Quebec nonchilled  675 43.9

21   Qn3 Quebec nonchilled 1000   45.5

> dim(CO2_subset)   # 8 observations selected

[1] 8 5

Now let’s consider an example of the ORDER BY clause of the SQL SELECT statement. Here, we wish to sort the result set by the variable conc in ascending sequence, and then by the variable uptake in descending sequence. Here is the SQL statement:

SELECT * FROM CO2 ORDER BY conc, uptake DESC

The R equivalent uses the following simple syntax which includes some additional R syntax to limit the result set to the first 20 records:

> CO2[order(CO2$conc, -CO2$uptake),][1:20,]

Plant        Type Treatment conc uptake

15   Qn3    Quebec nonchilled   95 16.2

1    Qn1    Quebec nonchilled   95 16.0

36   Qc3    Quebec chilled   95 15.1

22   Qc1    Quebec chilled   95 14.2

8    Qn2    Quebec nonchilled   95 13.6

50   Mn2 Mississippi nonchilled   95 12.0

57   Mn3 Mississippi nonchilled   95 11.3

43   Mn1 Mississippi nonchilled   95 10.6

78   Mc3 Mississippi    chilled 95 10.6

64   Mc1 Mississippi    chilled 95 10.5

29   Qc2    Quebec chilled   95 9.3

71   Mc2 Mississippi    chilled 95 7.7

16   Qn3    Quebec nonchilled  175 32.4

2    Qn1    Quebec nonchilled  175 30.4

9    Qn2    Quebec nonchilled  175 27.3

30   Qc2    Quebec chilled  175 27.3

23   Qc1    Quebec chilled  175 24.1

51   Mn2 Mississippi nonchilled  175 22.0

37   Qc3    Quebec chilled  175 21.0

58   Mn3 Mississippi nonchilled  175 19.4

Another powerful SQL construct is the GROUP BY clause used to compute aggregate values such as average. Continuing with our example, we’d like to calculate the average uptake value for each unique value of Plant. Here is the SQL statement to achieve this:

SELECT Plant, AVG(uptake) FROM CO2 GROUP BY Plant

The R equivalent uses the following syntax based on the aggregate() function. The first argument is x=CO2[,c(“uptake”)], which isolates the uptake column from the CO2 data frame. The second argument is by=data.frame(CO2$Plant), which is the grouping variable. Finally, the FUN=”mean” argument is the R function which is used to compute summary statistics, in this case, the mean.

> aggregate(x=CO2[,c(“uptake”)], by=data.frame(CO2$Plant), FUN=”mean”)

  CO2.Plant        x

1        Qn1 33.22857

2        Qn2 35.15714

3        Qn3 37.61429

4        Qc1 29.97143

5        Qc3 32.58571

6        Qc2 32.70000

7        Mn3 24.11429

8        Mn2 27.34286

9        Mn1 26.40000

10       Mc2 12.14286

11       Mc3 17.30000

12       Mc1 18.00000

The last example will be how to do a SQL JOIN using R. In this case, we’ll look up the country from a secondary table that maps states and provinces to countries. Here is the SQL statement to achieve this:

SELECT c.Type,

c.Plant,

c.Treatment,

c.conc,

c.uptake,

g.country

FROM geo_map g

LEFT JOIN CO2 c ON(c.Type = g.Type)

The R equivalent uses the following syntax based on the merge() function. Looking at the first several observations from the CO2 dataset, we see the variable Type, which contains the state or province of origin for the plant. We’ll use this variable as the common key value. Next, we’ll create a new data frame, geo_map, which plays the role of a look-up table in which the country is paired with the state/province. Then, after assigning appropriate column names for geo_map, we will use merge() to produce the equivalent of a SQL joined result set, the joinCO2 data frame. The new data frame contains a variable country, which is the lookup value based on Type. The R code has the same effect as the SQL JOIN.

> head(CO2)

 Plant   Type Treatment conc uptake

1   Qn1 Quebec nonchilled   95 16.0

2   Qn1 Quebec nonchilled  175 30.4

3   Qn1 Quebec nonchilled  250 34.8

4   Qn1 Quebec nonchilled  350 37.2

5   Qn1 Quebec nonchilled  500 35.3

6   Qn1 Quebec nonchilled  675 39.2

 

> stateprov <- c(“Mississippi”, “California”, “Victoria”, “New South Wales”, “Quebec”, “Ontario”)

> country <- c(“United States”, “United States”, “Australia”, “Australia”, “Canada”, “Canada”)

 

> geo_map <- data.frame(country=country, stateprov=stateprov)

> geo_map

       country           Type

1 United States     Mississippi

2 United States      California

3     Australia        Victoria

4     Australia New South Wales

5        Canada       Quebec

6        Canada      Ontario

 

> colnames(geo_map) <- c(“country”, “Type”)

> joinCO2 <- merge(CO2, geo_map, by=c(“Type”))

> head(joinCO2)

        Type Plant  Treatment conc uptake       country

1 Mississippi   Mn1 nonchilled 95   10.6 United States

2 Mississippi   Mn1 nonchilled 175   19.2 United States

3 Mississippi   Mn1 nonchilled 250   26.2 United States

4 Mississippi   Mn1 nonchilled 350   30.0 United States

5 Mississippi   Mn1 nonchilled 500   30.9 United States

6 Mississippi   Mn1 nonchilled 675   32.4 United States

If you prefer to stick with your knowledge of SQL, you can access R data frames directly with standard SQL queries through the use of the sqldf package.

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

1