fbpx
Watermain Breaks in the City of Toronto Watermain Breaks in the City of Toronto
It has been a while since my last post due to the major transition of moving back to Canada. This post... Watermain Breaks in the City of Toronto

It has been a while since my last post due to the major transition of moving back to Canada. This post will be a bit shorter than my previous ones but hopefully it will give some insight on practically investigating and analyzing open data that are becoming more popular these days. All the code for this post is found at my Github.

Drinking Water

Water is a basic physiological need for humans. We can go days without food and be okay, but without water… that’s another story. Drinking water or potable water is something in the developed world we take for granted. Much of the world does not have great access to clean water and by United Nations had made their 6th Sustainable Development Goal to “ensure access to water and sanitation for all” and that all people would have universal access to safe and affordable water by 2030.

In Ontario (province of Canada) we are fortunate to have good water quality in most cities, towns, municipalities. It’s important to note that a disparity exists between regions across the province: large vs. small cities, southern vs. northern locations, and within indigenous communities. Across the province, there are different types of drinking water systems, various combinations of owners/operators, and (complex) regulatory network ensuring that the quality of the water reaching users is acceptable for consumption.

Watermains Breaks

Within these drinking water systems, the large pipes that carry most of the water through the system from the treatment plant through the distribution system are called watermains (or water mains). In the event of a watermain failing, either by a crack, some leaking or a catastrophic break, this can have potential effects on the flow of water to users. There can be loss of pressure in the distribution system, potentially allowing contaminants (chemical or microbiological) to enter the drinking water system due to the loss of pressure. No one wants to have E. coli or some industrial chemical in their tap water.

There are many factors behind a watermain failure. Most of the time there multiple concurrent factors that cause the pipe to crack or break. A non-exhaustive list of factors include type of pipe material, age of pipe, surrounding soil conditions, corrosion, pressure fluctuations, direct damage, extreme temperatures, extreme weather events, and poor system design.

Toronto Watermain Break Data

The City of Toronto has an Open Data Catalogue containing various types of datasets on a number of topics like environment, public safety, development, and government. You can follow them on Twitter to know when they update/refresh their data. The watermain break data can be found here. I downloaded the Excel flat-file containing the date and geographic location of watermain breaks from 1990 to 2016 within the city of Toronto boundaries. For this post I will be focusing on the temporal nature of the data, aka what do the trends of watermain breaks look like over time in Toronto. I ran into issues on converting the X,Y coordinate data into real latitude and longitude information so that will have to wait (potential part 2 in the future???).

Data Import, Exploration, Processing

After downloading the Excel file into my data folder, I use the list.files function and some regular expression to identify the .xlsx files in the folder. This isn’t necessary since we only have one file in our folder but I’ve decided to make it good practice to have a standard method in selecting multiple files and file types in my data folder in a programmatic fashion. Afterwards I use the read_excel function from the readxl package to import the dataset into R.

1
2
watermain.files <- list.files("data", pattern = "\.xlsx$")
wm.df <- read_excel(paste0("data/",watermain.files))

We see that there are 35461 rows or (assumed) unique watermain breaks. After using the strfunction to get a basic idea of the types of columns I have. I rename the column names using the names function and create a new column with the year as a factor type (I actually don’t end up using this).

1
2
3
4
wm.df <- read_excel(paste0("data/",watermain.files))
names(wm.df) <- c("Date","Year","X_coord","Y_coord") #change column names
wm.df$Year_f <- as.factor(wm.df$Year) #create new column with 'Year' as factor
wm.df$Year <- as.integer(wm.df$Year) #convert original 'Year' to an integer

Next I use the floor_date function from the lubridate package to create new columns containing the month and week. This will be useful later on when I want to aggregate the frequency counts of watermain breaks by week or month.

1
2
wm.df$week <- floor_date(wm.df$Date, unit = "week") #floor to week
wm.df$month <- floor_date(wm.df$Date, unit = "month") #floor to month

Then I just create a new column that contains the same date information but instead of POSIXct, it is in Date format.

1
wm.df <- wm.df %>% mutate(date = as.Date(Date))

This is not related to the temporal information but a good step with the data exploratory workflow. I want to do a bit a quality control/assurance with the data I have. I use the summary function to look at the Date, Year, X_coord, and Y_coord columns. First thing that catches my eye is that there seems to be some extreme values or outliers in the spatial information, specifically the X_coord maximum and Y_coord minimum. For geographic information, typical errors include inputting latitude when it should be longitude, and vice-versa.

1
2
3
4
5
6
7
8
wm.df %>% select(X_coord, Y_coord) %>% summary()
    X_coord           Y_coord
 Min.   : 294164   Min.   : 304472
 1st Qu.: 303580   1st Qu.:4838095
 Median : 311195   Median :4842843
 Mean   : 311820   Mean   :4841655
 3rd Qu.: 318587   3rd Qu.:4846416
 Max.   :4845682   Max.   :4855952

First I look at the X_coord first and discover that the maximum value is not likely a Y_coord and it is the only “error” in this column. The code below shows the steps I took to identify and remove this outlier.

1
2
3
4
5
wm.df %>% arrange(desc(X_coord)) #error is X_coord = 4845681.6, not a Y_coord either
sort(wm.df$X_coord,decreasing = T)[1:3] # look at the top 3 largest values
summary(wm.df$X_coord) #identify the error/outlier
wm.df[which(wm.df$X_coord == max(wm.df$X_coord)),] #identify the row with the error/outlier
wm.df <- wm.df[-which(wm.df$X_coord == max(wm.df$X_coord)),] #remove error 2000-01-22

Then for the Y_coord I do the same steps and discover there are three errors within this column. Again, the code below shows the steps I took to remove them.

1
2
3
4
wm.df %>% arrange(Y_coord) #first three Y_coord are duplicates of X_coord
sort(wm.df$Y_coord,decreasing = F)[1:3] # Y_coord errors, three
wm.df[which(wm.df$Y_coord %in% sort(wm.df$Y_coord,decreasing = F)[1:3]),]
wm.df <- wm.df[-which(wm.df$Y_coord %in% sort(wm.df$Y_coord,decreasing = F)[1:3]),] #remove these errors

Now I can aggregate the data by each week, month, and date simply using the count function from the dplyr package.

1
2
3
month.wm <- wm.df %>% count(month) #month
week.wm <- wm.df %>% count(week)
year.wm <- wm.df %>% count(Year)

Data Visualization

Then we can use ggplot2 create a simple plot that visualizes the number of watermain breaks per week from 1990 to 2016.

1
2
3
4
ggplot(data = week.wm, aes(x=week,y=n)) +
   geom_line() + labs(title = "Watermain Breaks in Toronto (1990-2016)",
                      x = "Year", y = "Number of Breaks per Week") +
  scale_x_datetime(date_breaks = "2 years", date_labels = "%Y") + theme_minimal()

simple_weekly_wm

It looks like there is some pattern over time with peaks occurring in a periodic fashion. Let’s take a look at the seasonality of watermain breaks by month and week. We can add a new column to our data frame that codes for the month using the mutate function.

1
2
3
4
5
6
7
8
mth.wm <- wm.df %>% group_by(month,Year) %>%
count(month) %>%
mutate(month_n = as.factor(month(month, label = T)))
mthwk.wm <- wm.df %>% group_by(week,month,Year) %>%
count(week,month,Year) %>%
mutate(month_n = as.factor(month(month, label = T))) %>%
mutate(yweek = week(week))

Then we can visualize this using boxplots representing the number of watermain breaks per month.

1
2
3
4
ggplot(mth.wm, aes(x=month_n,y=n)) +
geom_boxplot(aes(group=month_n)) +
labs(title = "Seasonality of Watermain Breaks in Toronto (1990-2016)",
x = "Month", y = "Number of Breaks") + theme_minimal()

mth_season_wm

Looking at the dark black line in each boxplot, we can see that there is a clear increase in the median number of watermain breaks in Toronto during the colder months (November, December, January, February). Let’s use the nifty animation package and create a .gif file to see how trend changes over the 27 year period. The boxplots in the animation below are for the entire time period from 1990 to 2016 and the changing line represents the median number of watermain breaks per month in the specified year in the title.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
library(animation)
aspect.w <span               data-mce-type="bookmark"                id="mce_SELREST_start"              data-mce-style="overflow:hidden;line-height:0"              style="overflow:hidden;line-height:0"           ></span><- 800
aspect.r <- 1.6
title.size <- 20
ani.options(ani.width=aspect.w,ani.height=aspect.w/aspect.r, units="px")
saveGIF( {
  for (i in unique(mth.wm$Year)) {
    g.loop <- ggplot(data=mth.wm, aes(x=month_n,y=n)) +
      geom_boxplot(aes(group=month_n)) +
      stat_summary(data = subset(mth.wm, Year == i), fun.y=median, geom="line",
                   aes(group=1), color = "#222FC8", alpha = 0.6, size = 2) +
      labs(title = paste0("Seasonality of Watermain Breaks in Toronto (",i,")"),
           x = "Month", y = "Number of Breaks per Month") +
      theme(plot.title = element_text(size = title.size, face = "bold"))
    print(g.loop)
    }
  }, movie.name = "wm_wm.gif",interval=0.9,nmax=30,2)

wm_wm

Eugene Joh

Eugene Joh

A detail-oriented MS and MPH graduate with a strong quantitative background, motivated to use his current and developing skill set to improve the health of vulnerable populations in local and global settings.

1