

Watermain Breaks in the City of Toronto
Data VisualizationData WranglingModelingStatisticsposted by Eugene Joh January 30, 2018 Eugene Joh

It has been a while since my last post due to the major transition of moving back to Canada. This post will be a bit shorter than my previous ones but hopefully it will give some insight on practically investigating and analyzing open data that are becoming more popular these days. All the code for this post is found at my Github.
Drinking Water
Water is a basic physiological need for humans. We can go days without food and be okay, but without water… that’s another story. Drinking water or potable water is something in the developed world we take for granted. Much of the world does not have great access to clean water and by United Nations had made their 6th Sustainable Development Goal to “ensure access to water and sanitation for all” and that all people would have universal access to safe and affordable water by 2030.
In Ontario (province of Canada) we are fortunate to have good water quality in most cities, towns, municipalities. It’s important to note that a disparity exists between regions across the province: large vs. small cities, southern vs. northern locations, and within indigenous communities. Across the province, there are different types of drinking water systems, various combinations of owners/operators, and (complex) regulatory network ensuring that the quality of the water reaching users is acceptable for consumption.
Watermains Breaks
Within these drinking water systems, the large pipes that carry most of the water through the system from the treatment plant through the distribution system are called watermains (or water mains). In the event of a watermain failing, either by a crack, some leaking or a catastrophic break, this can have potential effects on the flow of water to users. There can be loss of pressure in the distribution system, potentially allowing contaminants (chemical or microbiological) to enter the drinking water system due to the loss of pressure. No one wants to have E. coli or some industrial chemical in their tap water.
There are many factors behind a watermain failure. Most of the time there multiple concurrent factors that cause the pipe to crack or break. A non-exhaustive list of factors include type of pipe material, age of pipe, surrounding soil conditions, corrosion, pressure fluctuations, direct damage, extreme temperatures, extreme weather events, and poor system design.
Toronto Watermain Break Data
The City of Toronto has an Open Data Catalogue containing various types of datasets on a number of topics like environment, public safety, development, and government. You can follow them on Twitter to know when they update/refresh their data. The watermain break data can be found here. I downloaded the Excel flat-file containing the date and geographic location of watermain breaks from 1990 to 2016 within the city of Toronto boundaries. For this post I will be focusing on the temporal nature of the data, aka what do the trends of watermain breaks look like over time in Toronto. I ran into issues on converting the X,Y coordinate data into real latitude and longitude information so that will have to wait (potential part 2 in the future???).
Data Import, Exploration, Processing
After downloading the Excel file into my data folder, I use the list.files
function and some regular expression to identify the .xlsx files in the folder. This isn’t necessary since we only have one file in our folder but I’ve decided to make it good practice to have a standard method in selecting multiple files and file types in my data folder in a programmatic fashion. Afterwards I use the read_excel
function from the readxl
package to import the dataset into R.
1
2
|
watermain.files <- list.files ( "data" , pattern = "\.xlsx$" ) wm.df <- read_excel ( paste0 ( "data/" ,watermain.files)) |
We see that there are 35461 rows or (assumed) unique watermain breaks. After using the str
function to get a basic idea of the types of columns I have. I rename the column names using the names
function and create a new column with the year as a factor
type (I actually don’t end up using this).
1
2
3
4
|
wm.df <- read_excel ( paste0 ( "data/" ,watermain.files)) names (wm.df) <- c ( "Date" , "Year" , "X_coord" , "Y_coord" ) #change column names wm.df$Year_f <- as.factor (wm.df$Year) #create new column with 'Year' as factor wm.df$Year <- as.integer (wm.df$Year) #convert original 'Year' to an integer |
Next I use the floor_date
function from the lubridate
package to create new columns containing the month and week. This will be useful later on when I want to aggregate the frequency counts of watermain breaks by week or month.
1
2
|
wm.df$week <- floor_date (wm.df$Date, unit = "week" ) #floor to week wm.df$month <- floor_date (wm.df$Date, unit = "month" ) #floor to month |
Then I just create a new column that contains the same date information but instead of POSIXct
, it is in Date
format.
1
|
wm.df <- wm.df %>% mutate (date = as.Date (Date)) |
This is not related to the temporal information but a good step with the data exploratory workflow. I want to do a bit a quality control/assurance with the data I have. I use the summary
function to look at the Date, Year, X_coord, and Y_coord columns. First thing that catches my eye is that there seems to be some extreme values or outliers in the spatial information, specifically the X_coord maximum and Y_coord minimum. For geographic information, typical errors include inputting latitude when it should be longitude, and vice-versa.
1
2
3
4
5
6
7
8
|
wm.df %>% select (X_coord, Y_coord) %>% summary () X_coord Y_coord Min. : 294164 Min. : 304472 1st Qu.: 303580 1st Qu.:4838095 Median : 311195 Median :4842843 Mean : 311820 Mean :4841655 3rd Qu.: 318587 3rd Qu.:4846416 Max. :4845682 Max. :4855952 |
First I look at the X_coord first and discover that the maximum value is not likely a Y_coord and it is the only “error” in this column. The code below shows the steps I took to identify and remove this outlier.
1
2
3
4
5
|
wm.df %>% arrange ( desc (X_coord)) #error is X_coord = 4845681.6, not a Y_coord either sort (wm.df$X_coord,decreasing = T)[1:3] # look at the top 3 largest values summary (wm.df$X_coord) #identify the error/outlier wm.df[ which (wm.df$X_coord == max (wm.df$X_coord)),] #identify the row with the error/outlier wm.df <- wm.df[- which (wm.df$X_coord == max (wm.df$X_coord)),] #remove error 2000-01-22 |
Then for the Y_coord I do the same steps and discover there are three errors within this column. Again, the code below shows the steps I took to remove them.
1
2
3
4
|
wm.df %>% arrange (Y_coord) #first three Y_coord are duplicates of X_coord sort (wm.df$Y_coord,decreasing = F)[1:3] # Y_coord errors, three wm.df[ which (wm.df$Y_coord % in % sort (wm.df$Y_coord,decreasing = F)[1:3]),] wm.df <- wm.df[- which (wm.df$Y_coord % in % sort (wm.df$Y_coord,decreasing = F)[1:3]),] #remove these errors |
Now I can aggregate the data by each week, month, and date simply using the count
function from the dplyr
package.
1
2
3
|
month.wm <- wm.df %>% count (month) #month week.wm <- wm.df %>% count (week) year.wm <- wm.df %>% count (Year) |
Data Visualization
Then we can use ggplot2
create a simple plot that visualizes the number of watermain breaks per week from 1990 to 2016.
1
2
3
4
|
ggplot (data = week.wm, aes (x=week,y=n)) + geom_line () + labs (title = "Watermain Breaks in Toronto (1990-2016)" , x = "Year" , y = "Number of Breaks per Week" ) + scale_x_datetime (date_breaks = "2 years" , date_labels = "%Y" ) + theme_minimal () |
It looks like there is some pattern over time with peaks occurring in a periodic fashion. Let’s take a look at the seasonality of watermain breaks by month and week. We can add a new column to our data frame that codes for the month using the mutate
function.
1
2
3
4
5
6
7
8
|
mth.wm <- wm.df %>% group_by (month,Year) %>% count (month) %>% mutate (month_n = as.factor ( month (month, label = T))) mthwk.wm <- wm.df %>% group_by (week,month,Year) %>% count (week,month,Year) %>% mutate (month_n = as.factor ( month (month, label = T))) %>% mutate (yweek = week (week)) |
Then we can visualize this using boxplots representing the number of watermain breaks per month.
1
2
3
4
|
ggplot (mth.wm, aes (x=month_n,y=n)) + geom_boxplot ( aes (group=month_n)) + labs (title = "Seasonality of Watermain Breaks in Toronto (1990-2016)" , x = "Month" , y = "Number of Breaks" ) + theme_minimal () |
Looking at the dark black line in each boxplot, we can see that there is a clear increase in the median number of watermain breaks in Toronto during the colder months (November, December, January, February). Let’s use the nifty animation
package and create a .gif file to see how trend changes over the 27 year period. The boxplots in the animation below are for the entire time period from 1990 to 2016 and the changing line represents the median number of watermain breaks per month in the specified year in the title.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
library (animation) aspect.w <span data-mce-type= "bookmark" id= "mce_SELREST_start" data-mce-style= "overflow:hidden;line-height:0" style= "overflow:hidden;line-height:0" ></span><- 800 aspect.r <- 1.6 title.size <- 20 ani.options (ani.width=aspect.w,ani.height=aspect.w/aspect.r, units= "px" ) saveGIF ( { for (i in unique (mth.wm$Year)) { g.loop <- ggplot (data=mth.wm, aes (x=month_n,y=n)) + geom_boxplot ( aes (group=month_n)) + stat_summary (data = subset (mth.wm, Year == i), fun.y=median, geom= "line" , aes (group=1), color = "#222FC8" , alpha = 0.6, size = 2) + labs (title = paste0 ( "Seasonality of Watermain Breaks in Toronto (" ,i, ")" ), x = "Month" , y = "Number of Breaks per Month" ) + theme (plot.title = element_text (size = title.size, face = "bold" )) print (g.loop) } }, movie.name = "wm_wm.gif" ,interval=0.9,nmax=30,2) |