Has Vision Zero Made NYC Safer?


Editor’s note: Opinions expressed in this post do not necessarily reflect the views of #ODSC.


Oh, glorious, New York City. Concrete jungle and busy people. Although every New Yorker can give you a different version of “Why New York City is the Best City in the World,” there is one thing that is very unlikely to appear on that long list:


Driving in New York City is an absolute nightmare. Non-stop construction, countless one-ways, bus drivers, professional racers taxi drivers, the no-turn sign and camping cop combo, and headlights rendering cyclists and pedestrians blind…… OH my.

Just another busy day in NYC.

The real danger caused by these bad traffic conditions is large number of traffic accidents. In the year 2013 alone there were 297 people killed in traffic accidents within New York City. Therefore, in order to make the roads safer in this city, the city adopted a public safety program, Vision Zero, aimed to reduce traffic fatalities. Now, in July 2016, two and half years after the kick-off of this program, let’s use data science to find out whether Vision Zero has made New Yorkers safer on the roads.

About Vision Zero

Vision Zero originally started in Sweden in October 1997. Since then, countries around the world adopted the model.

Mayor Bill de Blasio launched NYC’s Vision Zero Plan in January 2014, two weeks after his succeeding Michael Bloomberg as the Mayor of New York City. Following the announcement, there was a long list of initiatives released by New York City agencies; for example, reducing the citywide street speed limit from 30 MPH to 25 MPH, installing more red light cameras, and increasing NYPD’s law enforcement against moving violations. Also as a part of this plan, a dataset has been released to the public and has been regularly updated by New York Police Department, which contains the records of each motor vehicle collision that happened within all five boroughs of NYC. This dataset has made it possible to quantify and review how Vision Zero has affected traffic safety in New York City.

About Data

Available at NYC OpenData Website, the motor vehicle collision records within this dataset were surprisingly well structured and relatively clean. As of June 30 2016, there were more than 830,000 records within the dataset, along with 29 variables that describe different aspects of each accident – date, time, location, vehicle type, contributing factors toward accidents, etc. However, before we start to work on numbers, there are some little issues with this dataset that we need solve.

First of all, let’s load data and R libraries for this analysis:

#### Load libraries for the project ####
library(data.table)   # Data Loading
library(dplyr)        # Data Manupulation
library(caTools)      # Data Manupulation
library(ggplot2)      # Visualization
library(ggmap)        # Visualization
library(maps)         # Visualization
library(RColorBrewer) # Visualization

#### Load Data Set ####
# You need to download this data by yourself. 
# See https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
collision <- fread('NYPD_Motor_Vehicle_Collisions.csv', stringsAsFactors = F)

# Transform Date & Time (rounded to hour)
work_dt <- collision %>%  
  mutate(FULLDATE = as.POSIXct(paste(DATE, TIME), 
                               format = '%m/%d/%Y %H:%M',
                               tz = 'EST')
  ) %>%
  mutate(YEAR = year(FULLDATE), 
         MONTH = month(FULLDATE),
         DAY = mday(FULLDATE),
         HOUR = hour(round(FULLDATE, 'hour'))
  ) %>%

Time Range

Records in this dataset trace back to July 1st, 2012, which means in order to do a yearly comparison based on calendar years we’ll have to exclude 2012 and 2016. Those two years don’t have same time scale as 2013 – 2015 (See Figure 1).

# Let's visualize original time range
g <- ggplot(data = sample_set(work_dt),
            aes(x = factor(month(FULLDATE)), 
                y = YEAR))
g +
  geom_tile(aes(fill = factor(YEAR)), color = 'white') +
  # ggtitle('Time Range of NYC Motor Vehicle Collision Data') +
  labs(x = 'Month')+
  theme(legend.position = 'none', 

Figure 1. Time Range of NYC Motor Vehicle Collision Dataset

Fortunately, there are 48 consecutive months of data in this dataset; therefore we can break the records into four 12-month periods, where each period contains same time range and still covers a full cycle of twelve consecutive calendar months (See Figure 2).

# Data group for yearly comparison (Each period July - June)
work_dt[MONTH <  7, PERIOD := paste(YEAR - 1, YEAR, sep = '-')]
work_dt[MONTH >= 7, PERIOD := paste(YEAR, YEAR + 1, sep = '-')]
# Remove 2016.7, since it falls into period 2016-2017
work_dt <- work_dt[PERIOD != '2016-2017']

# Group by 12-month periods for analysis
g +
  geom_tile(aes(fill = PERIOD), color = 'white') +
  labs(x = 'Month') +
  # ggtitle('12 Month Periods For Yearly Comparison') +
  theme(legend.position = 'top',
        legend.title = element_text(face="bold"),

Figure 2. Customized 12-Month Periods For Yearly Comparison

Variable Selection

Since the target of this analysis is to evaluate the efficiency of Vision Zero, not all 29 variables within the dataset are useful for this purpose.  Therefore, number of accidentsfatalities, and injuries were selected as key factors to be used in this analysis.

#### Build subsets for each topic ####
# Temp data table for table construction
right_dt  <- work_dt %>%
  select(-starts_with('CONTRIBUTING FACTOR'), 
         -starts_with('VEHICLE TYPE CODE'), 
         ID = `UNIQUE KEY`)
# Injuries
injur_dt <- melt_count(work_dt, 'ends', 'INJURED', 'stat') %>% 
  merge(right_dt, by = 'ID')
# Deaths
kill_dt <- melt_count(work_dt, 'matches', 'KILLED', 'stat') %>% 
  merge(right_dt, by = 'ID')

Cold Numbers And Colorful Charts

Time to work on statistics!

Total Number of Accidents

Let’s look at the total number of accidents first. Figure 3 below shows the total number of accidents in NYC by each period:

# Total count by period
accident_n_dt <- work_dt %>%
  group_by(PERIOD) %>%
  summarise(N = n())

ggplot(data = accident_n_dt) +
  geom_bar(aes(x = PERIOD, y = N, fill = PERIOD), stat = 'identity') +
  labs(y = NULL, 
       title = 'Number of Motor Vehicle Collisions for Each 12-Month Period') +
  theme(legend.position = 'none',
        axis.title=element_text(size=14,face="bold")) +
  scale_y_continuous(labels = scales::comma) +
  geom_text(aes(x = PERIOD, y = N - 6000, label = N), size = 5)

Figure 3. Total Number of Accidents Each Period

Hmm…not very informative. It shows an increase of total numbers in each period, but given the large absolute numbers in each period, the changes are not very clear. Let’s try to pivot this chart into growth rates for a better view:

# Accident count percentage change
accident_p_vec <- c(0, tail(accident_n_dt$N, nrow(accident_n_dt) - 1) / 
                head(accident_n_dt$N, nrow(accident_n_dt) - 1) - 1)

# Number of registered vehicles in NYC
# https://dmv.ny.gov/about-dmv/statistical-summaries
reg_nyc <- data.table(Year = seq(2012, 2015),
                      Reg_num = c(1978392,

# Registration percentage change
reg_p_vec <- c(0, tail(reg_nyc$Reg_num, nrow(reg_nyc)-1) /
                 head(reg_nyc$Reg_num, nrow(reg_nyc)-1) - 1)

temp <- data.frame(PERIOD = rep(sort(unique(work_dt$PERIOD)), 2),
                   TYPE   = rep(c('Motor Vehicle Collision',
                                  'Vehicle Registration in NYC'), each = 4),
                   VALUE  = c(accident_p_vec, reg_p_vec))

# Visualize changes
ggplot(data = temp,
       aes(x = PERIOD, y = VALUE, color = TYPE, group = TYPE)) + 
  geom_point(size = 3) +
  geom_line(size = 2, alpha = .8) +
  scale_y_continuous(labels = scales::percent) +
  theme(legend.position = 'none') +
  labs(title = 'Yearly Growth Rate', y = '% Growth Since Last Year') +
  facet_wrap( ~ TYPE, nrow = 2)

Figure 4. Change Rates of Car Accidents and Vehicle Registration in NYC by Each Period

Figure 4 looks much better! Now look at the upper part of this chart, the red line illustrates the growth rate of the total number of car accidents in each period (growth in period 2012-2013 was 0% because it has no prior numbers to be compared with). It seems the number has been increasing in each period. Only period 2014-2015, which covers the time point of Vision Zero’s announcement,  shows a slower growth.

Now, does this suggest Vision Zero has been failing? Not necessarily. Look at the lower part of this chart, the blue line shows the percentage growth of vehicle registration in New York City (growth rates calculated based on data from NY DMV), and it has been growing at a similar rate as accident growth! Maybe that explains why accidents have been increasing consistently?

Let’s explore more about other aspects of accidents.


# Injury count by type
injur_n_dt <- injur_dt               %>%
  group_by(PERIOD, INJURED)          %>%
  summarise(Injury = sum(INJURED_V))  %>%
  filter(!grepl('PERSONS', INJURED)) %>% # Num. of persons is the total of 
  # other three stats
  arrange(desc(Injury)) %>%
  mutate(INJURED = INJURED %>%
           gsub('NUMBER OF ', '', .) %>%
           gsub(' INJURED', '', .) %>% 
           factor(levels = c('CYCLIST', 

# Injury total by period
injur_n_label_dt <- injur_n_dt %>%
  group_by(PERIOD) %>%
  summarise(N = sum(Injury))

## Visualization
g <- ggplot(data = injur_n_dt) +
  geom_bar(aes(x = PERIOD, y = Injury, fill = reorder(INJURED, Injury)), 
           stat = 'identity') +
  scale_fill_manual(values = c("#468966", "#77C4D3", "#FFB03B")) +
  guides(fill = guide_legend(ncol = 3, title = 'Injury Type')) +
  theme(legend.position = 'top',

# Injury per Period - Total
g + geom_text(data = injur_n_label_dt,
              aes(x = PERIOD, y = N + 2000, label = N), size = 5)

# Break down
g + facet_wrap(~ INJURED)

Figure 5. Total Number of Injuries in NYC Per Period

Figure 6. Total Number of Injuries in NYC Per Period by Types

# Yearly Average Injury per Accident
temp <- injur_n_label_dt %>% 
  mutate(injur_rate = N / accident_n_dt$N)

ggplot(data = temp, aes(x = PERIOD, y = injur_rate)) +
  geom_line(group = 1, size = 2, alpha = .5, color = 'red') +
  geom_label(aes(label = round(injur_rate, 3))) +
  labs(x = NULL, y = 'Injury Per Accident') +
  theme(legend.position = 'none', 
        axis.title=element_text(size=14,face="bold")) +
  coord_cartesian(ylim = c(0, .5))

Figure 5 – 7 summarized total injuries caused by traffic accidents in each period. Interestingly… although total accidents are growing, injuries are relatively stable — except in 2014-2015 period, which had a significant drop in injury count.

What about each injury type?

Injuries broken down by category show a mixed pattern: Cyclist injuries, although small in absolute numbers, has been consistently climbing over the years; Pedestrian injuries stayed low after the big drop in 2014-2015; Motorist injuries, as the most common injury type, followed the exact same drop-bounce pattern as overall numbers.

Increasing accidents and fluctuating injuries.  Things are not looking good well so far. Before making a conclusion to this analysis, the last but most important statistical figure needs to be checked – traffic fatalities.



# Data
death_n_dt <- kill_dt               %>%
  group_by(PERIOD, KILLED)          %>%
  summarise(Death = sum(KILLED_V))  %>%
  filter(!grepl('PERSONS', KILLED)) %>% # Num. of persons is the total of 
                                        # other three stats

death_n_dt$KILLED <- death_n_dt$KILLED %>% 
  gsub('NUMBER OF ', '', .) %>%
  gsub(' KILLED', '', .)

death_n_label_dt <- death_n_dt %>%
  group_by(PERIOD) %>%
  summarise(N = sum(Death))

# Plotting
g <- ggplot(data = death_n_dt) +
  geom_bar(aes(x = PERIOD, y = Death, fill = reorder(KILLED, Death)), 
           stat = 'identity') +
  scale_fill_manual(values = c("#468966", "#77C4D3", "#FFB03B")) +
  guides(fill = guide_legend(ncol = 3, title = 'Killed Type')) +
  theme(legend.position = 'top',

g + geom_text(data = death_n_label_dt,
              aes(x = PERIOD, y = N + 20, label = N), size = 5)
# Break down
g + facet_wrap(~ KILLED)
Figure 8. Total Number of Deaths in NYC Per Period

Figure 9. Total Number of Deaths in NYC per Period, Breakdown by Victim Types

# deaths per 1000 accidents
temp <- data.frame(PERIOD = death_n_label_dt$PERIOD,
                   d_rate = death_n_label_dt$N / accident_n_dt$N * 1000)
ggplot(data = temp, aes(x = PERIOD, y = d_rate)) +
  geom_point(shape = 16, size = 3, color = 'red', alpha = .5) +
  geom_line(aes(group = 1), size = 2, color = 'red', alpha = .5) +
  labs(x = NULL, y = 'Deaths Per 1000 Accidents') +
  theme(legend.position = 'none') +
  geom_label(aes(y = d_rate + .03, label = round(d_rate, 2)))

Figure 10. Deaths per 1000 Motor Vehicle Collisions in NYC per 12-month Period

Look, a positive outcome! After its peak in the period 2013-2014, total traffic fatalities in NYC dropped 29% from 294 to 228 per 12-month period, and the breakdown of the death count by each victim type confirmed the same improvement (except for the consistently growing number of cyclists victims, 🙁 poor cyclists ). Given the fact that total accident number has been increasing, this drop in traffic fatalities is a real achievement, as deaths per 1000 accidents decreased 39% from nearly 1.5 before Vision Zero’s announcement, to slightly above 1 in the last period.


What’s More?

Given so much information provided in this Motor Vehicle Collision dataset, we can do a lot more analysis to explore NYC’s traffic conditions. For example, we can use coordinate info to find locations with the most motor vehicle collisions in New York City (coordinates rounded to 2 decimal points):

# Load NYC map
# Since this line actually run a query on Google Maps, 
# You might want to save this map as RData for furture use
# If you don't know how to do it, please check save() and load() functions
# in the comment line below
nyc_map <- get_map(location = find_map_cent(mvc_count$LON,
                   zoom = 11)
# save(nyc_map, file = 'nyc_map.RData')
# load('nyc_map.RData')

# Accident count by location
mvc_count <- work_dt %>%
  filter(!(is.na(LATITUDE) | is.na(LONGITUDE))) %>%
  mutate(LAT = round(LATITUDE, 2),
         LON = round(LONGITUDE, 2)) %>%
  group_by(LAT, LON) %>%
  summarise(N = n())

# Top deaths by location
mvc_dth <- kill_dt %>%
  filter(!(is.na(LATITUDE) | is.na(LONGITUDE))) %>%
  filter(grepl('PERSONS', KILLED)) %>%
  mutate(LAT = round(LATITUDE, 2),
         LON = round(LONGITUDE, 2)) %>%
  select(KILLED_V, LAT, LON) %>%
  group_by(LAT, LON) %>%
  summarise(KILLED_V = sum(KILLED_V)) %>%
  arrange(desc(KILLED_V)) %>%
  # table(mvc_sum$KILLED_V)
  #   1   2   3   4   5 {}  6   7   8  10 
  # 231 121  50  17  13 {}  8   4   2   1
                      # ^ Break here
  filter(KILLED_V >5)

# Mapping accident counts
ggmap(nyc_map, extent = 'device') +
  geom_point(data = mvc_count, alpha = .5,
             aes(x = LON, y = LAT, color = N, size = N)) +
  scale_color_gradient(low = 'white', high = 'red') +
  theme(legend.justification=c(1,1), legend.position=c(.95,.95),
        legend.background = element_rect(color = 'black',
                                         fill = alpha('white', 0.8)),
        legend.title = element_blank()) +
  guides(size = FALSE) +
  ggtitle('Accident Frequencies by Location')

Ha! Looks like the majority of accidents happened in Manhattan; more specifically, Midtown Manhattan and Soho-Little Italy look like traffic accident black holes on this map. This seems very reasonable because those two areas are lying right between the major paths in and out of this narrow borough — Lincoln Tunnel, Queensboro Bridge, Queens-Midtown Tunnel, Holland Tunnel and Williamsburg Bridge. Maybe that’s why Google Maps directs users who want to cross Manhattan to head north?

Similarly, another map can be made to find locations with highest number of deaths since July 1, 2012:

# Top Locations with most deaths
ggmap(nyc_map, extent = 'device') +
  geom_point(data = mvc_dth,
             aes(x = LON, y = LAT,
                 size = KILLED_V,
                 fill = KILLED_V),
             shape = 21, alpha = .5) +
  scale_size_area(max_size = 10) +
  scale_fill_continuous(low = 'yellow', high = 'red') +
  geom_text(data = mvc_dth, 
            aes(x = LON, y = LAT, label = KILLED_V),
            size = 4, check_overlap = T, fontface = "bold") +
  theme(legend.position = 'none') +
  ggtitle('Deadlist Locations July 2012 - June 2016')


Hmm… Look at those scary dots around Central Park — Midtown Manhattan, Morningside Heights, and East Harlem. This kind of chart can definitely help decision makers to decide which areas should have more traffic control devices installed.


Has Vision Zero Made NYC Safer?

While there is not enough evidence of better conditions in total accidents and injuries, death related statistics did improve significantly after Vision Zero’s initiation. Considering that the top priority of this plan is to reduce traffic fatalities, I would say Vision Zero has indeed been making NYC safer by making this city’s traffic less fatal.

What needs to be improved

Although total accidents and injuries appeared to reduce for a short period after the NYC’s Vision Zero announcement, numbers pushed back later. Therefore, NYC government should continue working hard to improve facilities and road conditions in order to reduce traffic related injuries.

Secondly, it seems Vision Zero does not benefit cyclists in New York City. Both injuries and deaths of cyclists caused by traffic accidents grew consistently from 2012-2016. Road conditions in NYC are still quite hostile to those bicycle riders; there are very limited bicycle lanes across all five boroughs. Therefore, city government may need to consider paying more attention to secure this group’s safety on roads. With the advent of public bicycle sharing systems, like Citibike, that serve many busy areas in NYC, one can imagine that the transit option will grow in popularity.

New York, New York

Assuming that Midtown Manhattan has the busiest traffic, it is not surprising that there are more injuries and fatalities there than any other part of NYC. The city should recognized this large amount of traffic fatalities in upper Manhattan and take action to analyze and improve traffic safety in those areas.

Contributed byJonathan Liu. He takes the NYC Data Science Academy 12 week full time Data Science Bootcamp program from July 5th to September 22nd, 2016. This post is based on their first class project – the Exploratory Data Analysis Visualization Project, due on the 2nd week of the program. You can find the original article here The complete R Script used in this article is available on Jonathan’s Github