When you hear the phrase “corporate data science,” what comes to mind? You may think of recommender systems for video streaming platforms or optimizing the click-through rate for online advertisements. When thinking about “academic data science,” maybe you are reminded of some white papers on arXiv.org with tons of linear algebra and multivariate calculus. What comes to mind when you think of “civic data science”?
Before I joined Code for Boston nearly two years ago, the idea of using data science in the context of non-academic, non-profit community work was not something I had considered. Today, I am proud to say that our team at Code For Boston, the Safe Water team, put our data science chops to good use for the Charles River Watershed Association (CRWA), the organization that oversees monitoring and reporting for the Charles River. We conducted a few separate projects for the CRWA:
1. A publicly available ArcGIS dashboard and data ingestion pipeline for their monthly water sample results. You can see this work here.
2. A web app that reads in live data and displays predictions regarding the safety of the Charles River for recreational boating. You can see this work here.
3. Surveying and research into the users of CRWA’s website for information that may assist in future website design.
Our work automated and modernized many parts of the CRWA’s technical infrastructure. Code For Boston’s work will save the CRWA countless hours of tedious in-house work per year, which gives them more time to focus on other more specialized and interesting projects for the greater Boston community, such as collecting cyanobacteria samples from the Charles River.
Struggles and false starts
Most Code For Boston teams start with an external stakeholder who has a software-related challenge, and requirements are scoped out in conversations with that stakeholder. The Safe Water team started backwards: we were a group of data scientists in search of a civic problem. In the meantime, we worked with publicly available data, became subject matter experts on drinking water, and tried to predict contamination levels in drinking water systems across the United States.
Eventually it became clear that there were a few challenges with this approach:
1. Prediction is the act of providing an estimate for a value in lieu of observing it. Often you will do this because the value is difficult or impossible to measure at time of prediction (e.g. it’s a future value). The EPA requires public drinking water systems to monitor for various contaminants. So, what is there to predict if these values are already measured? Were we trying to predict the future? We never moved beyond contemporaneous predictions before we switched gears, but even if we had, wouldn’t future contamination be best predicted by prior contamination?
2. Was anyone interested in what we were doing? Would this work end up benefiting anyone? Drinking water contamination is an important issue, but many people already work on water contamination reporting. It was unclear that we were on a path to contribute anything valuable. We were aware of this problem, and at one point focused a lot of our energy on PFAS research to do something unique. However, we learned that one reason there is little research into PFAS and its effects is that there’s very little data on PFAS.
3. It was difficult to convince new volunteers that the best value-add to the project was data collection and cleaning. We had already done seemingly everything we could with the EPA data we had, and there was no clear benefit to running yet another logistic regression. This is partly a project management challenge (convincing volunteers to do the less fun yet necessary parts of data science work), but it was also a project conceptualization challenge: because our goals were ill-defined, the tasks required to achieve the goal were not clear either.
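The question raised in the first challenge, whether future contamination is best predicted by prior contamination, is commonly checked with a persistence baseline: forecast each period as the previous period's value, measure the error, and ask whether any model beats it. The following is a minimal sketch of that check, not something from our project; the function names are hypothetical and any input series is illustrative.

```python
from typing import List, Sequence


def persistence_forecast(series: Sequence[float]) -> List[float]:
    """Forecast each period as the value observed in the previous period."""
    return list(series[:-1])


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between observed and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def persistence_mae(series: Sequence[float]) -> float:
    """Error of the 'next value equals last value' baseline on one series."""
    actual = list(series[1:])  # every observation that has a predecessor
    return mean_absolute_error(actual, persistence_forecast(series))
```

If a more elaborate model cannot produce a lower error than `persistence_mae` on held-out measurements, the simple "last value" rule is hard to justify replacing.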
As these problems became apparent, we stopped worrying about building machine learning models, and instead focused on prospective stakeholder outreach and defining our goals. Andrew Seeder deserves a lot of praise here: he was able to gather a list of interested prospective stakeholders who were doing water-related work.
We narrowed the list down and decided to partner with the Charles River Watershed Association. Our data science project finally had a stakeholder that used data for its day-to-day operations, including the maintenance of a predictive model used for E. coli monitoring of the Charles River. Their existing infrastructure (Google Sheets and an old Microsoft Access database) relied on a lot of ad hoc and manual data ingestion and reporting, and our goal was to modernize this infrastructure.
Did the machines learn?
The prototypical data science workflow is all about gathering data, analyzing data, transforming data, and running algorithms. This idea of data science imagines data science projects like this:
This view of data science makes sense for research work. When designing a system that takes in live data, however, the algorithm often ends up being a small part of the bigger picture. This is what our final system design looks like for the CRWA’s flagging website:
As you can see in this case, the algorithm, while still central to the overall purpose of the work, is a much smaller portion of the overall project.
The CRWA’s problem was not that they needed to swap out one predictive model for a better model. They already had a model, and we had no reason to believe it was predicting poorly. Their problem was that they did not have an effective system for running this predictive model.
It’s safe to say that building a system to run this model automatically is a major quality-of-life improvement for the CRWA. This was their old system:
– Download the data manually in Google Sheets.
– Clean it if need be. (The weather station device they use occasionally has some data cleanliness issues.)
– Run it through a Google Sheet that does feature transformations.
– Update a static HTML web page to reflect the new outputs.
And this is their new system:
– Do nothing. (It’s all done automatically.)
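The manual checklist above (download, clean, transform, publish) maps naturally onto a chain of small functions that a scheduler can run unattended. The following is a minimal sketch of that idea and not the CRWA's actual code: the `Reading` schema, the chosen features, the logistic coefficients, the 0.5 threshold, and the HTML snippet are all placeholders invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Reading:
    """One weather-station reading (hypothetical schema)."""
    rainfall_in: Optional[float]   # None marks a missing or corrupt value
    water_temp_f: Optional[float]


def clean(readings: List[Reading]) -> List[Reading]:
    """Drop readings with missing values or negative rainfall."""
    return [
        r for r in readings
        if r.rainfall_in is not None
        and r.water_temp_f is not None
        and r.rainfall_in >= 0
    ]


def features(readings: List[Reading]) -> Tuple[float, float]:
    """Feature transformation: total rainfall and mean water temperature."""
    total_rain = sum(r.rainfall_in for r in readings)
    mean_temp = sum(r.water_temp_f for r in readings) / len(readings)
    return total_rain, mean_temp


def predict_unsafe_probability(total_rain: float, mean_temp: float) -> float:
    """Stand-in logistic model; the coefficients are made up, not the CRWA's."""
    z = -3.0 + 2.5 * total_rain + 0.01 * mean_temp
    return 1.0 / (1.0 + math.exp(-z))


def render_flag(probability: float, threshold: float = 0.5) -> str:
    """Build the HTML fragment that previously had to be edited by hand."""
    label = "unsafe" if probability >= threshold else "safe"
    return f'<p class="flag">{label}</p>'


def run_pipeline(readings: List[Reading]) -> str:
    """Clean, transform, predict, and render in one unattended step."""
    cleaned = clean(readings)
    if not cleaned:
        raise ValueError("no usable readings after cleaning")
    return render_flag(predict_unsafe_probability(*features(cleaned)))
```

In a sketch like this, the model itself is a single short function. Everything around it is ingestion, validation, and presentation.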
It’s fair to ask if this counts as “data science” work as opposed to being full-stack engineering. After all, the machines didn’t do any learning. Experienced private sector data scientists know that a lot of the typical data science day-to-day is, as the New York Times puts it, “janitor work”: cleaning data, gathering data, SQL queries, building tracking for your models, one-off analytics tasks, designing systems for production, and occasionally actually building models. The work we did for the CRWA was an even more extreme version of this dynamic. Scikit-Learn was never included in our website’s
In the private sector, it’s easy to take for granted that the machine learning infrastructure needed to run and deploy models will simply be there. Building out a full reporting system around an existing model has given us an appreciation for the hard work involved in designing those data systems. They only exist as a result of lots of tedious and specialized platform engineering work.
There are two main takeaways from this section on civic data science:
1. Say thank you to your organization’s platform engineers.
2. Data scientists working entirely out of Jupyter Notebooks should give the system engineering part of the work a try. By doing so, you will gain a greater appreciation of how challenging that side of the work is, and you will become a more well-rounded data scientist.
Civic data science is pragmatic and rewarding
In writing this article, I spoke with a few data scientists in the non-profit sector to understand their experiences and how they compared to ours. Were platform limitations and data ingestion bigger bottlenecks than algorithmic ingenuity elsewhere in the civic tech world? Or was the Water Team’s experience an aberration?
I first spoke with Brittany Bennett, who wrote about non-profit data work in her own words. Her piece and our conversation outlined the necessity of being pragmatically minded in the civic data science tech world, where budgets are low, infrastructure is limited, subject matter expertise is key, and fast turnaround on simple compelling analytics is the name of the game.
I also spoke with Gowtham Asokan, a project manager at BU Spark!, which is a technology incubator at Boston University that works on civic data science projects. There, they connect external stakeholders with students in a way that’s similar to Code For Boston’s organizing model, except the students are getting course credit. Some of their projects include collaborating with the ACLU Massachusetts on data analysis of political donations by police officers, and building a website to help citizens identify victims of modern slavery.
Subject matter experts are a key part of what makes BU Spark! successful. “We really focus on an interdisciplinary approach,” Gowtham told me. “We work in the civic tech space overall, but we bring in folks from biomedical sciences, social sciences, traditional CS backgrounds, and so on.”
In our conversation, I told Gowtham that the Water Project struggled to attract new members once they learned that our data science involved less algorithmic work and more dirty work than they had expected, and I asked him whether he faced a similar challenge. “Not a lot of students are surprised by the dirty work, which is surprising,” he said. “A lot of students are interested in the class precisely because it works on real civic problems. They really enjoy the social impact the project is having.”
This resonated a lot with me; using my technical skills for the greater good was what drew me to Code For Boston in the first place. In pitching my project to new members, I emphasized that it was a great project if they were interested in data science, and I described the technical stack we were working with.
Before building a house, you need to lay a foundation. In civic projects, data is going to be messy. The methods can be simple, but their implementation is complex. Using the tools and techniques we have as data scientists, we were able to build data pipelines that work for our partners. We made data science work for them. Hopefully, this foundation leads to future projects where we can help them with models, prediction, and the other techniques we love and value as data scientists.
Water Project Contributors
- Bertie Ancona
- Ben Chang
- Dan Eder
- Cameron Reaves
- Daniel Reeves
- Alex Rutfield
- Lewis Staples
- Edward Sun
- Dani Valades
- Emi Gaal
- Chris Larue
- Josie-Dee Seagren
E. coli dashboard:
- SJ Clarissa Choi
- Francois Delavy
- Jessy H.
- Amanda Holmes
- Kevin Liu
- Bhushan Choudhari
- Nicholas Jin
- Francois Delavy
- Daniel Reeves
- Andrew Seeder
- Anita Yip
About the Author of Civic Data Science
Daniel Reeves is a data scientist at Hopper, and a volunteer at Code For Boston. Code For Boston is a volunteer organization that uses technology to solve social and civic challenges. We meet on Tuesday nights, and new members are always welcome.