Why Personal COVID-19 Vaccination Data Should Remain Private

Earlier this month, it was reported that the Centers for Disease Control and Prevention (CDC) instructed states to sign data use agreements that commit them, for the first time, to sharing personal information held in existing registries (names, birth dates, ethnicities, and addresses) with the federal government. This is part of an effort to help us understand more about the spread and efficacy of the newly approved COVID-19 vaccine. As expected, some states are refusing to share this COVID-19 vaccination data, citing fears that a federal COVID-19 vaccination registry could be misused or compromised. But at what point does the public health benefit outweigh privacy concerns?

To put it simply, it doesn’t. And since the original story broke, it has been reported that the federal government will no longer require states to share personal identifiers when distributing the COVID-19 vaccine. That said, all but a handful of states agreed to turn over the information originally requested, which sets an unsettling precedent for how our data is collected and used, and for the consequences that can follow. As the news cycle moves forward, we’re now in the midst of one of the worst security breaches ever experienced by US government entities, which should make us question how our data is handled and just how vulnerable it really is.

Through the lens of a global pandemic, it may seem counterintuitive to refuse to share this information, but in addition to the risks associated with collecting personal COVID-19 vaccination data, there’s virtually no benefit to it. We don’t currently collect this data at the state or federal level and there’s no reason to. At the population level, it’s valuable for the government to know how many people are immunized, but you don’t need to share personal data to achieve this. A local pharmacy that administers the vaccine can report how many doses it administered; combined with the counts from other vaccination sites, that tells you what percentage of the population is immunized in each zip code.
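To make the arithmetic concrete, here is a minimal sketch of how per-site dose counts could be rolled up into a coverage estimate per zip code without any personal identifiers. The function name, the input shapes, and the `doses_per_person` parameter (a two-dose regimen is assumed) are illustrative assumptions, not a description of any existing reporting system. It also assumes people are vaccinated in the zip code where they live, which will not always hold.

```python
from collections import defaultdict


def coverage_by_zip(site_reports, population_by_zip, doses_per_person=2):
    """Estimate the immunized share of each zip code from aggregate counts.

    site_reports: iterable of (zip_code, doses_administered) pairs, one per
        vaccination site. No names, birth dates, or addresses are involved.
    population_by_zip: dict mapping zip_code to resident population.
    doses_per_person: doses needed for full immunization (assumed to be 2).
    """
    if doses_per_person <= 0:
        raise ValueError("doses_per_person must be positive")

    # Sum the doses reported by every site in the same zip code.
    doses = defaultdict(int)
    for zip_code, count in site_reports:
        doses[zip_code] += count

    coverage = {}
    for zip_code, total in doses.items():
        population = population_by_zip.get(zip_code)
        if not population:
            continue  # no population figure available, so no estimate
        # Cap at 1.0: sites may also serve people from neighboring zip codes.
        coverage[zip_code] = min(1.0, total / doses_per_person / population)
    return coverage


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    reports = [("94105", 400), ("94105", 600), ("10001", 300)]
    population = {"94105": 5000, "10001": 3000}
    for zip_code, share in sorted(coverage_by_zip(reports, population).items()):
        print(f"{zip_code}: {share:.1%} immunized (estimate)")
```

The only inputs are counts that each site already knows, which is the point: the estimate never requires a record about any individual.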

Population data is broader than personal data, but there’s actually a lot we can learn from collecting it. Let’s take Google Flu Trends, for example. Ultimately, this was not a successful project, but it produced a lot of insights we can use to observe patterns in how sickness spreads. With the common flu, cases are often undercounted because people typically don’t get tested, and some people show no symptoms at all. Even with underreporting, social surveillance gives us visibility into what related topics people are searching for and talking about online, and that activity can be correlated with the zip codes where infections are occurring. While we can’t make any inference from one case, at a population level, if a number of people in one area are searching for or talking about flu symptoms, disease data from previous years can be used to estimate where and how fast a disease is spreading.

By referencing flu data from the past several years, you can compare these online signals with the actual flu symptoms reported by healthcare providers. In the past five years, whenever we’ve experienced a spike in searches or social media posts about symptoms, we’ve seen more confirmed symptoms in that local area several days later. Based on past and present data at the population level, we can determine how strong the correlation is. This requires no sharing of personal data, and only minimal effort to benchmark against the data from years past.
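As a rough illustration of what measuring that correlation could look like, the sketch below computes a Pearson correlation between a daily series of symptom searches and a daily series of provider-reported counts, shifted by a candidate number of days. The function names and the idea of scanning a range of lags are assumptions made for this example; this is not the method used by any particular surveillance project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(searches, reported, lag_days):
    """Correlate searches on day t with reported counts on day t + lag_days.

    Both inputs are daily, area-level totals of the same length.
    """
    if len(searches) != len(reported):
        raise ValueError("series must cover the same days")
    if lag_days < 0 or lag_days >= len(searches) - 1:
        raise ValueError("lag_days is out of range for these series")
    if lag_days == 0:
        return pearson(searches, reported)
    return pearson(searches[:-lag_days], reported[lag_days:])


def best_lag(searches, reported, max_lag):
    """Return the lag (in days, 0..max_lag) with the strongest correlation."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(searches, reported, lag),
    )
```

Run against several past seasons, a consistently high correlation at a lag of a few days would support using search activity as an early signal, and a weak or unstable one would argue against it. Either way, the inputs are area-level daily totals, not records about individuals.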

On the other hand, let’s play out the scenario in which the CDC succeeds in its quest for state-sanctioned personal COVID-19 vaccine data. Suppose California, Washington, New York, and Massachusetts refuse to sign over this information to the federal government, and the rest of the country complies. Even if you get some insight from the information collected, you exclude a significant chunk of the population, so the results will be biased and unrepresentative of the greater US. The collection itself can also skew who shows up: immigration organizations have expressed concerns that personal identifiers could expose undocumented immigrants, which would deter them from getting vaccinated. Gaps like this are already a big challenge for our healthcare system. For example, most clinical trial participants come from a handful of states, which leaves out entire geographies and demographic groups on whom treatments are never tested before they go to market.

The only reason for needing personal COVID-19 vaccination data is to correlate immunization with other health symptoms of the same patients. For example, this would give visibility into how many people developed a fever, rash, or other negative side effects after the treatment. However, to make this happen you would need the patient’s full medical record, which we do not share, do not intend to share, and are not technically able to share at scale. Therefore, there is little value here for the overhead. As it stands, the collection of this personally identifiable health data, and any access to or use of it, would be subject to HIPAA.

If we’re looking for better and faster real-world evidence about the safety and efficacy of vaccines, the better approach is to leverage large healthcare systems that already have the full, fine-grained clinical data about patients and can link medical histories, pre-existing conditions, and timing of when vaccines are administered. Funding these entities to carry out research with the information they’ve already collected—and requiring them to share it publicly—may produce the most relevant results. Such an approach does not require a new data collection initiative or the overhead and privacy risks it implies, and will also accelerate similar future efforts beyond the COVID-19 pandemic.

It’s not a wise practice to store sensitive, personal data of an entire population in one place unless there is a very strong reason. COVID-19 research is not that reason, and above all, personal information is not necessary for discovery purposes. Just determining which government agencies and teams would have access to the data and for what reasons would take longer than the time span in which the data would be relevant. By using population-level data and benchmarking it against data we already have access to, we can avoid compromising personal information and misrepresenting the population, and still glean the insights we need about the game-changing COVID-19 vaccine and others to come.

David Talby

David Talby is a chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.