The Psychic Syndrome: How the Data Science Community Forgot About the Data

When scrolling through social media in March of this year, I could not help but notice the overwhelming number of data science projects on COVID-19. At some point, it seemed as though LinkedIn and Twitter consisted of nothing but forecasts of how the pandemic might play out over the coming weeks. I never really gave this phenomenon much thought. Instead of spending all day reading about how everything will be just fine in one article before learning about the impending Armageddon in the next, I decided to put my phone away more often. However, I recently had the chance to reflect on the dynamics of the data science community at large and arrived at an observation that is as concerning as it is absurd:

While there was little data available, lots of data science projects were completed. Now that far more data has been published, there are significantly fewer data science projects on COVID-19.

It is widely accepted that the amount of collective attention paid to any single event has been steadily declining over the past decades, especially since the arrival of social media in the mid-2000s. A 2019 study by the Technical University of Denmark found that “shorter attention cycles are mainly driven by increasing information flows” [1], suggesting that our addiction to a constant stream of new stimuli pushes us to search for new content faster, even while the topic currently en vogue is still drawing plenty of attention. Although worrisome, this trend is understandable with regard to news or entertainment content that does not require much research and can be published quickly.

Nevertheless, this reasoning fails to explain why so many data science projects were conducted when little to no data was available.

The Psychic Syndrome

Another potential explanation emerges when putting this observation into context. 2020 has been a year of uncertainty and hardship for many due to COVID-19. The problem with uncertainty is that it is uncomfortable. Nobody likes not knowing how events directly affecting one’s personal life might pan out, and quick answers provide a (false) sense of control and security. After all, it is probably no coincidence that the very first astrology newspaper column was published shortly after the stock market crash in late 1929 [2]. Applying this to the way the data science community reacted when COVID-19 cases started increasing led me to formulate the following hypothesis:

(Image by Author)

If true, this would be both absurd and counterintuitive. Since data science projects depend on both the amount and the quality of data, projects completed shortly after a new topic appears offer less reliable results than projects that build on data collected and reviewed for reliability over a longer period of time. Naively, one might assume that data scientists generally lean towards rational thinking as opposed to dysrationalia and should, therefore, be aware of this relationship.

As this is just a hypothesis, let us examine the results of some exploratory research of the data science community’s behavior on Twitter and Kaggle. If you are interested, find a Deepnote notebook with Twitter API code templates here and a GitHub Gist with tweet count data here.

Data Science During COVID-19 On Twitter

In order to get a sense of the data science community’s behavior on Twitter during the pandemic, I analyzed the usage of some hashtags related to data science and COVID-19 over the course of this year. If the psychic syndrome hypothesis holds any value, the assumption would be that lots of people shared their projects around March and April of 2020 followed by a drop in related activity during summer and autumn.

(Image by Author)

Looking at #datascience in combination with either #coronavirus, #covid, or #covid19 gives an interesting first insight. Apparently, the combinations #datascience AND #coronavirus as well as #datascience AND #covid19 were used far more often than #datascience AND #covid. Both of the more popular hashtag combinations spiked around March/April 2020 before experiencing a sharp drop in usage. The most popular combination, #datascience AND #covid19, peaked in April with 12,491 tweets, while there were only 5,533 tweets in October.
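To put the size of that drop into perspective, the short sketch below computes the relative decline between the two tweet counts quoted above. The helper name `relative_decline` is purely illustrative and not taken from the original notebook.

```python
def relative_decline(peak, later):
    """Return the fractional drop from a peak count to a later count."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return (peak - later) / peak


# Tweet counts for "#datascience AND #covid19" as quoted in the text.
april_tweets = 12_491
october_tweets = 5_533

drop = relative_decline(april_tweets, october_tweets)
print(f"Decline from April to October: {drop:.1%}")  # prints "Decline from April to October: 55.7%"
```

In other words, usage of the most popular combination fell by more than half within six months of its peak.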

Tweet Contents In March Versus September

Taking a sample of 1,000 tweets from each of March and September 2020 and visualizing their contents as a graph of common bigrams makes it possible to extract the prevalent themes at each point in time through community detection (more specifically, the Louvain method in this case).
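For readers unfamiliar with bigram graphs, here is a minimal sketch of how such a graph could be assembled from raw tweet texts. It is not the code used for this article: the stopword list, the `min_count` threshold, the function names, and the three sample tweets are all assumptions made for illustration. Each word becomes a node, and each pair of adjacent words (a bigram) becomes a weighted edge.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the one used in the article).
STOPWORDS = {"the", "a", "to", "of", "and", "in", "for", "on", "with", "is"}


def tokenize(text):
    """Lowercase a tweet, drop URLs, and keep alphanumeric tokens that are not stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return [t for t in re.findall(r"[a-z0-9]+", text) if t not in STOPWORDS]


def bigram_edges(tweets, min_count=2):
    """Count adjacent word pairs across tweets and keep the common ones.

    Returns a dict mapping (word_a, word_b) to its count, which can be read
    as a weighted edge list. Stopwords are removed before pairing, so words
    separated only by a stopword are treated as adjacent.
    """
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Free data science course to help fight #covid19",
    "Free data science resources for beginners",
    "Data science can help fight covid19 https://example.com",
]

edges = bigram_edges(sample_tweets)
for (left, right), weight in sorted(edges.items()):
    print(left, right, weight)
```

The resulting weighted edge list can then be handed to a graph library for the community detection step. networkx, for example, ships a `louvain_communities` function, although whether that particular library was used here is not stated.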

The graph below represents common bigrams in 1,000 tweets from March 2020. Evidently, there are a few themes present in the contents. For example, the dark orange community contains words such as “help”, “better”, and “world”, indicating an intention of helping to relieve problems associated with COVID-19 through data science.

Network Graph of Common Bigrams in March 2020 (Image by Author)

Nevertheless, the light blue community to the left, centered around the node “free” and relating to free access to learning resources, as well as the dark green community containing nodes like “projects” and “placements”, may also hint at projects centered around learning and landing jobs associated with COVID-19.

Image for post

This tweet from March 18th provides a good example of the early trends in the data science community during the pandemic. According to the CDC, there were 216 confirmed COVID-19 cases in the United States on March 18th, illustrating just how little data was available. It also shows how little value most early predictions provided given what we now know about the spread of COVID-19. In retrospect, most projects suffered heavily from overfitting to the very limited amount of data available for modeling.

Looking at the graph from September tweets, it becomes clear that there are some differences. The communities related to projects in association with job placements seem to have disappeared while the central node of the light blue community, “daysofcode” (relating to the 100 Days of Code challenge), hints at an increased presence of learning projects related to COVID-19 data.

Network Graph of Common Bigrams in September 2020 (Image by Author)

Generally, it seems that during peak popularity many personal projects, especially focusing on case number forecasts, were conducted whereas the focus in September shifted to using COVID-19 data for learning purposes. The types of communities and tweets seem to support the notion of increased attention to a topic from the data science community in early stages and fading interest afterwards. The texts also suggest that, in addition to providing quick answers and shortening collective attention spans, projects in relation to job placements seem to play a role in the initial wave of interest.

Data Science During COVID-19 On Kaggle

Kaggle usually serves as a good indicator of the data science community’s focus because it allows users to upload their own data as well as create kernels (notebooks). Following the psychic syndrome hypothesis, user activity related to COVID-19 should have reached its maximum around March and April as well.

(Image by Author)

Using the date of the last run of a Kaggle kernel (notebook) or the last update of a dataset tagged “covid19” as popularity measures, one can easily observe that this plot bears an uncanny resemblance to the popularity of the COVID-19-related Twitter hashtags. This might serve as yet another supporting argument for the psychic syndrome hypothesis.
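As a rough illustration of how such a popularity measure can be derived, the sketch below bins a list of last-run dates by calendar month and picks the busiest month. The dates are made up for the example and the function name `items_per_month` is hypothetical; the actual Kaggle metadata behind the plot is not reproduced here.

```python
from collections import Counter
from datetime import date


def items_per_month(dates):
    """Count how many items fall into each (year, month) bucket, in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


# Made-up last-run dates of kernels tagged "covid19" (illustrative only).
last_run_dates = [
    date(2020, 3, 14),
    date(2020, 3, 29),
    date(2020, 4, 2),
    date(2020, 4, 18),
    date(2020, 4, 30),
    date(2020, 9, 7),
]

monthly = items_per_month(last_run_dates)
peak_month = max(monthly, key=monthly.get)

for (year, month), n in monthly.items():
    print(f"{year}-{month:02d}: {n}")
print("Peak month:", peak_month)
```

The same binning works for dataset update dates, which makes the Kaggle and Twitter curves directly comparable month by month.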

Taking Responsibility

With all explored indicators suggesting that the psychic syndrome hypothesis holds at least some merit with regard to the COVID-19 pandemic, it seems that the data science community is not sufficiently aware of its responsibility in the grander scheme of things.

While it may be tempting to throw algorithms at brand new problems, a careful examination of the surrounding environment, as well as of the consequences of spreading predictions of any sort, should precede any modeling efforts. Data scientists live in constant uncertainty, and even if they themselves are aware of this when conducting projects of grave societal importance, the people with whom these predictions are shared might not be. Therefore, in my opinion, it was irresponsible to treat COVID-19 data the same way as the Iris data set. In a world that pays less and less attention, carefully considering, conducting, and wording projects becomes more and more relevant if data science and its community want to remain credible and objective advisors to the public during future times of crisis.

Code templates for the Twitter API:

Start your own Twitter analysis with the templates I used in this Deepnote notebook.


[1] Lorenz-Spreen, P., Mønsted, B.M., Hövel, P. et al. Accelerating dynamics of collective attention. Nat Commun 10, 1759 (2019). https://doi.org/10.1038/s41467-019-09311-w

[2] Smallwood, C., Monroe, R., & Miller, L. (2019, October 21). Astrology in the Age of Uncertainty. Retrieved November 07, 2020, from https://www.newyorker.com/magazine/2019/10/28/astrology-in-the-age-of-uncertainty

Reposted with permission. Source.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.