September has been an impressive month for data science research.
Here, we highlight a few innovative studies released on arXiv, the research aggregator maintained by Cornell University Library. This research dives into some of the most important facets of data science today, including deep learning, machine learning, artificial intelligence, and natural language processing.
Artificial intelligence: Addressing health inequality in the United States
Researchers fear artificial intelligence technology could be widening healthcare inequality. But a team from Stanford and two Indian technology institutes used AI to identify actionable public policy to address the life expectancy gaps that exist in the United States.
The researchers built a Bayesian Decision Network using county-level data with healthcare, socio-economic, behavioral, education, and demographic features — life components where inequities often result in decreased life expectancy. They also quantified the impacts of diversity, preventive-care quality, and stable families.
The AI developed an understanding of health inequality in the United States and answered directed questions. When researchers queried “What minimizes the longevity gap between the lowest and the highest income quartiles in the Health Inequality Data?” the framework determined it was the population diversity of the county. When they asked “What maximizes the mean life expectancy in males and females?” it answered preventive care quality.
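To give a flavor of how such queries work, here is a toy discrete Bayesian network answered by exhaustive enumeration. This is not the paper's network: the variables, states, and probability tables below are entirely hypothetical, chosen only to show how evidence ("high diversity") changes the probability of an outcome ("large longevity gap").

```python
# Toy Bayesian network: diversity and preventive-care quality influence the
# longevity gap. All variables are binary (0 = "low", 1 = "high") and every
# probability here is made up for illustration.
from itertools import product

p_diversity = {0: 0.5, 1: 0.5}
p_care = {0: 0.6, 1: 0.4}
# Conditional table: (diversity, care) -> P(gap = "large")
p_gap_large = {(0, 0): 0.7, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.2}

def p_joint(d, c, g):
    """Joint probability of one full assignment (g: 1 = large gap)."""
    pg = p_gap_large[(d, c)]
    return p_diversity[d] * p_care[c] * (pg if g == 1 else 1 - pg)

def query_gap_given(d=None, c=None):
    """P(gap = large | evidence) by summing over all consistent assignments."""
    num = den = 0.0
    for dd, cc, gg in product([0, 1], repeat=3):
        if d is not None and dd != d:
            continue
        if c is not None and cc != c:
            continue
        p = p_joint(dd, cc, gg)
        den += p
        if gg == 1:
            num += p
    return num / den

# A "what minimizes the gap?" style query: compare evidence settings.
print(query_gap_given(d=1))  # evidence: high diversity
print(query_gap_given(d=0))  # evidence: low diversity
```

Real frameworks learn the network structure and tables from data and use far more efficient inference than enumeration, but the query pattern — condition on a feature, read off the change in an outcome's probability — is the same.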
Researchers then created an interactive web application for users and policymakers to more easily find the policies and solutions that the framework recommends, which will eventually be available on GitHub.
Machine learning: Detecting hate speech and offensive language on Twitter
Research out of India’s Maharashtra Institute of Technology reports a model with 95 percent accuracy at differentiating between hate speech, offensive language, and clean language.
In seeking out “toxic content” in tweets, the logistic regression model was able to differentiate between offensive tweets — those containing slurs and derogatory terms — and hate speech, which attacks a person or group based on attributes like race, ethnicity, religion, gender, sexuality, and disability. The model makes these classifications based on the term frequency-inverse document frequency (TF-IDF) values of n-grams.
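A minimal sketch of that pipeline idea — TF-IDF n-gram features feeding a logistic regression classifier — looks like this in scikit-learn. The tiny dataset and its labels are invented for illustration and are nothing like the paper's training corpus.

```python
# Sketch: TF-IDF n-gram features into logistic regression.
# The four "tweets" and their labels are made up for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "have a great day everyone",       # clean
    "lovely weather for a run today",  # clean
    "you are such a pathetic loser",   # offensive
    "what an absolute idiot",          # offensive
]
labels = ["clean", "clean", "offensive", "offensive"]

# Unigrams and bigrams, weighted by TF-IDF, as in the paper's feature scheme.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(tweets, labels)

print(model.predict(["you pathetic idiot"]))
```

The real task is three-way (hate speech, offensive, clean) and needs a large labeled corpus, but the structure — vectorize, weight by TF-IDF, fit a linear classifier — is the same.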
The authors also created an application that filters hateful and offensive tweets by a user or identifies those tweets on their timeline. A model like this can be powerful in an age where the anonymity of the internet encourages cyberbullying, racism, and outrageous verbal assaults. But data science experts worry that these models could also be utilized to track political dissent and limit free speech.
Deep learning: Automated classification of sleep stages
Classification of sleep stages — or sleep scoring — is essential in sleep studies. A group of four researchers just found a way to automate the tedious and lengthy process in mice by feeding recordings of the electrical activity in their brains and skeletal muscles into a deep neural network. They expect similar models will enable long-term sleep studies and help researchers be more productive with their time and funding.
Sleep scoring has typically been based on manually defined features, but the study’s authors show that with enough data, models can learn predictive features automatically. The model was also able to classify the data at a higher time resolution, allowing the researchers to track mice’s frequent changes in sleep stages.
Their model correctly classified more than 90 percent of REM, non-REM, and wake stages in time periods not influenced by the study’s procedures. The model’s accuracy fell during periods that were impacted by the study’s procedures, but they found feeding in more recording channels could partially counteract that.
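A full raw-signal deep network is beyond a short sketch, but the shape of the task can be illustrated on synthetic data: epochs of signal whose frequency content differs by stage, classified by a small neural network. Everything below — the stage names' dominant frequencies, the sample rate, the epoch length — is a toy stand-in, not the study's data or architecture.

```python
# Toy sleep-stage classifier. The real model works on raw EEG/EMG recordings;
# here we classify synthetic epochs by their magnitude spectra. The dominant
# frequency assigned to each stage is illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
fs, n = 128, 256                                      # sample rate (Hz), samples per epoch
stage_freq = {"wake": 20.0, "nrem": 2.0, "rem": 7.0}  # dominant Hz (toy values)

def make_epoch(stage):
    """One noisy epoch dominated by the stage's characteristic frequency."""
    t = np.arange(n) / fs
    phase = rng.uniform(0, 2 * np.pi)
    return np.sin(2 * np.pi * stage_freq[stage] * t + phase) + 0.3 * rng.standard_normal(n)

y = rng.choice(list(stage_freq), size=300)
X = np.stack([np.abs(np.fft.rfft(make_epoch(s))) for s in y])
X = X / X.max()  # scale features for the network

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X[:200], y[:200])
print(f"held-out accuracy: {clf.score(X[200:], y[200:]):.2f}")
```

The study's point is stronger than this toy: with enough recordings, the network learns its own discriminative features from raw signals rather than being handed a spectrum.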
Natural Language Processing: Detecting sarcasm and irony
Taking on the challenge of detecting and predicting irony and sarcasm in natural language processing, researchers from universities in Austria and Japan developed a new approach.
The model used character-level vector representations of words. These were based on ELMo, which represents words through the complex characteristics of their use — syntax and semantics — and their different context-based uses.
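This is not ELMo itself — ELMo runs characters through a trained convolutional network and bidirectional LSTMs to produce context-dependent word vectors — but a small sketch can show one reason character-level representations help: they stay similar under misspellings and creative spellings, which are common on social media. The trigram-hashing scheme below is a hypothetical simplification for illustration.

```python
# Illustration of character-level word vectors (not ELMo). Hashing character
# trigrams into a fixed-size vector keeps misspelled variants of a word close
# together, where word-level lookups would treat them as unrelated tokens.
import zlib
import numpy as np

DIM = 256  # arbitrary vector size for this toy

def char_vector(word, dim=DIM):
    """Bag-of-character-trigrams vector built with feature hashing."""
    padded = f"#{word.lower()}#"  # mark word boundaries
    v = np.zeros(dim)
    for i in range(len(padded) - 2):
        v[zlib.crc32(padded[i:i + 3].encode()) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def cosine(a, b):
    return float(a @ b)

print(cosine(char_vector("sarcasm"), char_vector("sarcasmm")))  # misspelling
print(cosine(char_vector("sarcasm"), char_vector("weather")))   # unrelated word
```

ELMo goes much further, making the vector for a word depend on its sentence context, which is what lets a model pick up on the incongruity that signals sarcasm.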
The researchers tested their model on three Twitter datasets, two Reddit datasets, and two Sarcasm Corpus Online dialogues. They reported state-of-the-art sarcasm detection on six of these datasets and outperformed nearly all previous methods in precision. In their conclusion, the researchers suggest that manually annotated data may be necessary to improve performance further.
These studies could have profound implications. Some are more obvious and direct, like influencing proposed policy to make healthcare equitable and lessen the longevity gap. Others, like detecting sarcasm in natural language processing, push the technology a step further to become increasingly useful and impactful in people’s daily lives.