When this blog first launched years ago alongside ODSC to supplement our events, we never expected it to become such a go-to resource for the greater data science community. Our initial hope was for the site to be a little something extra for those interested in our events, but in recent years it has grown into a standalone platform. Over the past year, we posted hundreds of data science articles, blogs, tutorials, framework overviews, opinion pieces, and event recaps, all devoted to educating our community about what’s important in the field of AI. As 2022 comes to an end, we’d like to share some of the top blogs we posted this year, along with some articles from before 2022 that are still resonating with our audience.
Writing Python is far easier with an appropriate code editor or integrated development environment (IDE). Two of the most relevant options for Python today are PyCharm from JetBrains and VSCode (Visual Studio Code) from Microsoft. Considering how prevalent Python is, it makes sense that people come back to this article to make sure their first step into the world of Python is the right one for their needs.
Data visualization requires quality data just as much as any other project. Finding data visualization datasets can be frustrating, but these datasets offer excellent resources to support visualization projects of all kinds. Data visualization continues to be one of the most popular uses for data science and AI, so why bother making your own datasets when you can just use one of these?
Time series forecasting is a useful data science tool for helping people predict what will happen in the future based on historical, time-stamped information. Early this year, Google researchers explained how they developed and used the company’s Temporal Fusion Transformer (TFT) to achieve more progress with these types of predictions.
Web scraping, or web harvesting, requires a good tool to be done efficiently. It involves data crawling, content fetching, searching, and parsing, as well as reformatting the collected data so it is ready for analysis and presentation. It is important to use the right software and language for the job. While many people in our community go right to Python, web scraping goes beyond data science and AI, so there are plenty of other languages that may be better suited to different industries.
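The parsing and reformatting steps described above can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not the tooling recommended in the article: the `LinkCollector` class and the sample `page` markup are hypothetical, and the fetching step is skipped entirely.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and visible text of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.links.append({"url": self._href, "title": title})
            self._href = None


# Hypothetical page content; a real scraper would fetch this over HTTP.
page = """
<ul>
  <li><a href="/blog/a">First post</a></li>
  <li><a href="/blog/b">Second <b>post</b></a></li>
</ul>
"""

parser = LinkCollector()
parser.feed(page)
records = parser.links  # list of dicts, ready for a CSV writer or DataFrame
```

Each entry in `records` is a plain dictionary, so the reformatting step amounts to handing the list to whatever analysis tool comes next.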
Natural language processing (NLP) is one of the most practical AI fields today. This technology is the driving force behind chatbots, smart speakers, and spell-checkers, and it could go further. Many law firms have started to take note of NLP’s potential. The legal industry seems like a textbook use case for tools like NLP: it involves extensive hours of data-heavy, repetitive tasks with a slim margin for error. However, the complexity of legal work and the severe implications of mistakes make adopting NLP an intimidating prospect. This has been a big year for AI in law, government, and ethics, and NLP is likely to play a heavy role in these areas moving forward.
Machine Learning Operations (MLOps) is a very hot space within the already rapidly growing AI market. The MLOps market alone is expected to grow to almost $4 billion by 2025. Given the already crowded space for AI and MLOps startups, we took a look at some of the top MLOps startups earlier this year and asked one question of each: what problem does this startup solve?
This article discusses the Ethical AI Database project (EAIDB), which seeks to generate another fundamental shift, from awareness of the challenges to education about potential solutions. It does so by spotlighting a nascent and otherwise opaque ecosystem of ethical AI startups geared towards shifting the arc of AI innovation towards ethical best practices, transparency, and accountability. EAIDB, developed in partnership with the Ethical AI Governance Group, presents an in-depth market study of this burgeoning landscape of startups focused on responsible AI development, deployment, and governance. It also identifies five key subcategories of startups, then discusses key trends and dynamics among them.
Python is one of the most widely used programming languages in the world. Major tech companies like Google, Amazon, Meta, and Uber use Python for various applications. From web development to machine learning projects, Python is an essential tool in a data scientist’s kit. If you’re interested in programming or just starting with Python, continue reading to learn the essential Python libraries you should be aware of.
There’s no question that the world is becoming increasingly reliant on data and the criminal justice system is no exception. The justice system in the United States has used various data types and forms of data collection for years. For example, police departments, states, and the U.S. Department of Justice (DOJ) rely on data to generate and report statistics regarding a wide range of crimes.
This past summer, the major AI news story was about LaMDA, Google’s breakthrough conversation technology. It was beyond impressive, and its responses even appeared to express emotions in an eerily human-like way. Questions arose about its sentience, with a (now former) Google engineer, Blake Lemoine, speaking out about it and questioning Google’s intentions with the AI.
Past popular blogs
When we post an article, it’s not meant to be forgotten about quickly. Most of our content is evergreen and timeless, providing insights and walkthroughs of frameworks and developments that will be used for years to come. Here are a few blogs that were posted before 2022 that are still loved by the community.
Thanks to the Internet of Things, smart cities, e-health, autonomous machines, and other innovations, time series datasets are being produced in ever more massive quantities. Time series data can be used for econometrics, trend detection, pattern recognition, and prediction, and it is an essential ingredient in statistics, machine learning, and even deep learning models.
Learning time-series techniques will become increasingly important to any serious data scientist or machine learning engineer. Here are a few things to consider and some datasets to get you started.
A big challenge of working with data is manipulating its format for the analysis at hand. To make things a bit more difficult, the “proper format” can depend on what you are trying to analyze, meaning we have to know how to melt, pivot, and transpose our data.
In this article, we will discuss how to create a pivot table of aggregated data in order to make a stacked bar visualization of the 2019 airline market share for the top 10 destination cities. All the code for this analysis is available on GitHub here and can also be run using this Binder environment.
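As a rough idea of what that pivot step looks like, here is a minimal pandas sketch. It is not the article's code: the `flights` table, its column names, and the passenger counts are all made up for illustration.

```python
import pandas as pd

# Made-up passenger counts, for illustration only (not the 2019 data).
flights = pd.DataFrame(
    {
        "dest_city": ["Atlanta", "Atlanta", "Atlanta", "Chicago", "Chicago", "Denver"],
        "airline": ["Delta", "Delta", "Southwest", "United", "American", "United"],
        "passengers": [500, 300, 200, 400, 100, 250],
    }
)

# One row per destination city, one column per airline; duplicate
# (city, airline) pairs are summed and missing pairs become 0.
counts = flights.pivot_table(
    index="dest_city",
    columns="airline",
    values="passengers",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its total to get every airline's share within a city.
share = counts.div(counts.sum(axis=1), axis=0)
```

With matplotlib installed, `share.plot(kind="bar", stacked=True)` would draw one stacked bar per city from this wide-format table, which is why the pivot comes first.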
Why do we care if the data is skewed? If the response variable is skewed like in Kaggle’s House Prices Competition, the model will be trained on a much larger number of moderately priced homes, and will be less likely to successfully predict the price for the most expensive houses. The concept is the same as training a model on imbalanced categorical classes. If the values of a certain independent variable (feature) are skewed, depending on the model, skewness may violate model assumptions (e.g. logistic regression) or may impair the interpretation of feature importance.
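To make the idea concrete, the sketch below measures skewness before and after a log transform, one common remedy for a right-skewed target. The `skewness` helper and the price list are illustrative assumptions, not the Kaggle data or the article's own code.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Made-up sale prices with one very expensive house (illustrative only).
prices = [120_000, 135_000, 150_000, 160_000, 175_000, 190_000, 210_000, 755_000]

# log1p compresses the expensive tail, which reduces the skew
# (it does not necessarily remove it).
log_prices = [math.log1p(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
```

A model trained on the log-transformed target predicts log prices, so its predictions would need `math.expm1` applied to get back to dollars.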
So what is pruning in machine learning? Pruning is the process of removing weight connections in a network to increase inference speed and decrease model storage size. In general, neural networks are heavily overparameterized. Pruning a network can be thought of as removing unused parameters from the overparameterized network.
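One simple way to decide which connections count as "unused" is magnitude-based pruning, which zeroes the weights with the smallest absolute values. The NumPy sketch below assumes that approach; the `magnitude_prune` helper and the random `layer` matrix are hypothetical stand-ins, not the article's implementation.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed."""
    k = int(sparsity * weights.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value; entries at or below it are pruned.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask


rng = np.random.default_rng(0)
layer = rng.normal(size=(4, 5))       # stand-in for one layer's weight matrix
pruned = magnitude_prune(layer, 0.5)  # zero out half of the weights
```

Zeroing weights in a dense array does not by itself make inference faster or the file smaller; those gains come from storing and executing the result in a sparse format.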
Banking institutions need to use big data to remodel customer segmentation into a solution that works better for the industry and its customers. Basic customer segmentation generalizes customer wants and needs without addressing any of their pain points. Big data allows the banking industry to create individualized customer profiles that help decrease the pains and gaps between bankers and their clients. Big data analytics allows banks to examine large sets of data to find patterns in customer behavior and preferences.
Machine learning is exploding into the world of healthcare. When we talk about the ways ML will revolutionize certain fields, healthcare is always one of the top areas seeing huge strides, thanks to the processing and learning power of machines. There’s a good chance you either are or will soon be employed in the healthcare field. A while back, we wrote a list of 25 excellent open datasets for ML and included healthdata.gov and MIMIC Critical Care Database. Here are 15 more excellent datasets specifically for healthcare.
How to make next year’s list
There you have it, the top OpenDataScience blogs from 2022! How many of them did you read? Let us know what some of your favorites from the year are!
If you want to make next year’s list, then learn more about our guest contributor process here and send us an article to post! If you’re not ready to start writing, then you can start learning more in the meantime, possibly with an Ai+ Training subscription or an ODSC East 2023 ticket, which is currently 70% off.