Editor’s note: Jeff Tao is a speaker for ODSC West 2023 this Fall. Be sure to check out his talk, “What is a Time-series Database and Why do I Need One?” there!
Most data scientists are familiar with the concept of time series data and work with it often. The time series database (TSDB), however, is still an underutilized tool in the data science community. Although setting up a database to run your analyses may seem like an arduous task, modern open-source time series databases can provide significant benefits to any scientist running time series analysis on a large dataset – and with much less effort than you might imagine.
Typically, time series analysis is performed either on CSV files or data lakes. These may seem like simpler solutions than traditional databases because they can store essentially any type of data without needing a predefined schema. However, they make it harder to maintain the context of each data point – for example, the location of a data collector, the temperature at the time of collection, or a host of other elements that need to be preserved to ensure that your analysis is correct. Furthermore, the flexibility of data lakes in terms of how data is organized can have the undesired side effect of making that data difficult to query or filter.
A purpose-built time series database, on the other hand, can easily maintain this type of metadata in the form of tags or labels associated with each time series. Data cleansing and transformation also become easy tasks with a TSDB – for example, aligning the timestamps of multiple datasets can be quickly performed with interpolation or aggregation functions built into the database. And retrieving data is straightforward with a query language like SQL where you can filter by value, tag, time range, and more.
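To make the idea of tags and SQL filtering concrete, here is a minimal sketch in Python. It uses SQLite from the standard library purely as a stand-in, not a time series database, and the table name `readings`, its columns, the sample values, and the helper `per_minute_average` are all hypothetical. A TSDB would typically provide dedicated windowing and interpolation functions for the alignment step; here it is approximated by grouping readings into one-minute buckets.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a TSDB.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: each reading carries its context (sensor, location)
# alongside the timestamp and the measured value.
conn.execute(
    "CREATE TABLE readings ("
    "ts TEXT, sensor_id TEXT, location TEXT, temperature REAL)"
)

# Made-up sample data with irregular timestamps.
rows = [
    ("2023-10-01 00:00:05", "s1", "roof", 18.0),
    ("2023-10-01 00:00:40", "s1", "roof", 20.0),
    ("2023-10-01 00:01:10", "s1", "roof", 22.0),
    ("2023-10-01 00:00:20", "s2", "basement", 15.0),
    ("2023-10-01 00:01:30", "s2", "basement", 15.5),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)


def per_minute_average(conn, location, start, end):
    """Filter by tag and time range, then align readings to one-minute buckets."""
    query = """
        SELECT strftime('%Y-%m-%d %H:%M:00', ts) AS minute,
               AVG(temperature) AS avg_temperature
        FROM readings
        WHERE location = ? AND ts >= ? AND ts < ?
        GROUP BY minute
        ORDER BY minute
    """
    return conn.execute(query, (location, start, end)).fetchall()


roof = per_minute_average(
    conn, "roof", "2023-10-01 00:00:00", "2023-10-01 00:02:00"
)
for minute, avg_temperature in roof:
    print(minute, avg_temperature)
```

Because the location travels with every reading, the same query works for any sensor group by changing a single parameter, which is much harder to guarantee with a folder of loosely structured CSV files.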
TDengine is an example of a time series database that simplifies the process of analyzing large-scale time series data so that data scientists can spend more time on their science. It is built to ingest and store massive datasets with high performance and scalability, and with a little knowledge of SQL you can manage your data much more conveniently than you can with traditional CSV files. Most importantly, you can get started with TDengine in only 60 seconds, and its open-source edition is free to download and use.
A variety of time-series functions are included by default, such as cumulative sums, time-weighted averages, and moving averages, and you can also create user-defined functions (UDFs) in Python or C. Support for popular Python ecosystem projects like pandas and Jupyter ensures that you can get your data in and out easily, and integration with visualization tools like Grafana lets you chart your results and explore them for new insights.
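For readers less familiar with these functions, the following plain-Python sketch shows what each one computes. It is an illustration only, not TDengine's implementation: the function names and sample data are hypothetical, and the time-weighted average assumes linear interpolation between samples, which a given database may define differently.

```python
from itertools import accumulate


def cumulative_sum(values):
    """Running total of the series."""
    return list(accumulate(values))


def moving_average(values, window):
    """Mean of each run of `window` consecutive samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def time_weighted_average(timestamps, values):
    """Area under the linearly interpolated series divided by elapsed time.

    Assumes timestamps are numeric (for example, seconds) and strictly increasing.
    """
    if len(timestamps) != len(values) or len(values) < 2:
        raise ValueError("need at least two samples of equal length")
    area = 0.0
    for i in range(1, len(values)):
        dt = timestamps[i] - timestamps[i - 1]
        area += dt * (values[i] + values[i - 1]) / 2
    return area / (timestamps[-1] - timestamps[0])


# Made-up, irregularly spaced samples (timestamps in seconds).
timestamps = [0, 10, 30, 40]
values = [2.0, 4.0, 4.0, 10.0]

print(cumulative_sum(values))
print(moving_average(values, 2))
print(time_weighted_average(timestamps, values))
print(sum(values) / len(values))  # plain mean, for comparison
```

With irregularly spaced samples, the time-weighted average differs from the plain mean because each value counts in proportion to how long it was in effect, which is why it is often the better summary for sensor data.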
If you would like to learn more about time series databases and how they can help you analyze time series data more efficiently, I encourage you to attend my upcoming session “What Is a Time Series Database and Why Do I Need One?” at ODSC West 2023. The session will include sample code and a demonstration, after which I will be happy to answer any questions that you may have on the topic.
About the Author:
Jeff Tao is the founder and CEO of TDengine. He has a background as a technologist and serial entrepreneur, having previously conducted research and development on mobile Internet at Motorola and 3Com and established two successful tech startups. Foreseeing the explosive growth, now under way, of time-series data generated by machines and sensors, he founded TDengine in May 2017 to develop a high-performance time-series database purpose-built for modern IoT and IIoT businesses.