Test-Driving Apache Superset
DataOpsDS @ ScalePlatformsTools & LanguagesApache|Data analysis|Open Data Scienceposted by SmartCat Data Science Team March 6, 2018 SmartCat Data Science Team
The Curious Case of Algo-Trading Dashboard
For one of our recent internal projects, we needed a quick and easy way to showcase some first insights, do some plotting and interactive storytelling with the data. We also wanted to build a live, working dashboard in front of a (future) product – to serve as a demo. The project in question was Bitcoin Trade signals and you should soon be able to read more about it in a separate blog post. For the purpose of this test-drive I took just one component of the whole project – Trading dashboard, and on it I’ll point to some cool features of Apache Superset and also to some peculiarities we encountered.
Superset is Airbnb’s data exploration platform: modern, intuitive, open sourced, built on top of Flask and SQLAlchemy and therefore covering many possible data sources. One can also read about it as an enterprise-ready and scalable BI solution, but testing that particular claim is for another story and different use-case. Our algo-trading project is small-scale, going through several steps:
- collecting data hourly or daily from various sources into a simple SQLite database,
- doing some predictive analytics (daily and/or weekly), and finally
- using predictions as signals for algorithmic trading of, what else these days than – Bitcoins.
To schedule data collection and training of models, we use another Airbnb’s tool – Apache Airflow (one could argue that we are somewhat biased toward Airbnb-powered open-source tools). In the final setup, architecture consists of 3 components: dockerized models/trading scripts, dockerized Airflow version 1.9.0, with custom config and DAGs and dockerized Superset version 0.22.1. Existing docker images for Superset lack full automation for admin user creation and database initialization, so we changed Dockerfile and entrypoint scripts a bit. This setup is flexible enough so feel free to use it in other cases that you find fit. The scaffold of this setup that we abstracted away from the project could be found on github.
If you follow the instructions from the repo, in just a few steps you should have containers running with Airflow and Superset GUI on ports specified in docker-compose.yml. The application flow goes like in the picture below:
- Airflow DAG from its own container starts some python scripts in the trader container,
- SQLite DB is mounted to the trader container and populated when new data comes, existing strategies are backtested or new trades are done,
- DB is also mounted to and accessed from inside the Superset container and your Dashboard is updated, live.
Simple and pretty neat – I could imagine making many similar demos, in different domains, running on this same setup.
From DBs to Tables to Slices
When you have your database in place and imported in Superset, you can start playing around. Superset comes with powerful SQL Lab to make all sorts of queries on your newly imported DB, but in order to actually do that you’ll need to edit DB settings and tick the Expose in SQL Laboption and/or Allow DML option. Next, you’ll expect to be ready to make parts of dashboards called slices, which could be charts, tables, static parts, iframes, etc. However, in order to make slices for dashboard(s) you’ll need to add specific tables on which you could work. This was the first unintuitive situation with Superset – the fact that slices worked on single tables only.
Adding the tables is easy though, and modifying settings for their columns is recommended! This way you obtain fine control of which columns you want to be groupable, filterable, which aggregations are allowed on which columns, etc. Slices are just a step away, but first let’s dig a bit deeper into this unintuitive part that I mentioned.
What if we want to visualize on some JOIN between tables? Is that supported? Well, this has actually been asked a couple of times here-and-there and is currently the first item in the FAQ. The provided answer and possible solutions are not that informative, so here are 2 straightforward solutions:
- Use table edit mode to add additional metrics.
In table edit mode you could set additional composite metrics from your existing columns but also construct them by JOIN-ing other tables’ columns with a custom SQL query.
- Create a VIEW in SQL Lab and visualize on it.
If you followed closely, you’ll remember that in the lab you could do all sorts of queries (even DML ones) so some JOINs are supported there too. When you click on visualize on the JOIN result, View gets automatically saved in the list of tables as if it were a regular table in DB. One possible downside is that exported/pickled Dashboards with this type of VIEWs could become non-portable – I have had this issue but this is not a confirmed bug.
(Some specific) Slices
There is a fair amount of possibilities between types of charts, tables, static and interactive blocks and even maps with custom data that could be made with Superset. I won’t go through all of them here – just the ones that turned out challenging for Trader dashboard:
Charting challenge 1: Visualize Backtesting results of (many) algo-trading strategies
In algorithmic trading, backtesting is a technique of replaying historical situations in order to test an algorithm. We did this with pyalgotrade in an event-driven fashion for many parameters’ combinations and saved the results in a database (every backtesting run creates one table). Soon after the number of backtesting runs passed over a dozen (and later even a hundred), with 150-1000 or more records each, I knew my data model was not optimal for comparing them on charts. So I joined the tables! No, not in SQL Lab, but by changing the data model. This was necessary since simply joining the tables once wouldn’t resolve the problem when we do more and more backtesting runs. The resulting table was of course larger, but easy to query and do GROUP BY on specific backtesting run. I also used Superset’s Filter Box widget to provide some nice controls and the resulting UX was quite pleasing. Upon changing the selection every slice was updated with corresponding FILTER-ed data.
Result: After the data model change, making the imagined charts and desired interactivity in Superset was very easy and intuitive!
Charting challenge 2: Summarize improvement in return by one trading strategy over some other
Apart from parameter optimization for a specific trading strategy, I also wanted to compare backtesting results between different strategies over the same periods of time and plot the trendline of difference in return. In order to do this, I needed summary statistics over some FILTER table views matched between strategies. Some very complicated queries emerged and I finally decided to make a new table (not a SQL Lab view) with these statistics separately calculated.
Result: Not much of a fault of Superset, but this simple chart turned out to be very complicated. A newly derived table solved the problem so in this battle it was Pandas vs. SQL (1:0) and Superset was able to produce one very lightweight trendline with a simple query behind it.
Charting challenge 3: Candlestick chart
This one I knew I couldn’t make properly with current Superset’s tools – since candlestick charts represent time series of financial asset prices that are heavily annotated. Candles are bars made from Open/Close/High/Low market prices for each day (Figure below). Superset 0.22.1, a stable version that we tested initially, didn’t have any annotation capabilities. But the next minor version, 0.23.0rc4, had promising annotation types and we wanted to test that on this case even before a stable release. The annotation types available are: Time Series, Event, Region and Formula, but unfortunately they only annotate along the series line or whole rectangular areas on the charts – no bars on specific events, single points or similar. Until more complex annotations become available in Superset, one option to make very customized graphs could be making a separate page, alongside Superset, as Flask Blueprint and then, for plotting a candlestick one can use matplotlib’s candlestick functions.
Superset is a very nice and promising BI tool. It is still incubating and will definitely get much better in the future. Since the focus of this testing was on tooling, I would conclude that making charts is really intuitive, with a major lacking feature right now being annotations and custom, complex charts. Other issues I stumbled upon turned out to be solvable by adjusting the data model or thinking ahead when designing it.
As for next steps – we will definitely continue working in Superset and, as mentioned, we are planning to run it in cluster mode, querying some giant databases with millions of rows, load balancers and all. This test-drive went quite well, let’s see how it behaves out on the open road.
Blog post by Dusan Randjelovic, Data Scientist at SmartCat
Dusan is a Python and R developer and once-upon-a-time Ruby fan. Worked on web and mobile, but also on streaming-data IoT projects: as an entrepreneur, developer and sometimes IT architect. Astrophysicist turned Bioinformatics Analyst turned Data Science generalist. Enjoys how domain-specific problem trigger his curiosity and problem-solving drive.
SmartCat Data Science Team
We solve problems in data-intensive environments. We are a carefully selected team of industry seniors with highly set goals. We are here to deliver outstanding data management, visualization and analysis service and remarkable technical solutions that will empower decision makers around the globe. This is not our job. These are our lives. Knowledge is our thirst and dedication is our lifestyle. Our area of expertise is Data Science, Data Engineering and DevOps and more precisely combination of those three services together so we can build smart solutions with our clients which are production ready.
Data Validation at Scale – Detecting and Responding to Data Misbehavior
Europe 2023Modelingposted by ODSC Community Jun 2, 2023
AI Girlfriends and Other Ridiculous Examples of Using Generative AI
Generative AIposted by ODSC Team Jun 2, 2023
Paralyzed Man Walks Again Thanks To AI-Powered Tool
AI and Data Science Newsposted by ODSC Team Jun 2, 2023