Firing on All Cylinders: The 2017 Big Data Landscape, part 2
Posted by Matt Turck, October 8, 2017
A walk through the 2017 Data Ecosystem Landscape
Many themes from last year have continued to play out, such as the ever-increasing importance of streaming, where Spark reigns supreme for now while interesting contenders such as Flink emerge. In addition, a few interesting themes have kept coming back in conversations:
It’s now official, SQL is back
After playing second fiddle to NoSQL technologies for the last decade, SQL database technologies are now back. Google recently released a cloud version of its Spanner database. Both Spanner and CockroachDB (an open source database inspired by Spanner) offer the promise of a survivable, strongly consistent, scale-out SQL database. Amazon released Athena, a giant SQL engine that queries data directly where it sits in S3 buckets, an approach also taken by products such as Snowflake. Google BigQuery, SparkSQL and Presto are gaining adoption in the enterprise – all SQL products.
An interesting trend related to public cloud adoption is the rapid rise of data virtualization. Whereas older ETL processes required moving massive amounts of data (and often creating duplicates of data sets) and creating data warehouses, data virtualization enables companies to run analytics on data while leaving it where it is located, increasing speed and agility. Many of the next generation analytics vendors now offer both data virtualization and data preparation, while enabling their customers to access data stored in the cloud.
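To make the contrast with ETL concrete, here is a minimal sketch of the idea, assuming two small SQLite files stand in for separate source systems. The table names, columns and figures are invented for illustration, and real data virtualization products work across far more heterogeneous sources. The query layer holds no data of its own; it attaches each source and joins across them in place.

```python
import os
import sqlite3
import tempfile


def build_source(path, ddl, insert_sql, rows):
    """Create a small stand-alone SQLite file that plays the role of one source system."""
    conn = sqlite3.connect(path)
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    conn.close()


def revenue_by_region(crm_path, billing_path):
    """Join two separate databases in place, without copying rows into a warehouse."""
    conn = sqlite3.connect(":memory:")  # the "virtual" layer stores nothing itself
    conn.execute("ATTACH DATABASE ? AS crm", (crm_path,))
    conn.execute("ATTACH DATABASE ? AS billing", (billing_path,))
    rows = conn.execute(
        """
        SELECT c.region, SUM(i.amount) AS revenue
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()
    conn.close()
    return rows


def demo():
    """Build two hypothetical sources in a temp directory and query across them."""
    with tempfile.TemporaryDirectory() as tmp:
        crm_path = os.path.join(tmp, "crm.db")
        billing_path = os.path.join(tmp, "billing.db")
        build_source(
            crm_path,
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)",
            "INSERT INTO customers VALUES (?, ?)",
            [(1, "EMEA"), (2, "US"), (3, "US")],
        )
        build_source(
            billing_path,
            "CREATE TABLE invoices (customer_id INTEGER, amount REAL)",
            "INSERT INTO invoices VALUES (?, ?)",
            [(1, 100.0), (2, 250.0), (3, 50.0), (1, 25.0)],
        )
        return revenue_by_region(crm_path, billing_path)


if __name__ == "__main__":
    for region, revenue in demo():
        print(region, revenue)
```

The design point is that the join runs where the data already lives; an ETL pipeline would first copy both tables into a third store and query the copy.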
Data Governance & Security
As Big Data in the enterprise matures, and the variety and volume of data keep increasing, themes like data governance become increasingly important. Many companies have chosen a “data lake” approach, which involves creating a central repository where all data gets dumped. A data lake is of no use unless people know what is in it, and can access the right data to run analytics. However, enabling users to easily find what they need while managing permissions is tricky. Beyond data lakes, a central theme of governance is to provide easy access to trustworthy data to anyone who needs it in the enterprise, in a secure and auditable way. Vendors large and small (Informatica, Collibra, Alation) provide data catalogs, reference data management, data dictionaries and data help desks.
Are data scientists an endangered species?
Barely a few years ago, data science was dubbed the “Sexiest profession of the 21st century”. And “data scientist” is still ranked #1 in Glassdoor’s list of “Best Jobs in America”.
But the profession, just a few years after it appeared, may now be under siege. Part of it is due to necessity – while schools and programs are cranking out legions of new data scientists, there are just not enough of them around, particularly in Fortune 1000 companies that arguably have a harder time recruiting top technical talent. In some organizations, data science departments are evolving from enablers to bottlenecks.
In parallel, the democratization of AI and the proliferation of self-serve tools are making it easier for data engineers with limited data science skills, or even non-technical data analysts, to perform some of the basic functions that were until recently the territory of data scientists. Chunks of Big Data work in the enterprise, especially the grunt work, may be increasingly handled by data engineers and data analysts with the help of automated tools, rather than actual data scientists with deep technical skills.
That is, unless data science ends up being handled entirely by machines. Some startups are explicitly positioning their offering as “automating data science” – most notably, DataRobot just raised $54M with that goal in mind (How Data Science is Automating Itself), and Salesforce Einstein claims that it too can generate models automatically.
Not surprisingly, those trends are unpopular and controversial within the data science community. However, data scientists probably don’t have much to fear just yet. For the foreseeable future, self-service tools and automated model selection will “augment” rather than disintermediate data scientists, freeing them to focus on tasks that require judgment, creativity, social skills or vertical industry knowledge.
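For readers wondering what “automated model selection” means in practice, here is a toy sketch using only the Python standard library. It is not how DataRobot or Salesforce Einstein work; the three candidate models and all function names are assumptions made up for illustration. The idea is simply to fit several candidates, score each on held-out data with cross-validation, and keep the one with the lowest error.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the average of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbor: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical candidate pool; a real tool would search a far larger space.
CANDIDATES = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}


def cv_error(fit, xs, ys, folds=5):
    """Mean squared error over k folds, holding out every k-th point in turn."""
    total = 0.0
    for k in range(folds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds != k]
        held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % folds == k]
        model = fit([x for x, _ in train], [y for _, y in train])
        total += sum((model(x) - y) ** 2 for x, y in held_out)
    return total / len(xs)


def select_model(xs, ys):
    """Score every candidate and return the best name plus all scores."""
    scores = {name: cv_error(fit, xs, ys) for name, fit in CANDIDATES.items()}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    xs = [float(i) for i in range(20)]
    ys = [2.0 * x + 1.0 for x in xs]
    best, scores = select_model(xs, ys)
    print(best, scores)
```

Even in this toy, the parts a machine cannot pick for itself are visible: which candidates to include, which error metric matters, and whether the data reflects the business problem at all.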
Making everything work together: The rise of the data workbench
In most large enterprises, the adoption of Big Data started with a few isolated projects (a Hadoop cluster here, some analytical tool there) and some new job titles (data scientists, Chief Data Officers).
Fast forward to today: heterogeneity has grown, with a variety of tools used across the enterprise. Organizationally, the centralized “data science department” is giving way to a more decentralized organization in large companies, with cross-functional groups of data scientists, data engineers and data analysts, increasingly embedded in different business units. As a result, the need has become very clear for platforms to make everything and everyone work together – as mentioned in our post last year, success in Big Data is based on creating an assembly line of technologies, people and processes.
As a result, a whole category of collaborative platforms is now accelerating rapidly, pioneering a field that some call DataOps (in relation to DevOps). This is the thesis behind FirstMark’s investment in Dataiku (see my previous post, Dataiku or the Early Maturity of Big Data). Other notable financings in the space include Knime ($20M Series A) and Domino Data Lab ($10M Series A). Cloudera just released a workbench product based on its earlier acquisition of Sense. Open source activity in this segment is also strong, including for example Jupyter and Anaconda.
AI-powered vertical applications
We’ve been talking about the rise of vertical AI applications for a couple of years at least (x.ai and the Emergence of the AI-powered Application), but what started as a trickle has now morphed into a torrent. Everyone suddenly seems to be building AI applications – both new startups and later-stage startups betting on AI for their next growth spurt (InsideSales, for example).
As tends to be the case in such circumstances, there’s a combination of genuinely exciting new startups and technologies, and some smoke and mirrors, as many companies furiously rebrand to chase the latest buzzword. Not every company that uses some machine learning somewhere is an AI company.
On the whole, building an AI startup is tricky. Picking a vertical problem is certainly an important start. Beyond a deeply technical DNA, it requires some thoughtful positioning and tactics (Building an AI Startup: Realities and Tactics).
However, it’s hard not to be fascinated by the possibilities, and impressed by the velocity.
In the last year in particular, the clear trend has been to take any data problem, and apply AI to it. This has been the case across both enterprise applications and industry verticals. To reflect that reality, this year we’ve added a number of categories in the Applications section of the landscape, including Transportation, Real Estate (Modernizing Real Estate with Data Science), and Insurance. We also split in two the categories with particularly strong activity, such as Marketing applications (now B2B and B2C) and Life Sciences (now Healthcare and Life Sciences).
Beyond the areas that still feel somewhat futuristic (like self-driving cars), AI today shines in more pedestrian enterprise categories, delivering tangible results in anything from churn prediction to back office automation to security.
Losing human jobs to AI may not even be on the new US administration’s radar screen, but no profession can avoid thinking about how it may be, at a minimum, “augmented” by AI. This includes some of the most established white-collar professions such as doctors (A.I. vs M.D.) or lawyers (A.I. Is Doing Legal Work).
The finance world, in particular, seems to have been thinking about AI a lot. Hedge funds, after a couple of tough years, are on the hunt for alternative data, often to feed it to algorithms (The New Gold Rush? Wall Street Wants your Data). New AI powered hedge funds (Numerai, Data Capital Management, etc), while early, are gaining traction. Some of the most prominent firms on Wall Street are increasingly using AI over human employees (BlackRock, Goldman Sachs).
Love them or hate them, 2016 was the year of bots – fully automated, real time conversational agents that live mostly on messaging services. In their short existence, bots seem to have gone through several hype cycles already, from early promise, to the Tay disaster, to a mini-renaissance, to Facebook’s scaled-back efforts following reports of 70% failure rates for AI bots running on its Messenger platform.
There are many reasons why the excitement over bots may have been premature – see Bradford Cross’ excellent take here, where he rightly points out that people may have derived over-optimistic signals from the rise of bots in Asia or the rapid growth of underlying infrastructure such as Slack. Ultimately, we believe that bots have tremendous potential but, as always, the space needs a lot more time. A significant expectation adjustment needs to occur on both the “producer” side (startups need to focus on very narrow business problems and promise less) and the “consumer” side (we all need to get used to what bots can and cannot do, which Alexa is singlehandedly training us to do!).
For now, the brightest future probably belongs to services that include significant elements of humans in the loop, or actually position away from bots entirely and use AI to augment the capabilities of human agents (the thesis behind our investment in frame.ai).
With the killer combination of Big Data and AI, we’re heading towards the “harvesting” part of the cycle. Beyond all the hype, the possibilities are enormous.
As core infrastructure continues to mature, and the application side, powered by AI, is bursting with activity, in 2017 the Big Data (and AI) ecosystem is firing on all cylinders.
1) This year more than ever, we couldn’t possibly fit all companies we wanted on the chart. While the general philosophy of the chart is to be as inclusive as possible, we ended up having to be somewhat selective. Our methodology is certainly imperfect, but in a nutshell, here are the main criteria:
- Everything else being equal, we gave priority to companies that have reached some level of market significance. This is a reasonably easy exercise for large tech companies. For growing startups, considering the limited amounts of data available, we often used venture capital financings as a proxy for underlying market traction (again, probably imperfect). So everything else being equal, we tend to feature startups that have raised larger amounts, typically Series A and beyond.
- Occasionally, we made editorial decisions to include earlier stage startups when we thought they were particularly interesting.
- On the application front, we gave priority to companies that explicitly leverage Big Data, machine learning and AI as a key component or differentiator of their offering. As discussed in the piece, it is a tricky exercise at a time when companies are increasingly crafting their marketing around an AI message, but we did our best.
- This year as in previous years, we removed a number of companies. One key reason for removal is that the company was acquired, and not run by the acquirer as an independent company. In some select cases, we left the acquired company as is in the chart when we felt that the brand would be preserved as a reasonably separate offering from that of the acquiring company.
2) As always, it is inevitable that we inadvertently missed some great companies in the process of putting this chart together. Did we miss yours? Feel free to add thoughts and suggestions in the comments.
3) The chart is in PNG format, which should preserve overall quality when zooming, etc.
4) As we get a lot of requests every year: feel free to use the chart in books, conferences, presentations, etc – two obvious asks: (i) do not alter/edit the chart and (ii) please provide clear attribution (Matt Turck, Jim Hao and FirstMark Capital).
5) Disclaimer: I’m an investor through FirstMark in a number of companies mentioned on this Big Data Landscape, specifically: ActionIQ, Cockroach Labs, Dataiku, Frame.ai, Helium, HyperScience, Kinsa, Sense360 and x.ai. Other FirstMark portfolio companies mentioned on this chart include Bluecore, Engagio, HowGood, Payoff, Knewton, Insikt, Optimus Ride, and Tubular. I’m a very small personal shareholder in Datadog.
6) List of acquired companies since the last version of the Big Data landscape:
Target / Acquirer / Amount (if disclosed)
2017 YTD (5)
- Mobileye / Intel / $15.3B
- AppDynamics / Cisco / $3.7B
- Nimble Storage / HPE / $1.1B
- Kaggle / Google
- Dextro / Taser
- Qlik / Thoma Bravo / $3B
- Cruise Automation / General Motors / $1B
- Apigee / Google / $625M
- OPower / Oracle / $532M
- Tapad / Telenor / $360M
- Nervana Systems / Intel / $350M
- SwiftKey / Microsoft / $250M
- Withings / Nokia / $191M
- Circulate / Acxiom (LiveRamp) / $140M
- Altiscale / SAP / $125M
- Viv Labs / Samsung / $100M
- Connectifier / LinkedIn / $100M
- Recombine / Cooper / $85M
- MetaMind / Salesforce / $32.8M
- Livefyre / Adobe
- TempoIQ / Avant
- DataHero / Cloudability
- Sense / Cloudera
- io / GE
- ai / Google
- EagleEye Analytics / Guidewire
- Attensity / inContact
- RJMetrics / Magento Commerce
- Placemeter / Netgear
- Kimono Labs / Palantir
- Tute Genomics / PierianDx
- Statwing / Qualtrics
- PredictionIO / Salesforce
- Roambi / SAP
- Visually / Scribble Technologies
- Preact / Spotify
- Nuevora / Sutherland Global Services
- Geometric Intelligence / Uber
- Platfora / Workday
- Driven / Xplenty
- Gild / Citadel