…and then make better use of the data you get.
The usefulness of data science begins long before you collect the first data point. It can be used to describe very clearly your questions and your assumptions, and to analyze in a consistent manner what they imply. This is neither a simple exercise nor an academic one: informal approaches are notoriously bad at handling the interplay of complex probabilities, yet even the a priori knowledge embedded in personal experience and publicly available research, when properly organized and queried, can answer many questions that mass quantities of data, processed carelessly, wouldn’t be able to, as well as suggest what measurements should be attempted first, and what for.
The larger the gap between the complexity of a system and the existing data capture and analysis infrastructure, the more important it is to set up initial data-free (which doesn’t mean knowledge-free) formal models as a temporary bridge between both. Toy models are a good way to begin this approach; as the British statistician George E.P. Box wrote, all models are wrong, but some are useful (at least for a while, we might add, but that’s as much as we can ask of any tool).
Let’s say you’re evaluating an idea for a new network-like service for specialized peer-to-peer consulting that will have the possibility of monetizing a certain percentage of the interactions between users. You will, of course, capture all of the relevant information once the network is running — and there’s no substitute for real data — but that doesn’t mean you have to wait until then to start thinking about it as a data scientist, which in this context means probabilistically.
Note that the following numbers are wrong: it takes research, experience, and time to figure out useful guesses. What matters for the purposes of this post is describing the process, oversimplified as it will be.
You don’t know a priori how large the network will be after, say, one year, but you can look at other competitors, the size of the relevant market, and so on, and guess, not a number (“our network in one year will have a hundred thousand users”), but the relative likelihood of different values.