Most readers of this blog already know that training machine learning algorithms requires huge amounts of annotated data. But a recent survey of data scientists we commissioned with Dimensional Research gave us a concrete definition of “huge”. A full 72% of the nearly 300 survey respondents reported that, in their current project, production-level model confidence will require more than 100,000 labeled data items. And 10% indicated they’d need more than 10 million! That’s a lot of bounding boxes and polygons!
Model confidence refers to how well a machine learning algorithm performs on unseen data:
“Confidence Intervals for Machine Learning. Much of machine learning involves estimating the performance of a machine learning algorithm on unseen data. ... They can be used to add a bounds or likelihood on a population parameter, such as a mean, estimated from a sample of independent observations from the population.”
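To make the quoted idea concrete, here is a minimal sketch of one common way to put bounds on a model's accuracy measured on a held-out set: the normal-approximation interval for a proportion. The function name, the example counts, and the choice of a 95% level are our own illustrative assumptions, not something taken from the survey.

```python
import math

def accuracy_confidence_interval(correct, total, z=1.96):
    """Normal-approximation interval for accuracy on held-out data.

    `correct` and `total` are hypothetical counts from an evaluation set;
    z=1.96 corresponds to roughly a 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    accuracy = correct / total
    # Standard error of a proportion estimated from `total` independent items.
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / total)
    # Clamp to [0, 1], since accuracy cannot fall outside that range.
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)

# Same observed accuracy (88%), two different evaluation-set sizes.
small = accuracy_confidence_interval(correct=880, total=1_000)
large = accuracy_confidence_interval(correct=88_000, total=100_000)
print(f"1,000 items:   [{small[0]:.4f}, {small[1]:.4f}]")
print(f"100,000 items: [{large[0]:.4f}, {large[1]:.4f}]")
```

With the same observed accuracy, the interval narrows as the number of evaluated items grows, roughly in proportion to the square root of the sample size. This approximation is less reliable for small samples or accuracies very close to 0 or 1.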
Volume and Scale
Labeling data at these volumes will overwhelm most data science teams. Our first meeting with a team invariably comes after they’ve tried, without success, to get control of their training data preparation. They are running out of budget and time, and their data volume and accuracy are nowhere near what they need.
But when the required volumes reach the millions and tens of millions of data items, even throwing more people at the labeling task is not enough. At some point, human judgment on its own is too slow and too expensive. Machine judgment in the form of an ML-driven platform becomes mandatory.
Beyond this finding, our survey revealed a great deal more about the realities of machine learning projects in the enterprise. We’ve summarized the research and invite you to download your own copy.