ODSC – the Open Data Science Conference – hosts AI practitioners from across the Northeast, offering technical sessions on topics ranging from data wrangling to machine learning to predictive analytics.
What we learned:
This space is maturing
We chatted with reps across industries including healthcare, public safety, and insurance who are all finding innovative ways to use ML. Their issues are familiar to us:
- Not enough labeled data
- Bias or errors in the data
- Their data scientists’ time is being absorbed by the tedious work of annotation
What’s new, however, is where these companies are in the project lifecycle. Many of the folks we talked with have successfully cleared the POC stage, which is a big win. But shortly after the champagne was popped, the next, larger hurdle presented itself: the long road to model confidence.
The thing is, POCs don’t require that much data. In fact, many projects pass the POC stage with off-the-shelf training data solutions. But scaling a model and achieving enough accuracy to bring it to market requires exponentially more training data.
The questions we fielded & the concerns behind them
Another indication of this space’s maturation is the quality of the questions we fielded. They revealed concerns about quality, speed, cost, and security, as well as a strong preference for working with vendors who offer more than just a workforce.
- Are you just people? BPOs – business process outsourcing firms – see new opportunities in the growing demand for training data labeling and annotation. But we get asked this question because data scientists have figured out that human judgment, while critical, is far from all that’s required to prepare high-quality training data at scale.
- Do you have your own technology platform? This is a corollary to the first question: so you’ve got people – what else? Do you have a secure technology platform through which the work can meet the people? Do you have mechanisms in place for automating quality? Can you track worker performance and audit everything that is done to the data? Do you have project managers to design the tasks? All of these are essential if you want to completely offload your data labeling.
- Does your platform include tools for labeling computer vision/NLP data? ML disciplines like computer vision and NLP encompass an enormous variety of use cases, many of which require highly specialized labeling and annotation tooling. Ever better, more precise tools are instrumental to increasing speed, accuracy, and efficiency. Numerous companies have come to us frustrated after spending precious resources building their own tools in house, only to find they can’t cut the mustard.
- Does your platform itself use AI? This question again shows prescience about the superhuman amounts of labeled data that will be needed to feed ML algorithms in order to achieve ROI-level accuracy, and better. In the ongoing pursuit of improved confidence, the volume of training data required grows geometrically. There quickly comes a point where human judgment alone isn’t practical or affordable, at which point a training data platform has to be able to teach itself.
AI practitioners are on the cutting edge of this technology, and they want to work with their peers, i.e., a vendor who shares their mission to create technology capable of learning.
To learn more about what it takes to prepare your own ML training data, check out our Blueprint.