In Defense of Humans - the Fact vs. Fiction of Labeling Automation


Picture this - you’re sitting around a large conference table or clustered with other faces on a crowded Zoom call. It’s an exploratory meeting between a data labeling company and a potential customer, so the conversation is centered on high-level expectations and needs. What does this opening conversation typically look like?

In our experience, these conversations often involve some myth-debunking about the degree to which labeling data can be automated.

It makes sense to want to automate the labeling process as much as possible. Using human judgements seems expensive and reducing costs for a multi-stage model development project is highly desirable. However, in our experience, an overly automated labeling process can result in inference and localization errors that are far more expensive and inefficient to correct after the fact.

In this piece, we hope to address some common myths and questions about labeling automation:

  • Why can’t your ML platform automatically label my data?
  • Can’t you use labeled data from a similar use case to automate my data labeling?
  • What role do humans play in the labeling process?
  • Isn’t automation more efficient?
  • What types of automation do exist?
  • Are there industries/use cases where significant automation is possible?

Alegion operates at the intersection of machine and human intelligence - we are one of the few companies both building the tools for annotation and developing and managing a skilled workforce. In this piece, we hope to show you how human involvement in your data labeling ensures that you hit the sweet spot of quality, efficiency, cost-reduction, and scalability.


Why Is Automation Difficult? Because Data Quantity and Quality Matter

Automation works best for basic pre-labeling tasks, for very specific industries with simple annotation needs, and for companies that operate at a huge scale with vast datasets. However, you lose quality when you turn labeling over to an automated, scaled solution, and this creates inefficiencies down the line. There is an accuracy range you want to hit when scaling your training data, and it takes human involvement to keep things in that range.

For most industries and companies, focusing on data quality will prove more efficient and cost-effective in the long run than focusing on automating the labeling process.


Why Humans Are Needed for Complex Annotation

Labeling data is a lot more complicated than just sticking bounding boxes around objects of interest, and bad inferences are more expensive than no inferences.

Most of our customers need complex annotations. This typically involves multiple entity types, each requiring multiple classifications. Let's look at human monitoring as an example, a use case we see across many different industries. Examples include driver monitoring, retail security, hospital patient monitoring, premise security - the list goes on and on. These tasks involve annotating a video and localizing and categorizing the people and activity within.

A typical task may include localizing the person with a bounding box, keypoints for facial features, as well as making many subclassifications on each entity. It's common to have scene classifications that apply to the environment. Let's take driver monitoring as an example. In the example below...

- bbox-person
- keypoint-left-eye
- keypoint-right-eye
- wearing-glasses=false
- in-shadow=false
- face-occluded=false
- alertness=4
Scene classifications
- light-change=false
- video-quality=3
- blink event=[758-764]

In order to fully pre-label this task, we would need a separate ML model for each localization and classification.

  • Localizations (2) - object detection (person), object detection (eye keypoints)
  • Classifications (6) - wearing glasses, face in shadow, face occlusion, alertness, light change, video quality
  • Scene detection (1) - blink event
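
To make the bookkeeping concrete, the example annotation and the model inventory above can be written down as data. This is a minimal Python sketch, not Alegion's actual schema: the class name, the field names, the placeholder coordinates, and the reading of the blink event as a frame range are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DriverAnnotation:
    """Hypothetical record mirroring the driver monitoring example."""
    bbox_person: tuple          # (x_min, y_min, x_max, y_max) in pixels
    keypoint_left_eye: tuple    # (x, y) in pixels
    keypoint_right_eye: tuple   # (x, y) in pixels
    wearing_glasses: bool = False
    in_shadow: bool = False
    face_occluded: bool = False
    alertness: int = 4
    light_change: bool = False  # scene-level classification
    video_quality: int = 3      # scene-level classification
    # Assumed to be (start, end) frame ranges.
    blink_events: list = field(default_factory=list)


# One model per localization, classification, and scene detection,
# following the inventory listed in the text.
REQUIRED_MODELS = {
    "localization": ["person detector", "eye keypoint detector"],
    "classification": [
        "wearing glasses", "face in shadow", "face occlusion",
        "alertness", "light change", "video quality",
    ],
    "scene detection": ["blink event"],
}


def count_models(models):
    """Total number of separate models needed to fully pre-label the task."""
    return sum(len(names) for names in models.values())


# Coordinates below are placeholders, not values from the source example.
example = DriverAnnotation(
    bbox_person=(0, 0, 100, 200),
    keypoint_left_eye=(40, 50),
    keypoint_right_eye=(60, 50),
    blink_events=[(758, 764)],
)
print(count_models(REQUIRED_MODELS))  # prints 9
```

Even for this single, fairly ordinary task, the inventory comes to nine models, each of which would need its own monitoring and correction.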

Some of these, such as the person and eye detectors, may be straightforward. It's possible to use an existing model, or to train on completed annotations, and get the initial inferences to an acceptable level, but that still leaves many other models to be managed. Each model may perform at a different level, and that performance has to be observed, evaluated, and corrected.

Evaluating the output (and performance) of the models requires human involvement and judgement. If a model performs poorly, the annotator needs to resolve noisy or poor labels.


Building Ground Truth Datasets Requires a Higher Degree of Accuracy

We are in the business of building ground truth datasets. The accuracy required is almost always much higher than what would be acceptable from an inference in production. Our customers require very high localization accuracy, and everything is measured. Our quality standards start at these (very high) defaults and can go much higher depending on the customer needs:

  • 0.8 IOU if localization guidelines and object edges are clear (distinct, visible boundaries of a solid, physical object, e.g. no liquids/gas)
  • 0.65 IOU if localization guidelines are subjective or ROI target has ambiguous edges (e.g. bbox around a shoulder, blurry objects)
  • 0.65 IOU if localization area is < 100 pixels
  • Keypoint: 8 pixel tolerance - 90% accuracy 
  • Polyline: RMSE of 10 - 90% accuracy 

Localization accuracy at this level is very difficult to achieve unless you have trained on the dataset at hand. What works well in lieu of pre-labeling is to utilize ML-powered, classless segmentation selection aids. Algorithms such as DEXTR and its derivatives can provide very accurate segmentations and localizations that can be touched up by the annotator.

Question #1

Why can’t your ML platform automatically label my data?

This is a common question we get, and it makes sense. Many customers want to simply upload a dataset and let the platform get to work!

Here’s the thing - our team excels at providing you with the annotation platform and/or quality, labeled data you need to train a model, but we don’t have a model or algorithm that exactly fits your use case already built into our platform. If we did, we would simply sell you the model or algorithm. Our goal is to get you the ground truth data you need to build a powerful, production-ready model.

Question #2

Can’t you use labeled data from a similar use case to automate my data labeling?

There’s a simple answer to this good question: very few use cases are similar enough for this to work. Even small differences in annotation guidelines or camera perspectives can hinder model performance and require significant clean-up. 

There are some use cases where automated labeling makes a lot of sense. Self-driving cars, medical imagery, and package handling are several examples where annotation requirements can be similar enough to make labeled data useful across projects and models.

However, if you have an existing model that is already working, we can absolutely work with it to pre-label and automate some aspects of producing new sets of labeled data for your use case.

Question #3

What role do humans play in the labeling process?

At Alegion, we choose not to compromise on quality, so there is human involvement and direction at every stage of the labeling process. For an in-depth look at our process for working with customers to produce high quality data, check out our full-length piece on this topic here.

Question #4

Isn’t automation more efficient?

The desire for automation is really just a desire for efficiency. However, even the best algorithm will fail to accurately identify and label edge cases, and those edge cases are the difference between a model that succeeds in production and a model that fails.

Human judgement and human intervention in the data labeling process can prevent those demoralizing inefficiencies down the road.

Question #5

What types of automation do exist?

For simple localizations, it's common to use an object detection algorithm to pre-label data. If you are counting drinks on a shelf in a retail setting, dropping a keypoint on every can or bottle doesn't require a high degree of accuracy. For anything that requires more precision - think a polygon localization of a scalpel in a surgical setting - we've found (through extensive testing) that it usually takes longer to correct a localization error from automatic labeling than for a human annotator to create a new annotation. 

Object tracking is often used in video use cases where an object needs to be localized over time. We use various "classless" algorithms that don't require training on specific objects and that work on a variety of use cases. Again, these algorithms do a good job of rough localization but usually require some correction.

Ingesting customer-provided pre-labels from existing detectors and classifiers can be a reliable way to lower annotation effort, assuming those models are producing good inferences.
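
One way to picture how ingested pre-labels fit into a human workflow is a simple triage step. The sketch below is an assumption-laden illustration rather than a description of any real platform: the `PreLabel` class, the `triage` function, and the 0.9 confidence cutoff are all hypothetical. Confident pre-labels go to an annotator for review and touch-up, while the rest are annotated from scratch, since correcting a poor localization can take longer than creating a new one.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    """Hypothetical pre-label produced by an existing detector or classifier."""
    item_id: str
    label: str
    confidence: float  # model score in [0, 1]


def triage(pre_labels, min_confidence=0.9):
    """Split pre-labels into (review by a human, annotate from scratch).

    Both groups still reach a human annotator; the cutoff only decides
    whether the annotator starts from the model's output or a blank canvas.
    """
    to_review, to_annotate = [], []
    for pre_label in pre_labels:
        if pre_label.confidence >= min_confidence:
            to_review.append(pre_label)
        else:
            to_annotate.append(pre_label)
    return to_review, to_annotate


# Hypothetical batch of two pre-labels.
batch = [
    PreLabel("frame-001", "person", 0.97),
    PreLabel("frame-002", "person", 0.41),
]
review, fresh = triage(batch)
print([p.item_id for p in review], [p.item_id for p in fresh])
```

The right cutoff depends on the project's quality targets and would normally be tuned by comparing reviewed pre-labels against ground truth.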

Automatic instance segmentation (e.g. DEXTR) can provide high-quality polygons and masks when there are well-defined, clear boundaries between objects and the background.

Question #6

Are there industries/use cases where significant automation is possible?

The type and complexity of annotation work today is much different than it was several years ago - the easy problems are already solved. For industries where the annotation needs are fairly homogeneous, high-quality pre-labeling models can automate much of the labeling process. But if the customer is not in one of these few, very specific industries, like healthcare or autonomous vehicles, it's difficult to use ML for the majority of the labeling process.


Industry Titans Are Talking about Data Quality, Not Automation

We’re not alone in our emphasis on data quality. Increasingly, others in the industry are encouraging a collective focus on data-centric AI. Improving the quality of data used to power models is the key to bridging the gap between proof of concept and production success.

A team of researchers at Google recently released a paper titled “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. Their abstract highlights the problem beautifully:

Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorised aspect of AI. In this paper...We define, identify, and present empirical evidence on Data Cascades—compounding events causing negative, downstream effects from data issues—triggered by conventional AI/ML practices that undervalue data quality. Data cascades are pervasive (92% prevalence), invisible, delayed, but often avoidable.

Andrew Ng, Founder and CEO of LandingAI, recently declared “Data is the food for AI.” Ng is advocating for the shift from model-centric to data-centric AI, and has announced a competition that asks AI practitioners to improve datasets for a fixed model.

Tesla, recognizing its need for high-quality training data, has begun to grow its own in-house team of hundreds of human annotators. As Elon Musk put it in a Twitter thread from September 2020, “The machine that makes the machine is vastly harder [to make] than the machine itself.”

AI teams and companies need to focus on acquiring the quality data they need to build high performing models, and then scaling that data pipeline without losing quality. This requires keeping human intelligence at the forefront of your labeling project. 

Re-focusing on data quality does not mean automation conversations cannot or should not continue. In a future piece, we will explore how we see the opportunities for increased automation unfolding, including an examination of things like multi-task models.



The wise data science team understands the necessity of human involvement in order to maximize the power and potential of machine learning platforms and algorithms. While some niche use cases can succeed with a high degree of automation, the vast majority of ML projects need human involvement at every stage in order to keep data quality high and minimize inefficiencies due to automation error.

The capabilities of machine learning (ML) and artificial intelligence (AI) platforms are constantly evolving, and we spend a lot of time planning, building, and iterating toward a more automated future. However, this forward-looking approach can obscure what is fact and what is fiction when it comes to automation and the role it plays in model development today.

We hope this explainer has helped you understand better both the limitations and the extraordinary power of machine learning platforms. 

Do you want to work with data labeling pros who specialize in optimizing both human and machine intelligence? Contact us today for a consultation at solutions@alegion.com.
