
Methods of Video Annotation

If a picture says a thousand words, a video says a million. Computer vision models that train on annotated videos acquire greater context than those that learn from images alone. The rich content preserved in video offers both a menagerie of objects for models to identify and the opportunity to discover connections among those objects. The trick is how to capture it. When an organization applies video annotation appropriately, it can shorten its time to production and differentiate itself from competitors more quickly and efficiently.

Video annotation is a complex challenge because of how much information a video can contain. At 24 frames per second, the scale is massive! That’s why many default to the simplistic Single Image Method rather than the more complex Continuous Frame Method for annotating video.

Read Full Paper

Tackling Video Annotation 

While video content can supply incredibly rich information for a model to learn from, the process of annotating these datasets reflects that same complexity. Programming in relationships demands a clear judgment framework, use-case-specific workflows, and complex architecture. There are two ways of thinking about processing a video, and they lead to very different results. The first is the Single Image Method, which processes video as a collection of single images. The second, the Continuous Frame Method, treats videos as four-dimensional, with three-dimensional entities moving through time, and annotates them as such.


The Single Image Method

The Single Image Method reduces the video to a series of hundreds or thousands of image frames, then annotates them image by image, frame by frame. At first blush this makes sense, but in practice it creates tremendous inefficiency. Labeling frame by frame takes longer and costs more, even when the sequence and order of frames are maintained. It also risks lower quality: an object or transitory state mislabeled in one frame can corrupt the labeling of a state change that follows from it.
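A minimal sketch of the duplication problem, assuming a simple per-frame annotation structure (the field names and values here are hypothetical, not any particular platform's format):

```python
# Hypothetical per-frame annotations produced by the Single Image Method.
# The same physical car is labeled independently in every frame, so nothing
# ties the "car" in frame 1 to the "car" in frame 2: identity is lost and
# labeling effort is duplicated for an object that never changed.
frame_annotations = {
    1: [{"label": "car", "bbox": (40, 60, 120, 140)}],
    2: [{"label": "car", "bbox": (44, 60, 124, 140)}],  # same car, labeled again
    3: [{"label": "car", "bbox": (48, 61, 128, 141)}],  # and again
}

# Cost grows as one labeling operation per object per frame.
total_labels = sum(len(objects) for objects in frame_annotations.values())
print(total_labels)  # 3 labels for what is really 1 object
```

At 24 frames per second, a single object visible for one minute would require 1,440 such labels under this scheme.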


The Continuous Frame Method

The Continuous Frame Method annotates a video as a stream of frames, preserving the continuity and integrity of the flow of information it captures. Annotating video this way maintains persistence: an object classified as a given instance keeps that identity across the entire video.


Key Advantages

One of the key advantages of video over single images is the ability to represent the transitory state of each instance of an object or person. This allows the same entity to be recognized even if it goes in and out of view. Reducing a video down to a series of images, on the other hand, creates a duplication of effort when labeling and identifying objects that remain constant. 

Example

For example, imagine a person, labeled as “Person 1,” enters the view on frame 37 and remains in the picture for 100 frames. On frame 138 they are no longer visible, but they return to view in frame 330. Without the context of the previous frames, this could be labeled as a new person, when in reality it should once again be referenced as “Person 1.”
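The scenario above can be sketched as a track store, where each entity keeps the frame intervals in which it is visible. This is an illustrative data structure, not any specific platform's API:

```python
# Hypothetical track store for the Continuous Frame Method. Each entity maps
# to a list of (start_frame, end_frame) visibility intervals, so an object
# that leaves and re-enters the view remains the same instance.
tracks = {
    "Person 1": [(37, 137)],  # visible from frame 37 through frame 137
}

def annotate(entity_id, start_frame, end_frame, tracks):
    """Extend an existing entity with a new visibility interval, or create one."""
    tracks.setdefault(entity_id, []).append((start_frame, end_frame))

# The person leaves the view at frame 138 and returns at frame 330.
# Instead of minting a new "Person 2", the annotator reuses "Person 1".
annotate("Person 1", 330, 400, tracks)

print(len(tracks))        # still 1 entity
print(tracks["Person 1"])  # [(37, 137), (330, 400)]
```

Under the Single Image Method, by contrast, nothing links the detection at frame 330 back to the earlier track, so the returning person would likely be counted as a new entity.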

While this method is much more technically challenging on the data labeling provider’s end, the payoff for the customer is substantial. The Continuous Frame Method not only saves time and resources, it also solves for three often overlooked aspects: entity persistence, detecting state change, and temporal tagging. Learn about these three aspects in our latest article on How To Tackle Video Annotation.

Conclusion

Annotated videos hold the future for training computer vision models. Only an enterprise-grade labeling platform that employs the Continuous Frame Method can scale video annotation to this complexity without compromising accuracy. With it, your company can speed up time to value, build differentiated features, and mitigate the risk of poorly performing models that result from low-quality data.
