In this piece, one of our in-house research scientists, Cameron Wolfe, examines how the industry evaluates the quality of video-based annotations for computer vision applications and suggests a better pathway forward.
Object detection is now one of the most common applications in computer vision, especially as vanilla classification problems have become easier to solve with modern deep learning architectures. As object detection has become more deeply studied, parallel research has arisen on the problem of performing object detection in video. This task is a hybrid of several commonly studied computer vision problems (i.e., tracking, object detection, and classification), in which objects must be correctly localized, classified, and associated across the relevant frames of a video (i.e., the same object must carry the same unique identifier in every frame in which it appears).
See the complete white paper: Better Quantifying the Performance of Object Detection in Video.
Video object detection has several unique challenges that differentiate it from object detection within static images. In this post, I will not survey the current methodologies for video object detection (see here for a more comprehensive overview). Rather, I will discuss one of the main problems currently hindering consistent technical progress in the space: the lack of a comprehensive metric for evaluating the performance of object detectors in video. Without such a standardized evaluation metric, comparing different methodologies becomes difficult, leaving researchers unsure whether their contributions are truly valuable.
As a solution to this issue, I argue that higher order tracking accuracy (HOTA), a standard metric within the object tracking community, is the correct metric for video object detection, so long as it is made classification-aware. I will begin the post by motivating the need for a better video object detection metric. Then, I will introduce a few relevant concepts and definitions related to object detection, followed by a discussion of the metrics that currently exist within the space. I will conclude the post with a discussion of the HOTA metric, emphasizing its suitability as an evaluation metric for video object detection.