
Better Quantifying the Performance of Object Detection in Video



In this piece, one of our in-house research scientists, Cameron Wolfe, examines how the industry evaluates the quality of video-based annotations for computer vision applications and suggests a better pathway forward.

Object detection is now one of the most common applications in computer vision, especially as vanilla classification problems have become easier to solve with modern deep learning architectures. As object detection has become more deeply studied, parallel research has arisen on the problem of performing object detection in video. This problem is a hybrid of several commonly studied computer vision problems (i.e., tracking, object detection, and classification), in which objects must be correctly localized, classified, and associated (i.e., the same object is identified as such in different video frames by a unique identifier) across the relevant frames of a video.

Video object detection has several unique challenges that differentiate it from object detection within static images. In this post, I will not survey the current methodologies for video object detection (see here for a more comprehensive overview of this). Rather, I will discuss one of the main problems within the space that is currently hindering consistent technical progress: the lack of a comprehensive metric that can evaluate the performance of object detectors in video. Without such a standardized evaluation metric, comparison between different methodologies becomes difficult, leading researchers to question whether their contributions are truly valuable.

As a solution to this issue, I argue that higher order tracking accuracy (HOTA), a standardized metric within the object tracking community, is the correct metric for video object detection, so long as it is made classification-aware. I will begin the post by providing motivation for developing a better video object detection metric. Then, I will introduce a few relevant concepts and definitions related to object detection, followed by a discussion of current metrics that exist within the space. I will conclude the post with a discussion of the HOTA metric, emphasizing its suitability as an evaluation metric for video object detection.

Why do we need a better metric?

Put simply, the current state of metrics for video object detection is very poor. In particular, the common approach of researchers in this area is to evaluate their models separately on each frame of a video, then average the performance across all frames. The temporal aspect of video object detection performance is completely disregarded. Yet, in comparison to object detection in static images, the temporal aspect of video object detection is the key component that makes the problem difficult and interesting. 

Therefore, the inability of the most widely-used video object detection metrics to capture such temporal behavior is completely inexcusable. Without a common metric that better quantifies all aspects of video object detection performance, the community will not be able to make consistent, forward progress. To solve this issue, a unified metric (i.e., one metric, instead of a group of several metrics) must be standardized and adopted. Ideally, a proper metric for video object detection should provide a single, interpretable score (possibly decomposable into multiple, more specific sub-scores) that captures all relevant aspects of video object detection performance, including detection, localization, association, and classification.

Relevant Concepts

In this section, I will provide a brief discussion of a few useful concepts that are important for understanding object detection evaluation. In particular, I will describe the Hungarian algorithm (i.e., the most commonly-used algorithm in object detection for associating predicted objects with ground truth objects) and the mean average precision metric (i.e., the most widely-used evaluation metric for object detection in static images).

Hungarian Algorithm

Within any form of object detection, we are given a set of ground truth objects and a set of predicted objects during evaluation. However, it is not immediately clear which of these predictions corresponds to which of the ground truths. Therefore, some work must be done to determine the best mapping between ground truth and predicted objects before the quality of predictions can be evaluated. More specifically, a one-to-one mapping between predicted and ground truth objects must be produced. It is also possible for predicted or ground truth objects to have no match (i.e., in such a case the mapping is not technically one-to-one, as there exist unpaired elements).

The Hungarian algorithm is a combinatorial optimization algorithm that solves the assignment problem (i.e., finding an optimal one-to-one matching between two sets given pairwise scores or costs) in polynomial time. Within the object detection domain, it is commonly used to produce the mapping between predicted and ground truth objects, typically based on pairwise intersection over union (IoU) scores between objects. Additionally, some extra requirements are typically imposed for objects to be matched with each other within the Hungarian algorithm (e.g., the IoU of two bounding boxes must be greater than 0.5 for the pair to be considered a viable match). The Hungarian algorithm is desirable for such an application because it returns an optimal matching efficiently.
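As a rough illustration of the matching step described above, the sketch below pairs ground truth and predicted boxes by IoU with a 0.5 gate. To stay dependency-free it uses a brute-force search over assignments rather than the Hungarian algorithm itself, so it is only practical for a handful of boxes; in practice one would likely reach for an actual Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`. The box format `(x1, y1, x2, y2)` and the function names `iou` and `match_boxes` are assumptions made for this example.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_boxes(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Return sorted (gt_index, pred_index) pairs that maximize total IoU.

    Pairs below the IoU threshold contribute nothing and are dropped, so
    some ground truth or predicted boxes may stay unmatched.
    Brute force (not the Hungarian algorithm): exponential in box count.
    """
    gated = [
        [s if s >= iou_threshold else 0.0 for s in (iou(g, p) for p in pred_boxes)]
        for g in gt_boxes
    ]
    n_gt, n_pred = len(gt_boxes), len(pred_boxes)
    if n_gt <= n_pred:
        # Assign every ground truth box to a distinct prediction.
        candidates = (list(enumerate(perm)) for perm in permutations(range(n_pred), n_gt))
    else:
        # Assign every prediction to a distinct ground truth box.
        candidates = ([(g, p) for p, g in enumerate(perm)]
                      for perm in permutations(range(n_gt), n_pred))
    best_pairs, best_total = [], -1.0
    for pairs in candidates:
        total = sum(gated[g][p] for g, p in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return sorted((g, p) for g, p in best_pairs if gated[g][p] > 0.0)


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches = match_boxes(gt, pred)  # the third prediction stays unmatched
```

Unmatched predictions from such a matching become false positives, and unmatched ground truth boxes become false negatives.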

Mean Average Precision

As previously stated, mean average precision (mAP) is the standard metric for evaluating object detectors in static images. In order to compute mAP, one must first compute the number of true positives, false positives, and false negatives within an image (true negatives are not defined for detection, since there is no fixed set of "background" objects to count). After running the Hungarian algorithm, computing these counts is quite simple (i.e., we just check if detections are missing a pair, pairs are wrong, etc.). From here, precision and recall can be computed at different “confidence” levels to determine the average precision (AP). Such a process is repeated separately for every semantic class in a given problem domain. Then, the mean of the average precisions is computed across all semantic classes, forming the mean average precision (mAP) metric.
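The AP and mAP computation described above can be sketched in a few lines of Python. This is a simplified illustration and not any benchmark's official evaluation code: it assumes each detection has already been labeled as a true or false positive by the matching step, and it uses all-point interpolation of the precision-recall curve, which is only one of several conventions in use. The function names and input layout are assumptions for this example.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    detections: list of (confidence, is_true_positive) tuples.
    num_ground_truth: number of ground truth objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: dict mapping class name -> (detections, num_ground_truth)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


per_class = {
    "car": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "person": ([(0.6, True)], 1),
}
car_ap = average_precision(*per_class["car"])
map_score = mean_average_precision(per_class)
```

Note that nothing in this computation refers to frame order or object identity, which is why averaging it across frames says nothing about association.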

For a more in-depth description of mAP, I recommend reading this article. However, understanding that (1) mAP is the go-to metric for evaluating object detection performance in static images and (2) mAP is computed by looking at prediction metrics within a single image should be sufficient for understanding the remainder of this post.

What metrics do we have already?

Before I introduce the metric that is (in my opinion) most appropriate for evaluating object detectors in video, it is worthwhile to consider some other metrics that could be used. I will divide this discussion into two categories based on whether metrics were used for object detection or object tracking and provide a brief, high-level description of each metric, aiming to demonstrate its shortfalls in the context of evaluating video object detection. First, recall that measuring the performance of object detection in video has four major components: detection, localization, association, and classification. As will be seen in this section, object tracking does not consider the classification component of evaluation, but classification-aware metrics for object tracking have the potential to capture all relevant components of video object detection performance.

Object Detection Metrics

As previously mentioned, the most common metric for evaluating object detection in video (at the time of writing) is mAP. In particular, mAP is computed separately for each frame of a video, and the mAP scores are averaged across all video frames to form a final performance metric. Such an approach completely fails to capture the temporal aspect of video object detection (i.e., it has no notion of association) and is, therefore, insufficient as an evaluation protocol. Nonetheless, mAP is currently the go-to metric for object detection in video, and discussions of changing or modifying this metric are seemingly minimal.

One interesting variant of mAP that incorporates temporal information is speed-based mAP (I made this name up for the purposes of this blog post; the associated reference does not provide a specific name). To compute this metric, objects are first divided into three groups (i.e., slow, medium, and fast) based on how fast they move between frames (computed using the motion IoU of adjacent frames). Then, mAP is computed separately for objects in each of these three groups, and the three mAP scores are presented separately. Although speed-based mAP provides a more granular view of object detection performance based on objects’ speed in the video, it provides three metrics instead of one and still does not capture the key temporal aspect of video object detection; again, mAP contains no notion of association. Therefore, speed-based mAP is not a suitable metric for video object detection.

One final metric that has been proposed for video object detection is Average Delay (AD). AD captures the delay between an object entering a video and being “picked up” by the detector, which is referred to as “algorithmic latency”. This delay is measured in frames, such that an AD of 2 means that, on average, an object will exist in the video for two full frames before it is actually detected by the model. Although AD captures temporal information within video object detection, it is proposed as a metric to be used in conjunction with mAP. Therefore, it is not a standalone metric that can be used to comprehensively evaluate the performance of object detectors in video.
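To make the frame-counting intuition concrete, here is a deliberately simplified sketch of a delay computation. It is not the full AD definition from the associated paper: it only averages, per object, the number of frames that pass before the first detection. Treating a never-detected object as having a delay equal to its full track length is an assumption of this sketch, as are the function name and the input layout.

```python
def average_delay(tracks):
    """Mean number of frames an object is present before first being detected.

    tracks: dict mapping object id -> list of booleans, one per frame in
    which the object is present (in temporal order); True means the
    detector found the object in that frame.
    """
    delays = []
    for detected_flags in tracks.values():
        # Index of the first detected frame; if the object is never
        # detected, fall back to the track length (assumption).
        delay = next(
            (i for i, hit in enumerate(detected_flags) if hit),
            len(detected_flags),
        )
        delays.append(delay)
    return sum(delays) / len(delays) if delays else 0.0


tracks = {
    "a": [False, False, True, True],  # detected after two full frames
    "b": [True, True],                # detected immediately
    "c": [False, True],               # detected after one frame
}
ad = average_delay(tracks)
```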

Object Tracking Metrics

Although no comprehensive metrics have yet been proposed for video object detection, many useful metrics exist within the object tracking community from which inspiration can be drawn. Object tracking requires that objects (either one or multiple objects within each video frame) be identified, localized, and associated in a consistent manner throughout a video. Although object tracking contains no notion of classification, sufficient overlap exists between object tracking and video object detection to warrant a more in-depth examination of current evaluation metrics in object tracking.

Multiple Object Tracking Accuracy (MOTA) is one of the most widely-used metrics in object tracking. MOTA matches ground truth to predicted objects per-detection, meaning that each predicted and ground truth detection is treated as a separate entity during evaluation. At a high level, MOTA (based on a matching provided by the Hungarian algorithm) determines the number of identity switches (i.e., the same object is assigned a different identifier in adjacent video frames), false positives, and false negative detections across all video frames. Then, MOTA is computed by normalizing the aggregate sum of these components by the total number of ground truth objects in the video, as outlined in the equation below.

The MOTA Tracking Metric
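Written out, the standard MOTA definition is as follows. The symbol names (FN for false negatives, FP for false positives, IDSW for identity switches, and gtDet for the set of all ground truth detections in the video) are an assumption about notation, chosen to agree with the surrounding text.

```latex
\[
\mathrm{MOTA} \;=\; 1 \;-\; \frac{|\mathrm{FN}| + |\mathrm{FP}| + |\mathrm{IDSW}|}{|\mathrm{gtDet}|}
\]
```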

MOTA captures association performance through identity switches (i.e., denoted as IDSW above), while detection performance is captured through false positives and negatives. However, MOTA does not consider localization. Rather, localization must be measured by a separate metric, multiple object tracking precision (MOTP), which averages localization scores across all detections within a video. Despite being a long-time, standardized metric for object tracking, MOTA has several shortcomings that limit its applicability to video object detection. Namely, it overemphasizes detection performance (i.e., the impact of identity switches on the above score is minimal), does not consider association beyond adjacent frames, does not consider localization, must be paired with other scores instead of serving as a unified metric, is highly dependent on frame rate, and provides a score that is unbounded below (i.e., MOTA takes values in (-∞, 1]), which may be difficult to interpret. As such, MOTA is not sufficient as a metric for video object detection.
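A small numeric sketch makes two of these shortcomings visible. It assumes the standard definition MOTA = 1 - (FN + FP + IDSW) / gtDet computed from aggregate counts; the function and argument names are illustrative only.

```python
def mota(num_fn, num_fp, num_idsw, num_gt_detections):
    """MOTA from aggregate counts over a whole video."""
    if num_gt_detections == 0:
        raise ValueError("MOTA is undefined without ground truth detections")
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt_detections


# A reasonable tracker: a few errors relative to 10 ground truth detections.
good = mota(num_fn=2, num_fp=1, num_idsw=1, num_gt_detections=10)

# Many false positives push the score below zero (no lower bound).
noisy = mota(num_fn=5, num_fp=20, num_idsw=0, num_gt_detections=10)

# One identity switch costs exactly as much as one missed detection,
# however long the affected track is.
one_switch = mota(num_fn=0, num_fp=0, num_idsw=1, num_gt_detections=100)
one_miss = mota(num_fn=1, num_fp=0, num_idsw=0, num_gt_detections=100)
```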

Other widely used metrics within the object tracking community are IDF1 and track-mAP (also known as 3D-IoU), which match ground truth to predictions on a trajectory level (i.e., trajectories are defined as sequences of predicted or ground truth objects throughout video frames that share the same unique identifier). IDF1 and track-mAP are not as widely used as MOTA, so I will not provide an in-depth discussion of these metrics (see here for a more comprehensive discussion and comparison between metrics). However, IDF1 and track-mAP both have numerous limitations that impede them from being adopted as standard metrics for video object detection. Namely, IDF1 overemphasizes association performance, does not consider localization, and ignores all detection and association performance within trajectories that are left unmatched. Similarly, track-mAP requires each trajectory prediction to contain a confidence score, requires a trajectory distance metric to be defined by the user, and can be easily “gamed” (i.e., simple counterexamples can be constructed that perform poorly but achieve a high track-mAP score).

Towards a Comprehensive Metric

Despite the limitations of metrics outlined in the previous section, there is a new standard metric within the object tracking community as of mid-2020: higher order tracking accuracy (HOTA). In this section, I will introduce the HOTA metric and explain why I believe it is an appropriate metric for evaluating video object detectors. Although HOTA does not consider classification performance, its classification-aware counterpart, CA-HOTA, effectively captures all relevant aspects of video object detection performance within a single evaluation metric.


The goal of HOTA is to (1) provide a single score that captures all relevant aspects of tracking performance, (2) enable long-term association performance to be measured (i.e., association beyond two adjacent frames), and (3) decompose into different sub-metrics that capture more detailed aspects of tracking performance. HOTA, which aims to mitigate issues with previous tracking metrics, provides a single score within the range [0, 1] that captures all relevant aspects of tracking performance (i.e., detection, association, and localization) in a balanced manner. Additionally, this single, interpretable score can be decomposed into sub-metrics that characterize different aspects of tracking performance on a more granular level. The benefits of HOTA in comparison to other widely-used tracking metrics are summarized by the following figure.

HOTA in Comparison to Other Tracking Metrics [7]

Although HOTA does not consider classification performance, variants have been proposed that incorporate classification into the HOTA score (e.g., CA-HOTA). CA-HOTA captures all aspects of performance for video object detection (i.e., association, localization, detection, and classification) within a single, interpretable metric. As such, CA-HOTA can be considered a (relatively) comprehensive metric for video object detection.

What is HOTA?

An Illustration of the HOTA Metric [7]

Similar to MOTA, HOTA matches predictions and ground truth objects at a detection level (i.e., as opposed to a trajectory level like in IDF1 or track-mAP). Within HOTA, two categories of metrics are computed from the ground truth and predicted object matches (again, produced by the Hungarian algorithm): detection components and association components. Detection components are simply true positives, false negatives, and false positives, which have been discussed previously. Association components, which are somewhat different, include true positive associations (TPA), false negative associations (FNA), and false positive associations (FPA).

Consider an object with a valid match in a particular frame (i.e., a true positive detection), which we denote as c. Within this true positive detection, both the predicted and ground truth objects must be assigned a unique identifier. To compute the number of TPAs, one simply finds the number of true positives in other frames that share the same ground truth and predicted identifier as c and repeats this process for every possible c within the video (i.e., every true positive detection). FNAs and FPAs are defined in a similar manner, but one must find detections in other frames that have the same ground truth identifier and a different predicted identifier or the same predicted identifier and a different ground truth identifier, respectively. Essentially, TPAs, FPAs, and FNAs allow association performance to be measured across all frames within a video, instead of only between adjacent frames. Hence, the name “higher order” tracking accuracy.
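The counting procedure above can be sketched as follows. This sketch follows the set-style convention of the HOTA paper, in which TPA(c) also counts c itself and unmatched detections (false negatives and false positives) that share an identifier with c count toward FNA(c) and FPA(c); treat that convention, the function name, and the input layout as assumptions of this example.

```python
from collections import Counter


def association_counts(tp_matches, fn_gt_ids=(), fp_pred_ids=()):
    """Return (TPA, FNA, FPA) counts for every true positive detection c.

    tp_matches: list of (gt_id, pred_id), one entry per true positive
        detection across all frames of the video.
    fn_gt_ids: ground truth ids of missed (unmatched) ground truth detections.
    fp_pred_ids: predicted ids of unmatched predicted detections.
    """
    pair_count = Counter(tp_matches)
    gt_count = Counter(g for g, _ in tp_matches) + Counter(fn_gt_ids)
    pred_count = Counter(p for _, p in tp_matches) + Counter(fp_pred_ids)
    counts = []
    for g, p in tp_matches:
        tpa = pair_count[(g, p)]    # same gt id and same predicted id
        fna = gt_count[g] - tpa     # same gt id, other or no predicted id
        fpa = pred_count[p] - tpa   # same predicted id, other or no gt id
        counts.append((tpa, fna, fpa))
    return counts


# One ground truth object "A" seen in four frames; the tracker switches
# its predicted identifier from 1 to 2 halfway through.
tp_matches = [("A", 1), ("A", 1), ("A", 2), ("A", 2)]
counts = association_counts(tp_matches)
assoc_scores = [tpa / (tpa + fna + fpa) for tpa, fna, fpa in counts]
```

Because every true positive is compared against all other frames, a single identity switch in the middle of a track lowers the association score of every detection on that track, not just the two frames adjacent to the switch.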

Given a certain localization threshold (i.e., the threshold determines which pairs are similar enough to be matched, so it changes the matches produced by the Hungarian algorithm), we can compute all detection and association components of HOTA. Then, the HOTA score at a given localization threshold (which we denote as alpha) can be computed as follows.

HOTA Metric at a Single Localization Threshold
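In symbols, following the definition given in the HOTA paper [7], the score at threshold alpha can be written as below, where TP, FN, and FP are the sets of true positive, false negative, and false positive detections at that threshold and A(c) is the association score of a true positive c. The exact typesetting is an assumption.

```latex
\[
\mathrm{HOTA}_{\alpha} \;=\; \sqrt{\frac{\sum_{c \in \mathrm{TP}} \mathcal{A}(c)}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}},
\qquad
\mathcal{A}(c) \;=\; \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|}
\]
```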

Once HOTA has been computed at specific localization thresholds, the aggregate HOTA score can be derived by averaging the per-threshold scores across several different localization thresholds. Typically, a sample of localization thresholds is taken in the range [0, 1], as shown in the equation below.

HOTA Metric
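Written out, the final score is the integral of the per-threshold score over the threshold range, which the HOTA paper [7] approximates by averaging over 19 evenly spaced thresholds from 0.05 to 0.95. The typesetting below is an assumption.

```latex
\[
\mathrm{HOTA} \;=\; \int_{0}^{1} \mathrm{HOTA}_{\alpha}\, d\alpha
\;\approx\; \frac{1}{19} \sum_{\alpha \in \{0.05,\, 0.10,\, \ldots,\, 0.95\}} \mathrm{HOTA}_{\alpha}
\]
```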

As revealed by the equations above, HOTA captures detection (through detection components), association (through association components), and localization (through localization thresholds) performance. Furthermore, HOTA can be decomposed into numerous sub-metrics. Each of these sub-metrics, outlined in the figure below, can be used to analyze a more specific aspect of tracker performance (e.g., localization, detection, or association performance in isolation).

Sub-Metrics within HOTA
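As one concrete view of this decomposition, the per-threshold HOTA score factors into a detection accuracy term (DetA) and an association accuracy term (AssA), with HOTA at a threshold equal to the square root of their product. The sketch below assumes the per-detection association scores A(c) and the false negative and false positive counts have already been computed at one localization threshold; function and key names are illustrative.

```python
import math


def hota_at_threshold(assoc_scores, num_fn, num_fp):
    """DetA, AssA, and HOTA at a single localization threshold.

    assoc_scores: association score A(c) for each true positive detection.
    num_fn, num_fp: false negative / false positive detection counts.
    """
    num_tp = len(assoc_scores)
    denom = num_tp + num_fn + num_fp
    if denom == 0:
        return {"DetA": 0.0, "AssA": 0.0, "HOTA": 0.0}
    det_a = num_tp / denom
    ass_a = sum(assoc_scores) / num_tp if num_tp else 0.0
    # sqrt(DetA * AssA) equals sqrt(sum of A(c) / (TP + FN + FP)).
    return {"DetA": det_a, "AssA": ass_a, "HOTA": math.sqrt(det_a * ass_a)}


def hota(per_threshold_scores):
    """Average of HOTA values computed at several localization thresholds."""
    return sum(per_threshold_scores) / len(per_threshold_scores)


# Perfect detection, but an identity switch halves every association score.
switched = hota_at_threshold([0.5, 0.5, 0.5, 0.5], num_fn=0, num_fp=0)

# Perfect association, but half of the ground truth objects are missed.
missed = hota_at_threshold([1.0, 1.0], num_fn=2, num_fp=0)
```

In this toy case the two failure modes produce the same HOTA value, while DetA and AssA reveal which component is responsible.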

For a more comprehensive discussion of how the HOTA metric is computed, one can also read the associated paper or blog post.

What about classification?

I previously mentioned that HOTA can be modified to also capture classification performance, but never clarified exactly how this is done. Although multiple variants for classification-aware HOTA are proposed [7], one possible method is to modify the HOTA score at a certain localization threshold as follows.

Classification Aware HOTA (CA-HOTA)
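One way to write this variant, consistent with the description in this section, is given below. Here C(c) denotes the confidence the model assigned to the correct (ground truth) class of the true positive detection c; the symbol name and exact form are a reconstruction from that description and should be checked against [7].

```latex
\[
\mathrm{CA\text{-}HOTA}_{\alpha} \;=\; \sqrt{\frac{\sum_{c \in \mathrm{TP}} \mathcal{A}(c)\, \mathcal{C}(c)}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}}
\]
```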

As can be seen above, classification performance is incorporated by scaling all contributions to the HOTA metric by the associated confidence of the correct class for a given true positive detection. As a result, the HOTA metric will deteriorate if classification performance is poor. Given the above definition, the aggregate CA-HOTA metric can then be computed as an average of scores over different localization thresholds (i.e., just as for vanilla HOTA).
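The effect of this scaling can be sketched numerically. The snippet below assumes that each true positive contributes its association score multiplied by the confidence assigned to its correct class, and that the denominator is the usual count of true positives, false negatives, and false positives; this is an illustration of the idea under those assumptions, with hypothetical function and argument names, and not a reference implementation.

```python
import math


def ca_hota_at_threshold(assoc_scores, class_confidences, num_fn, num_fp):
    """Classification-aware HOTA at one localization threshold (sketch).

    assoc_scores: association score A(c) per true positive detection.
    class_confidences: confidence given to the correct class, per true positive.
    """
    if len(assoc_scores) != len(class_confidences):
        raise ValueError("need one class confidence per true positive")
    denom = len(assoc_scores) + num_fn + num_fp
    if denom == 0:
        return 0.0
    weighted = sum(a * c for a, c in zip(assoc_scores, class_confidences))
    return math.sqrt(weighted / denom)


# Identical tracking quality, different classification quality.
confident = ca_hota_at_threshold([1.0, 1.0], [1.0, 1.0], num_fn=0, num_fp=0)
unsure = ca_hota_at_threshold([1.0, 1.0], [0.5, 0.5], num_fn=0, num_fp=0)
```

With perfect detection and association, halving the confidence in the correct class lowers the score from 1 to the square root of 0.5.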


Currently, the video object detection community lacks a unified evaluation metric that correctly captures model performance. Commonly, mAP metrics are used and averaged across video frames to characterize performance, but such a technique completely disregards the temporal aspect of model performance. Drawing upon recent progress in the object tracking community, I claim that the higher order tracking accuracy (HOTA) metric is a suitable evaluation criterion for object detection in video. The classification aware variant of HOTA, CA-HOTA, captures all relevant aspects of video object detection performance, including detection, association, localization, and classification. As such, it is a comprehensive metric (especially in comparison to static metrics like mAP) that can and should be used for benchmarking different methodologies in video object detection. I hope this writeup will spark discussion within the community and lead to more standardized and comprehensive benchmarking for object detection in video.

For more of Cameron’s work, check out his Medium page here.
