YOLOV: Making Still Image Object Detectors Great at Video Object Detection

Yuheng Shi; Naiyan Wang; Xiaojie Guo

Video object detection (VID) is challenging because object appearance varies widely and some frames suffer from diverse forms of degradation. On the positive side, detection in a given video frame, unlike detection in a still image, can draw support from other frames. Hence, how to aggregate features across different frames is pivotal to the VID problem. Most existing aggregation algorithms are customized for two-stage detectors. However, detectors in this category are usually computationally expensive because of their two-stage nature. This work proposes a simple yet effective strategy to address these concerns, which yields significant gains in accuracy at marginal overhead. Concretely, unlike the traditional two-stage pipeline, we advocate placing the region-level selection after the one-stage detection to avoid processing massive numbers of low-quality candidates. In addition, a novel module is constructed to evaluate the relationship between a target frame and its reference frames and to guide the aggregation. Extensive experiments and ablation studies verify the efficacy of our design and reveal its superiority over other state-of-the-art VID approaches in both effectiveness and efficiency. Our YOLOX-based model achieves promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU), making it attractive for large-scale or real-time applications. The implementation is simple, and the demo code and models are available at https://github.com/YuHengsss/YOLOV.
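
The two ideas in the abstract, selecting a small set of candidates after one-stage detection and then aggregating features across frames, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names `select_top_k` and `aggregate`, the use of confidence scores for selection, and the cosine-similarity softmax weighting are all assumptions made for illustration, and the features are random placeholders rather than real detector outputs.

```python
import numpy as np

def select_top_k(features, scores, k):
    """Keep the k highest-scoring candidates of one frame (hypothetical helper)."""
    idx = np.argsort(scores)[::-1][:k]
    return features[idx], scores[idx]

def aggregate(features):
    """Similarity-weighted aggregation over all selected candidates (assumed scheme)."""
    # L2-normalise so the dot product is a cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Row-wise softmax turns similarities into aggregation weights.
    weights = np.exp(sim - sim.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features

# Placeholder data: 4 frames, 100 dense candidates each, 16-dim features.
rng = np.random.default_rng(0)
num_frames, num_candidates, dim, k = 4, 100, 16, 5

selected = []
for _ in range(num_frames):
    feats = rng.normal(size=(num_candidates, dim))
    scores = rng.uniform(size=num_candidates)
    kept, _ = select_top_k(feats, scores, k)
    selected.append(kept)

pooled = np.concatenate(selected, axis=0)  # (num_frames * k, dim)
enhanced = aggregate(pooled)               # same shape as pooled
```

The point of the sketch is the cost argument: the pairwise similarity matrix is built over `num_frames * k` selected candidates rather than over every dense prediction, which is why selecting after detection keeps the aggregation overhead small.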

updated: Sat Aug 20 2022 14:12:06 GMT+0000 (UTC)

published: Sat Aug 20 2022 14:12:06 GMT+0000 (UTC)

arXiv
