YOLOV: Making Still Image Object Detectors Great at Video Object Detection

Yuheng Shi; Naiyan Wang; Xiaojie Guo

Video object detection (VID) is challenging because object appearance varies widely and some frames are degraded in diverse ways. On the positive side, detection in a given video frame, unlike detection in a still image, can draw support from other frames. How to aggregate features across frames is therefore pivotal to the VID problem. Most existing aggregation algorithms are customized for two-stage detectors, which are usually computationally expensive because of their two-stage nature. This work proposes a simple yet effective strategy to address these concerns, achieving significant accuracy gains at marginal overhead. Concretely, unlike the traditional two-stage pipeline, we select important regions after one-stage detection to avoid processing a large number of low-quality candidates. In addition, we evaluate the relationship between a target frame and its reference frames to guide the aggregation. We conduct extensive experiments and ablation studies to verify the efficacy of our design and show its superiority over other state-of-the-art VID approaches in both effectiveness and efficiency. Our YOLOX-based model achieves promising performance (e.g., 87.5% AP50 at over 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU), making it attractive for large-scale or real-time applications. The implementation is simple; the demo code and models are available at https://github.com/YuHengsss/YOLOV.
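The two ideas named in the abstract, keeping only the important regions after one-stage detection and weighting reference-frame features by their relationship to the target frame, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulation, so the top-k selection by confidence, the cosine similarity, the softmax weighting, and the names `select_top_k`, `cosine`, and `aggregate` are all assumptions made for illustration.

```python
import math

def select_top_k(features, scores, k):
    """Keep the k highest-scoring candidates from one frame's predictions.

    Assumption: "important regions" are approximated by confidence ranking.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in order], [scores[i] for i in order]

def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def aggregate(target, references, temperature=1.0):
    """Fuse a target feature with reference features.

    Assumption: the weights are a softmax over cosine similarity to the
    target, so references that resemble the target contribute more.
    Returns the fused feature and the weights (target first).
    """
    pool = [target] + list(references)
    sims = [cosine(target, f) / temperature for f in pool]
    peak = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * f[d] for w, f in zip(weights, pool)) for d in range(len(target))]
    return fused, weights
```

In this sketch, `select_top_k` would run on each frame's one-stage predictions first, and `aggregate` would then run only on the retained candidates, which is the step that keeps the low-quality candidates out of the aggregation.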

updated: Sun Mar 05 2023 09:22:53 GMT+0000 (UTC)

published: Sat Aug 20 2022 14:12:06 GMT+0000 (UTC)

arXiv
