TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers

Qianyu Zhou; Xiangtai Li; Lu He; Yibo Yang; Guangliang Cheng; Yunhai Tong; Lizhuang Ma; Dacheng Tao

Detection Transformer (DETR) and Deformable DETR were proposed to eliminate the need for many hand-designed components in object detection while performing on par with earlier, more complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, effectively removing the need for many hand-crafted feature-aggregation components, e.g., optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the current-frame detection results. These designs boost the strong baseline, Deformable DETR, by a significant margin (3%-4% mAP) on the ImageNet VID dataset. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU. Code and models will be available for further research.
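
The abstract says only that the Temporal Query Encoder fuses object queries across frames; it gives no layer details. As a rough illustration of that idea, and not the authors' implementation, the sketch below assumes the fusion is a single-head cross-attention step in which the current frame's queries attend to the queries of the reference frames. The function name `fuse_queries`, the tensor shapes, and the omission of learned projection matrices are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_queries(current, references):
    """Hypothetical temporal query fusion (illustrative only).

    current:    (n, d) object queries of the current frame
    references: (t, n, d) object queries of t reference frames
    returns:    (n, d) current-frame queries enriched with temporal context
    """
    n, d = current.shape
    keys = references.reshape(-1, d)          # flatten frames: (t * n, d)
    scores = current @ keys.T / np.sqrt(d)    # scaled dot-product: (n, t * n)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    # Residual connection; learned Q/K/V projections are omitted here.
    return current + weights @ keys


rng = np.random.default_rng(0)
current = rng.standard_normal((4, 8))         # 4 queries, 8 channels (assumed)
references = rng.standard_normal((3, 4, 8))   # 3 reference frames (assumed)
fused = fuse_queries(current, references)
print(fused.shape)
```

Because the sketch ends with a residual connection, reference queries that carry no signal (all zeros) leave the current-frame queries unchanged, which is a convenient sanity check.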

updated: Mon Jan 17 2022 02:06:34 GMT+0000 (UTC)

published: Thu Jan 13 2022 16:17:34 GMT+0000 (UTC)

arXiv
