TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers

Qianyu Zhou; Xiangtai Li; Lu He; Yibo Yang; Guangliang Cheng; Yunhai Tong; Lizhuang Ma; Dacheng Tao

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. In addition, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current-frame detection results. These designs boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
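The abstract says the Temporal Query Encoder fuses object queries across frames but gives no implementation detail. As a rough illustration only, the sketch below shows one plausible form of such fusion: each current-frame query attends to the queries of reference frames through scaled dot-product attention, and the result is added back as a residual. The function names (`fuse_queries`, `softmax`, `dot`) are hypothetical, and the sketch omits learned projections, multiple heads, and deformable attention, so it should not be read as the authors' TQE.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(current_queries, reference_queries):
    """Fuse current-frame queries with reference-frame queries.

    Assumption: plain single-head cross-attention with a residual
    connection and no learned weights. This is an illustrative
    stand-in, not the TQE described in the paper.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        # Similarity of this query to every reference-frame query.
        weights = softmax([dot(query, ref) * scale for ref in reference_queries])
        # Weighted average of the reference queries.
        context = [
            sum(w * ref[i] for w, ref in zip(weights, reference_queries))
            for i in range(dim)
        ]
        # Residual connection keeps the original query information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


if __name__ == "__main__":
    current = [[1.0, 0.0], [0.0, 1.0]]                 # queries of the current frame
    reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # queries pooled from other frames
    for row in fuse_queries(current, reference):
        print(row)
```

The output keeps one fused vector per current-frame query, which matches the idea that detections are still produced for the current frame while drawing on temporal context.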

updated: Tue Nov 22 2022 06:07:22 GMT+0000 (UTC)

published: Thu Jan 13 2022 16:17:34 GMT+0000 (UTC)

arXiv
