End-to-End Video Object Detection with Spatial-Temporal Transformers

Lu He; Qianyu Zhou; Xiangtai Li; Li Niu; Guangliang Cheng; Xiao Li; Wenxuan Liu; Yunhai Tong; Lizhuang Ma; Liqing Zhang

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance comparable to that of previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the VOD pipeline, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow, recurrent neural networks, and relation networks. Moreover, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the spatial details of multiple frames, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder to obtain the detection results for the current frame. These designs boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope our TransVOD can provide a new perspective for video object detection. Code will be made publicly available at https://github.com/SJTU-LuHe/TransVOD.
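The idea of fusing object queries across frames can be illustrated with a minimal sketch. This is not the TransVOD implementation: it assumes plain scaled dot-product attention with a residual connection in place of the paper's deformable attention, and the function name `fuse_queries`, the 2-dimensional toy queries, and the pooling of reference-frame queries into one list are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(current_queries, reference_queries):
    """Hypothetical stand-in for temporal query fusion.

    Each current-frame query attends over all reference-frame queries
    (scaled dot-product attention), and the attended context is added
    back to the query as a residual.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        weights = softmax([dot(query, ref) * scale for ref in reference_queries])
        context = [
            sum(w * ref[i] for w, ref in zip(weights, reference_queries))
            for i in range(dim)
        ]
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy data (assumed): two queries for the current frame, three queries
# pooled from neighbouring reference frames.
current = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_queries(current, reference)
```

In this toy setting, each fused query keeps the shape of the original query and is shifted toward the reference-frame queries it is most similar to. A full model would feed such fused queries to a decoder that predicts boxes and classes for the current frame.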

updated: Sun May 23 2021 11:44:22 GMT+0000 (UTC)

published: Sun May 23 2021 11:44:22 GMT+0000 (UTC)

arXiv
