Target Adaptive Context Aggregation for Video Scene Graph Generation

Yao Teng; Limin Wang; Zhifeng Li; Gangshan Wu

This paper addresses the challenging task of video scene graph generation (VidSGG), which could serve as a structured video representation for high-level understanding tasks. We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design and presents two unique blocks: Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, our HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy that tracks the results detected by TRACE to yield the video-level VidSGG. We perform experiments on two VidSGG benchmarks, ImageNet-VidVRD and Action Genome, and the results demonstrate that our TRACE achieves state-of-the-art performance. The code and models are made available at https://github.com/MCG-NJU/TRACE.
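
The abstract does not spell out the temporal association strategy, so the following is only an illustrative sketch of how a detect-to-track step could link frame-level relation detections into video-level relation instances. It assumes a greedy rule: a detection extends a track from the previous frame when the (subject, predicate, object) labels match and both boxes overlap by at least an IoU threshold. The names `Detection`, `Track`, `iou`, `associate`, and the default threshold of 0.5 are hypothetical and are not taken from the TRACE code base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    """One frame-level relation detection (hypothetical structure)."""
    triplet: Tuple[str, str, str]  # (subject label, predicate, object label)
    subj_box: Box
    obj_box: Box
    score: float


@dataclass
class Track:
    """A video-level relation instance spanning consecutive frames."""
    triplet: Tuple[str, str, str]
    start: int
    detections: List[Detection] = field(default_factory=list)

    @property
    def end(self) -> int:
        # Index of the last frame covered by this track.
        return self.start + len(self.detections) - 1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(frames: List[List[Detection]], iou_thresh: float = 0.5) -> List[Track]:
    """Greedily link per-frame detections into tracks (assumed rule, see lead-in)."""
    tracks: List[Track] = []
    for t, dets in enumerate(frames):
        # Only tracks that ended in the previous frame can be extended.
        open_tracks = [tr for tr in tracks if tr.end == t - 1]
        used = set()
        # Higher-scoring detections choose their track first.
        for det in sorted(dets, key=lambda d: d.score, reverse=True):
            best, best_overlap = None, iou_thresh
            for i, tr in enumerate(open_tracks):
                if i in used or tr.triplet != det.triplet:
                    continue
                last = tr.detections[-1]
                # Require both the subject and the object box to overlap.
                overlap = min(iou(last.subj_box, det.subj_box),
                              iou(last.obj_box, det.obj_box))
                if overlap >= best_overlap:
                    best, best_overlap = i, overlap
            if best is None:
                tracks.append(Track(det.triplet, t, [det]))
            else:
                open_tracks[best].detections.append(det)
                used.add(best)
    return tracks
```

Calling `associate` on a list of per-frame detection lists returns one `Track` per linked relation instance. Matching on the weaker of the two box overlaps is a design choice of this sketch: it keeps a relation from continuing when only one of its two entities is still in place.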

updated: Wed Aug 18 2021 12:46:28 GMT+0000 (UTC)

published: Wed Aug 18 2021 12:46:28 GMT+0000 (UTC)

arXiv
