What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Sangmin Woo; Junhyug Noh; Kangil Kim

Identifying relations between objects is central to understanding a scene. While several works have addressed relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when do relations start and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these methods and then propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., measuring how probable it is that a relation exists between them. TSPN tells when to look: it simultaneously predicts the start-end timestamps (i.e., temporal spans) and the categories of all possible relations by utilizing the full video context. These two designs enable a win-win scenario: TSPN trains at least 2X faster than existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach. Code is available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.
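
As a rough illustration of the "what to look" idea, the sketch below enumerates ordered object pairs and keeps only the highest-scoring ones. It is not the authors' implementation: the function names, the toy score table, and the top-k cutoff are all assumptions made for this example, and in TSPN the relationness score would come from a learned model rather than a lookup.

```python
from itertools import permutations

def top_k_pairs(objects, relationness, k):
    """Keep the k ordered (subject, object) pairs with the highest score.

    `relationness` is a hypothetical callable mapping a pair to a score;
    it stands in for a learned scorer and is an assumption of this sketch.
    """
    pairs = list(permutations(objects, 2))  # all ordered pairs, no self-pairs
    ranked = sorted(pairs, key=relationness, reverse=True)
    return ranked[:k]

# Toy scores for illustration only; these are made-up values, not results.
TOY_SCORES = {
    ("person", "dog"): 0.9,
    ("dog", "person"): 0.7,
    ("person", "frisbee"): 0.4,
}

def toy_relationness(pair):
    # Pairs missing from the table get a low default score.
    return TOY_SCORES.get(pair, 0.05)

kept = top_k_pairs(["person", "dog", "frisbee"], toy_relationness, k=2)
print(kept)
```

With three objects there are six ordered pairs; only the two retained pairs would be passed on to the later step that predicts temporal spans and relation categories, which is where the reduction in search space comes from.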

updated: Wed Oct 05 2022 09:41:51 GMT+0000 (UTC)

published: Thu Jul 15 2021 07:01:26 GMT+0000 (UTC)

arXiv
