Video Self-Stitching Graph Network for Temporal Action Localization

Chen Zhao; Ali Thabet; Bernard Ghanem

Temporal action localization (TAL) in videos is a challenging task, especially because actions vary widely in temporal scale. Short actions usually make up the majority of the data, yet all current methods perform worst on them. In this paper, we confront the challenge of short actions and propose a multi-level cross-scale solution dubbed video self-stitching graph network (VSGN). VSGN has two key components: video self-stitching (VSS) and a cross-scale graph pyramid network (xGPN). In VSS, we focus on a short period of a video and magnify it along the temporal dimension to obtain a larger scale. Our self-stitching approach places the original clip and its magnified counterpart in one input sequence, taking advantage of the complementary properties of both scales. The xGPN component further exploits cross-scale correlations through a pyramid of cross-scale graph networks, each containing a hybrid temporal-graph module that aggregates features across scales as well as within the same scale. VSGN not only enhances the feature representations, but also generates more positive anchors for short actions and more short training samples. Experiments demonstrate that VSGN clearly improves localization performance on short actions and achieves state-of-the-art overall performance on ActivityNet-v1.3, reaching an average mAP of 35.07%.
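
The self-stitching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the clip is magnified or how the two copies are joined, so the linear interpolation, the zero-filled gap, and the names `magnify_temporal`, `self_stitch`, `factor`, and `gap` are all assumptions made for illustration.

```python
from typing import List

# A clip is a list of T time steps, each a C-dimensional feature vector.
Clip = List[List[float]]


def magnify_temporal(features: Clip, factor: float) -> Clip:
    """Stretch a clip along time by linear interpolation (an assumed choice)."""
    t_in = len(features)
    t_out = max(1, round(t_in * factor))
    if t_in == 1 or t_out == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(features[0]) for _ in range(t_out)]
    magnified = []
    for i in range(t_out):
        # Position of output step i on the input time axis.
        pos = i * (t_in - 1) / (t_out - 1)
        lo = int(pos)
        hi = min(lo + 1, t_in - 1)
        w = pos - lo
        magnified.append(
            [(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])]
        )
    return magnified


def self_stitch(features: Clip, factor: float = 2.0, gap: int = 1) -> Clip:
    """Join the original clip and its magnified copy into one sequence.

    The zero-filled gap between the two parts is a hypothetical separator.
    """
    channels = len(features[0])
    original = [list(frame) for frame in features]
    padding = [[0.0] * channels for _ in range(gap)]
    return original + padding + magnify_temporal(features, factor)


# Usage: a 4-step clip with 2 channels becomes a 4 + 1 + 8 = 13-step sequence.
clip = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
stitched = self_stitch(clip, factor=2.0, gap=1)
print(len(stitched))
```

In this sketch a short action covers twice as many time steps in the magnified part as in the original part, which is one way to see why a stitched input could yield more positive anchors for short actions.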

updated: Sun Dec 13 2020 07:04:05 GMT+0000 (UTC)

published: Mon Nov 30 2020 07:44:52 GMT+0000 (UTC)

arXiv
