AEI: Actors-Environment Interaction with Adaptive Attention for Temporal Action Proposals Generation

Khoa Vo; Hyekang Joo; Kashu Yamazaki; Sang Truong; Kris Kitani; Ngan Le

Humans typically recognize that an action is taking place in a video by observing the interaction between an actor and the surrounding environment. An action starts only when the main actor begins to interact with the environment, and it ends when the main actor stops interacting. Despite great progress in temporal action proposal generation, most existing works ignore this fact and leave the model to learn to propose actions as a black box. In this paper, we attempt to simulate this human ability by proposing the Actor Environment Interaction (AEI) network, which improves the video representation for temporal action proposal generation. AEI contains two modules: a perception-based visual representation (PVR) and a boundary-matching module (BMM). PVR represents each video snippet by taking human-human and human-environment relations into account through the proposed adaptive attention mechanism. The BMM then takes the resulting video representation and generates action proposals. AEI is comprehensively evaluated on the ActivityNet-1.3 and THUMOS-14 datasets, on temporal action proposal and detection tasks, with two boundary-matching architectures (CNN-based and GCN-based) and two classifiers (Unet and P-GCN). AEI consistently outperforms state-of-the-art methods, with strong performance and generalization on both temporal action proposal generation and temporal action detection.
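The abstract gives no equations for PVR, so the following is only an illustrative sketch of how a snippet representation could combine human-human and human-environment relations with attention. It is not the authors' implementation. The function names (`attend`, `snippet_representation`), the use of plain scaled dot-product attention, and the concatenation-based fusion are all assumptions made for illustration. The paper's adaptive attention mechanism may differ substantially.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention-weighted sum of `values` and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


def snippet_representation(env_feat, actor_feats):
    """Hypothetical fusion of one snippet's environment and actor features.

    Assumption: all features share the same dimension.
    """
    if not actor_feats:
        # No actors detected: pad the actor part with zeros.
        return list(env_feat) + [0.0] * len(env_feat)
    # Human-human relations: every actor attends to all actors.
    related = [attend(a, actor_feats, actor_feats)[0] for a in actor_feats]
    # Human-environment relations: the environment feature queries the
    # relation-aware actor features and pools them into one vector.
    actor_summary, _ = attend(env_feat, related, related)
    # Concatenate environment and pooled actor features (assumed fusion).
    return list(env_feat) + actor_summary


env = [1.0, 0.0]
actors = [[1.0, 0.0], [0.0, 1.0]]
rep = snippet_representation(env, actors)
print(len(rep))
```

In this sketch the per-snippet vectors would be stacked over time and handed to a boundary-matching module. That step is omitted because the abstract does not describe the module's interface.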

updated: Thu Oct 21 2021 20:43:42 GMT+0000 (UTC)

published: Thu Oct 21 2021 20:43:42 GMT+0000 (UTC)

arXiv
