Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation

Shuaicheng Li; Feng Zhang; Rui-Wei Zhao; Rui Feng; Kunlin Yang; Lingbo Liu; Jun Hou

Temporal action proposal generation aims to locate action instances, each delimited by a start and an end frame, in untrimmed videos, and it has been found to benefit substantially from proper exploitation of temporal and semantic context. Recent efforts have modeled temporal context and similarity-based semantic context through self-attention modules. However, they still suffer from cluttered background information and limited contextual feature learning. In this paper, we propose a novel Pyramid Region-based Slot Attention (PRSlot) module to address these issues. Instead of relying on similarity computation, our PRSlot module directly learns local relations in an encoder-decoder manner and, based on attention over the input features, generates an enhanced representation of a local region, called a slot. Specifically, given snippet-level input features, the PRSlot module takes the target snippet as the query and its surrounding region as the key, then generates a slot representation for each query-key slot by aggregating the local snippet context with a parallel pyramid strategy. Based on PRSlot modules, we present a novel Pyramid Region-based Slot Attention Network, termed PRSA-Net, which learns a unified visual representation with rich temporal and semantic context for better proposal generation. Extensive experiments are conducted on two widely adopted benchmarks, THUMOS14 and ActivityNet-1.3. Our PRSA-Net outperforms other state-of-the-art methods. In particular, on THUMOS14 we improve AR@100 from the previous best of 50.67% to 56.12% for proposal generation and raise the mAP at a tIoU threshold of 0.5 from 51.9% to 58.7% for action detection. Code is available at https://github.com/handhand123/PRSA-Net
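
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of temporal IoU (tIoU) and of recall over the top-N proposals at a single tIoU threshold. It is not taken from the PRSA-Net code base: the names `tiou` and `recall_at_n`, the `(start, end)` segment format in seconds, and the assumption that proposals arrive already sorted by confidence are all illustrative. Benchmark protocols usually report AR@N averaged over several tIoU thresholds and over all videos; this sketch evaluates one video at one threshold only.

```python
from typing import List, Tuple

# Hypothetical segment format: (start, end) in seconds.
Segment = Tuple[float, float]


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_n(
    proposals: List[Segment],
    ground_truth: List[Segment],
    n: int,
    threshold: float,
) -> float:
    """Fraction of ground-truth instances matched by any of the top-n proposals.

    Assumes `proposals` is already sorted by descending confidence.
    """
    if not ground_truth:
        return 0.0
    top = proposals[:n]
    hits = sum(
        1 for gt in ground_truth
        if any(tiou(p, gt) >= threshold for p in top)
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    gts = [(0.0, 10.0), (20.0, 30.0)]
    props = [(1.0, 9.0), (50.0, 60.0), (21.0, 31.0)]
    print(recall_at_n(props, gts, n=2, threshold=0.5))
    print(recall_at_n(props, gts, n=3, threshold=0.5))
```

With a larger proposal budget more ground-truth instances can be matched, which is why recall is reported at a fixed N such as 100.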

updated: Tue Jun 21 2022 03:40:58 GMT+0000 (UTC)

published: Tue Jun 21 2022 03:40:58 GMT+0000 (UTC)

arXiv
