Temporal-Relational CrossTransformers for Few-Shot Action Recognition

Toby Perrett; Alessandro Masullo; Tilo Burghardt; Majid Mirmehdi; Dima Damen

We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
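
The two ideas in the abstract, ordered frame tuples and attention-built, query-specific class prototypes, can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, frame features are plain lists of floats, and the learned projections that a CrossTransformer would apply to queries, keys and values are omitted for brevity, so the attention here is plain scaled dot-product over raw tuple features.

```python
import math
from itertools import combinations


def ordered_tuples(num_frames, cardinality):
    """All frame-index tuples of the given size, kept in temporal order."""
    return list(combinations(range(num_frames), cardinality))


def tuple_features(frames, cardinality):
    """Concatenate per-frame feature vectors for every ordered tuple."""
    return [
        [x for i in idx for x in frames[i]]
        for idx in ordered_tuples(len(frames), cardinality)
    ]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_specific_prototype(query_tuple, support_tuples):
    """Attention-weighted mix of support tuples for one query tuple.

    Assumption: no learned projections; raw features act as
    queries, keys and values.
    """
    scale = math.sqrt(len(query_tuple))
    weights = softmax([dot(query_tuple, s) / scale for s in support_tuples])
    dim = len(query_tuple)
    return [
        sum(w * s[d] for w, s in zip(weights, support_tuples))
        for d in range(dim)
    ]


def class_distance(query_frames, support_videos, cardinality=2):
    """Mean squared distance between query tuples and their prototypes."""
    q_tuples = tuple_features(query_frames, cardinality)
    # Pool tuples from every support video of the class, so the prototype
    # can draw on all of them rather than a class average or a single match.
    s_tuples = [
        t for video in support_videos
        for t in tuple_features(video, cardinality)
    ]
    total = 0.0
    for q in q_tuples:
        p = query_specific_prototype(q, s_tuples)
        total += sum((a - b) ** 2 for a, b in zip(q, p))
    return total / len(q_tuples)


# Toy example: two classes whose videos show the same frames in opposite order.
query = [[1.0, 0.0], [0.0, 1.0]]
class_a = [[[1.0, 0.0], [0.0, 1.0]]]  # same temporal order as the query
class_b = [[[0.0, 1.0], [1.0, 0.0]]]  # reversed temporal order
print(class_distance(query, class_a), class_distance(query, class_b))
```

Because each tuple keeps its frames in temporal order, the reversed video in the toy example ends up farther from the query than the correctly ordered one, even though both contain the same frames. Running the sketch with several cardinalities (pairs, triples, and so on) and summing the distances would mirror the "varying numbers of frames" idea, again only as an illustration.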

updated: Thu Mar 18 2021 15:02:00 GMT+0000 (UTC)

published: Fri Jan 15 2021 15:47:35 GMT+0000 (UTC)

arXiv
