Inductive Attention for Video Action Anticipation

Tsung-Ming Tai; Giuseppe Fiameni; Cheng-Kuang Lee; Simon See; Oswald Lanz

Anticipating future actions from video observations is an important task in video understanding, and is useful for precautionary systems that need response time to react before an event occurs. Because the input for action anticipation consists only of pre-action frames, models lack sufficient information about the target action; moreover, similar pre-action frames may lead to different futures. Consequently, solutions built on existing action recognition models can only be suboptimal. Recently, researchers have proposed using a longer video context to remedy the insufficient information in pre-action intervals, and using self-attention to query relevant past moments to address the anticipation problem. However, using video input features as the query is indirect and might be inefficient, since these features serve only as a proxy for the anticipation goal. To this end, we propose an inductive attention model, which transparently uses the prior prediction as the query to derive the anticipation result by induction from past experience. Our method naturally accounts for the uncertainty of multiple futures via many-to-many association. On large-scale egocentric video datasets, our model consistently outperforms the state of the art that uses the same backbone, is competitive with methods that employ a stronger backbone, and is more efficient, with fewer model parameters.
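
The abstract does not spell out the architecture, so the following is only a toy sketch of the general idea of querying past observations with the prior prediction instead of with the current input features. Everything in it is an assumption made for illustration: the function names (`attend`, `anticipate`), the use of per-frame class scores as attention values, and the use of per-frame class distributions as keys are hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def anticipate(frame_scores):
    """Roll through pre-action frames, querying the past with the prior prediction.

    frame_scores: one list of per-class scores per observed frame
    (a hypothetical stand-in for backbone outputs).
    Returns a probability distribution over future action classes.
    """
    # Assumption: the first prediction is bootstrapped from the first frame alone.
    prediction = softmax(frame_scores[0])
    keys = [prediction]
    values = [frame_scores[0]]
    for scores in frame_scores[1:]:
        keys.append(softmax(scores))
        values.append(scores)
        # The query is the previous prediction, not the current frame's features.
        context = attend(prediction, keys, values)
        prediction = softmax(context)
    return prediction


probs = anticipate([[2.0, 0.1, 0.1], [1.5, 0.3, 0.2], [1.8, 0.2, 0.1]])
```

Because the running prediction stays a full distribution over classes rather than a single label, several candidate futures remain in play at every step; this is only loosely analogous to the many-to-many association the abstract mentions.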

updated: Sat Dec 17 2022 09:51:17 GMT+0000 (UTC)

published: Sat Dec 17 2022 09:51:17 GMT+0000 (UTC)

arXiv
