Anticipative Video Transformer

Rohit Girdhar; Kristen Grauman

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence while also learning frame feature encoders that are predictive of the features of successive future frames. Compared to existing temporal aggregation strategies, AVT has the advantage of maintaining the sequential progression of observed actions while still capturing long-range dependencies, both of which are critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks (EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads) and wins first place in the EpicKitchens-100 CVPR'21 challenge.
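
The two ideas in the abstract, attending only to previously observed frames and predicting the features of the next frame, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names `causal_attention` and `future_feature_loss` are hypothetical, the attention uses raw features as queries, keys and values with no learned projections, and mean squared error stands in for whatever feature-prediction loss the paper actually uses.

```python
import math


def causal_attention(frames):
    """Single-head self-attention where frame t attends only to frames 0..t.

    Assumption: queries, keys and values are the raw features themselves
    (no learned projections), purely to illustrate causal masking.
    """
    dim = len(frames[0])
    outputs = []
    for t, query in enumerate(frames):
        past = frames[: t + 1]  # causal mask: no access to future frames
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in past
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, past)) / total
            for i in range(dim)
        ])
    return outputs


def future_feature_loss(predicted, frames):
    """Mean squared error between the output at step t and the feature at t+1.

    Assumption: MSE is a stand-in for the feature-prediction objective.
    """
    pairs = zip(predicted[:-1], frames[1:])
    squared = [(p - f) ** 2 for pred, nxt in pairs for p, f in zip(pred, nxt)]
    return sum(squared) / len(squared)


# Toy sequence of three 2-dimensional frame features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aggregated = causal_attention(frames)
loss = future_feature_loss(aggregated, frames)
```

Because each output depends only on the current and earlier frames, the order of observed frames is preserved while every earlier frame remains reachable, however far back it lies. In a jointly trained model, a loss of this kind would be added to a classification loss on the next action.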

updated: Wed Sep 22 2021 17:06:02 GMT+0000 (UTC)

published: Thu Jun 03 2021 17:57:55 GMT+0000 (UTC)

arXiv
