Anticipative Video Transformer

Rohit Girdhar; Kristen Grauman

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next action in a video sequence, while also learning frame feature encoders that are predictive of successive future frames' features. Compared to existing temporal aggregation strategies, AVT has the advantage of maintaining the sequential progression of observed actions while still capturing long-range dependencies, both of which are critical for the anticipation task. Through extensive experiments, we show that AVT obtains the best reported performance on four popular action anticipation benchmarks: EpicKitchens-55, EpicKitchens-100, EGTEA Gaze+, and 50-Salads, and outperforms all submissions to the EpicKitchens-100 CVPR'21 challenge.
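The two ideas in the abstract, attending only to previously observed frames and training jointly on the next action and on future frame features, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the masking scheme, the loss form, or any layer details, so the causal (past-only) attention, the squared-error feature term, and the names `causal_attention`, `joint_loss`, and `feat_weight` are all assumptions made here for illustration, and the attention uses no learned weights.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(frames):
    """Hypothetical past-only attention: frame t attends to frames 0..t.

    Plain scaled dot-product attention with the raw frame features used
    as queries, keys, and values (an assumption; no learned projections).
    """
    dim = len(frames[0])
    out = []
    for t, query in enumerate(frames):
        # Scores only for the current and earlier frames (the causal mask).
        scores = [
            sum(q * k for q, k in zip(query, frames[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * frames[s][d] for s, w in enumerate(weights)) for d in range(dim)]
        )
    return out


def joint_loss(predicted_feats, frames, action_logits, next_action, feat_weight=1.0):
    """Assumed joint objective: next-action cross-entropy plus a
    squared-error term between the output at step t and the observed
    frame feature at step t + 1."""
    action_loss = -math.log(softmax(action_logits)[next_action])
    pairs = list(zip(predicted_feats[:-1], frames[1:]))
    feat_loss = sum(
        sum((p - f) ** 2 for p, f in zip(pred, target)) for pred, target in pairs
    ) / len(pairs)
    return action_loss + feat_weight * feat_loss


# Toy usage: three 2-dimensional "frame features" and two action classes.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = causal_attention(frames)
loss = joint_loss(outputs, frames, action_logits=[0.0, 0.0], next_action=0)
```

Because each output depends only on the current and earlier frames, changing a later frame leaves earlier outputs unchanged, which is one simple way to preserve the sequential order of observations while still letting every step look arbitrarily far back.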

updated: Thu Jun 03 2021 17:57:55 GMT+0000 (UTC)

published: Thu Jun 03 2021 17:57:55 GMT+0000 (UTC)

arXiv
