Higher Order Recurrent Space-Time Transformer for Video Action Prediction

Tsung-Ming Tai; Giuseppe Fiameni; Cheng-Kuang Lee; Oswald Lanz

Endowing visual agents with predictive capability is a key step towards video intelligence at scale. The predominant modeling paradigm for this is sequence learning, mostly implemented through LSTMs. Feed-forward Transformer architectures have replaced recurrent model designs in ML applications of language processing and, in part, in computer vision. In this paper we investigate the competitiveness of Transformer-style architectures for video predictive tasks. To do so, we propose HORST, a novel higher-order recurrent layer design whose core element is a spatial-temporal decomposition of self-attention for video. HORST achieves performance competitive with the state of the art on Something-Something early action recognition and EPIC-Kitchens action anticipation, showing evidence of predictive capability that we attribute to our recurrent higher-order design of self-attention.
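
The abstract names a spatial-temporal decomposition of self-attention but gives no details, so the following is only a generic sketch of what factorized space-time attention can look like. It is not the HORST layer. The function names, the absence of learned query/key/value projections, the spatial-then-temporal order, and the causal (past-only) temporal attention are all assumptions made for illustration. The input is a nested list indexed as video[t][n], holding a feature vector for frame t and spatial position n.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def spatial_attention(video):
    """Within each frame, every position attends to all positions of that frame."""
    return [[attend(tok, frame, frame) for tok in frame] for frame in video]


def temporal_attention(video):
    """Per spatial position, attend over the current and earlier frames.

    The causal restriction is an assumption chosen to mirror a predictive
    setting; it is not taken from the paper.
    """
    num_frames, num_positions = len(video), len(video[0])
    out = []
    for t in range(num_frames):
        frame_out = []
        for n in range(num_positions):
            history = [video[s][n] for s in range(t + 1)]
            frame_out.append(attend(video[t][n], history, history))
        out.append(frame_out)
    return out


def factorized_space_time_attention(video):
    """Hypothetical composition: spatial attention followed by temporal attention."""
    return temporal_attention(spatial_attention(video))


if __name__ == "__main__":
    # Toy input: 3 frames, 2 spatial positions, 2 feature channels.
    toy_video = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [1.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]],
    ]
    result = factorized_space_time_attention(toy_video)
    print(len(result), len(result[0]), len(result[0][0]))
```

The usual motivation for such a factorization is cost: joint attention over T frames of N positions scores (T·N)² token pairs, whereas the factorized form scores on the order of T·N² pairs spatially and N·T² pairs temporally.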

updated: Tue Sep 21 2021 05:25:42 GMT+0000 (UTC)

published: Sat Apr 17 2021 23:51:05 GMT+0000 (UTC)

arXiv

