Video Transformers: A Survey

Javier Selva; Anders S. Johansen; Sergio Escalera; Kamal Nasrollahi; Thomas B. Moeslund; Albert Clapés

Transformer models have shown great success modeling long-range interactions. Nevertheless, they scale quadratically with input length and lack inductive biases. These limitations can be further exacerbated when dealing with the high dimensionality of video. Proper modeling of video, which can span from seconds to hours, requires handling long-range interactions. This makes Transformers a promising tool for solving video-related tasks, but some adaptations are required. While previous works study the advances of Transformers for vision tasks, none focuses on an in-depth analysis of video-specific designs. In this survey we analyse and summarize the main contributions and trends for adapting Transformers to model video data. Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens. Furthermore, we study how the Transformer layer has been tweaked to handle longer sequences, generally by reducing the number of tokens in a single attention operation. Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches. Finally, we explore how other modalities are integrated with video and conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D CNN counterparts with equivalent FLOPs and no significant parameter increase.
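The quadratic scaling mentioned in the abstract, and the benefit of restricting how many tokens take part in a single attention operation, can be illustrated with a small counting sketch. This is only an illustration under stated assumptions: the clip size (8 frames of 224x224 pixels), the 16x16 patch size, and the split into a spatial pass and a temporal pass are example choices made here, not figures or a method taken from the survey, and the helper names are hypothetical.

```python
def num_tokens(frames, height, width, patch):
    """Number of non-overlapping patch tokens in a clip (one token per patch per frame)."""
    return frames * (height // patch) * (width // patch)


def joint_attention_pairs(n_tokens):
    """Query-key pairs when every token attends to every other token."""
    return n_tokens * n_tokens


def factorized_attention_pairs(frames, tokens_per_frame):
    """Query-key pairs when attention is split into two restricted passes.

    Spatial pass: tokens attend only within their own frame.
    Temporal pass: tokens attend only to the same patch position in other frames.
    This split is an illustrative assumption, not the survey's prescription.
    """
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


# Example clip (assumed values): 8 frames, 224x224 pixels, 16x16 patches.
frames, height, width, patch = 8, 224, 224, 16
per_frame = num_tokens(1, height, width, patch)
total = num_tokens(frames, height, width, patch)

joint = joint_attention_pairs(total)
factorized = factorized_attention_pairs(frames, per_frame)

print(f"tokens per frame: {per_frame}, tokens per clip: {total}")
print(f"joint attention pairs: {joint}")
print(f"factorized attention pairs: {factorized}")
```

With these example values, doubling the number of frames quadruples the joint pair count, while the restricted variant grows far more slowly. Actual FLOPs also depend on the embedding dimension, the number of heads, and the number of layers, which this count ignores.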

updated: Sun Jan 16 2022 07:31:55 GMT+0000 (UTC)

published: Sun Jan 16 2022 07:31:55 GMT+0000 (UTC)

arXiv
