Survey: Transformer based Video-Language Pre-training

Ludan Ruan; Qin Jin

Inspired by the success of transformer-based pre-training methods on natural language tasks and, subsequently, computer vision tasks, researchers have begun to apply transformers to video processing. This survey aims to give a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding, etc. We then describe the typical paradigm of pre-training and fine-tuning for Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performance. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.

updated: Tue Sep 21 2021 02:36:06 GMT+0000 (UTC)

published: Tue Sep 21 2021 02:36:06 GMT+0000 (UTC)

arXiv

