Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training

Saurabh Sahu; Palash Goyal

The introduction of the Transformer model has led to tremendous advancements in sequence modeling, especially in the text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce the Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M data set on the task of video categorization. We further present ablation studies along with quantitative and qualitative analysis to showcase the improvement.
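The abstract does not spell out how the gate is computed, so the following is only a minimal sketch of the general idea of gating a frame by local and global context, not the authors' architecture. The function `gate_frames`, its weight vectors `w_local` and `w_global`, the `bias`, and the `window` size are all hypothetical names and choices made for illustration; local context is assumed to be the mean of neighbouring frames and global context the mean over the whole clip.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    """Logistic function; maps a real score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_frames(frames, w_local, w_global, bias=0.0, window=1):
    """Scale each frame feature vector by a scalar relevance gate.

    Assumption (not from the paper): the gate for frame t is a sigmoid of a
    linear score over the mean of frames within `window` steps of t (local
    context) and the mean of all frames (global context).
    """
    global_ctx = mean_vector(frames)
    gated, gates = [], []
    for t in range(len(frames)):
        lo = max(0, t - window)
        hi = min(len(frames), t + window + 1)
        local_ctx = mean_vector(frames[lo:hi])
        g = sigmoid(dot(w_local, local_ctx) + dot(w_global, global_ctx) + bias)
        gates.append(g)
        gated.append([g * x for x in frames[t]])
    return gated, gates


# Toy clip of three 2-dimensional frame features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gated, gates = gate_frames(frames, w_local=[0.5, -0.5], w_global=[0.1, 0.1])
```

In a trained model the weights would be learned and the gate would typically sit inside the attention layers; here they are fixed by hand only to show how the two context levels can combine into one per-frame relevance score.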

updated: Thu Mar 18 2021 06:39:09 GMT+0000 (UTC)

published: Thu Mar 18 2021 06:39:09 GMT+0000 (UTC)

arXiv
