Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition

Ziyuan Huang; Zhiwu Qing; Xiang Wang; Yutong Feng; Shiwei Zhang; Jianwen Jiang; Zhurong Xia; Mingqian Tang; Nong Sang; Marcelo H. Ang Jr

Vision transformers have seen a recent surge of research and have demonstrated remarkable potential in challenging computer vision applications such as image recognition, point cloud classification, and video understanding. In this paper, we present empirical results on training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset. Specifically, we explore training techniques for video vision transformers, including augmentations, resolutions, and initialization. With our training recipe, a single ViViT model achieves a performance of 47.4% on the validation set of the EPIC-KITCHENS-100 dataset, outperforming the result reported in the original paper by 3.4%. We find that video transformers are especially good at predicting the noun in the verb-noun action prediction task, which makes their overall action prediction accuracy notably higher than that of convolutional networks. Surprisingly, even the best video transformers underperform convolutional networks on verb prediction. We therefore combine video vision transformers with several convolutional video networks and present our solution to the EPIC-KITCHENS-100 Action Recognition competition.
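The idea of combining models with complementary strengths (transformers on nouns, convolutional networks on verbs) can be illustrated with a minimal sketch. The abstract does not describe how the models are actually combined, so everything here is an assumption made for illustration: the function names `softmax`, `ensemble`, and `predict_action` are hypothetical, the models are fused by a weighted average of class probabilities, and the action is taken as the independently chosen (verb, noun) pair.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(per_model_logits, weights=None):
    """Weighted average of per-model class probabilities.

    Assumption: simple late fusion; the paper's actual scheme may differ.
    """
    if weights is None:
        weights = [1.0] * len(per_model_logits)
    total_weight = sum(weights)
    probs = [softmax(logits) for logits in per_model_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_weight
        for c in range(num_classes)
    ]


def predict_action(verb_probs, noun_probs):
    """Pick the most likely verb and noun independently (an assumption)."""
    verb = max(range(len(verb_probs)), key=verb_probs.__getitem__)
    noun = max(range(len(noun_probs)), key=noun_probs.__getitem__)
    return verb, noun


# Toy scores for one clip: 3 verb classes, 2 noun classes.
transformer_verb, transformer_noun = [1.0, 1.2, 0.1], [0.2, 3.0]
conv_verb, conv_noun = [2.5, 0.3, 0.1], [1.0, 1.1]

verb_probs = ensemble([transformer_verb, conv_verb])
noun_probs = ensemble([transformer_noun, conv_noun])
action = predict_action(verb_probs, noun_probs)
```

In this toy case the convolutional model is confident about the verb and the transformer about the noun, so the averaged prediction takes the verb from one and the noun from the other. The optional `weights` argument is a hypothetical knob for favouring a model on the head where it is stronger.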

updated: Wed Jun 09 2021 13:26:02 GMT+0000 (UTC)

published: Wed Jun 09 2021 13:26:02 GMT+0000 (UTC)

arXiv
