SVFormer: Semi-supervised Video Transformer for Action Recognition

Zhen Xing; Qi Dai; Han Hu; Jingjing Chen; Zuxuan Wu; Yu-Gang Jiang

Semi-supervised action recognition is a challenging but critical task because video annotation is expensive. Existing approaches mainly use convolutional neural networks, while recent vision transformer models remain less explored. In this paper, we investigate the use of transformer models under the semi-supervised learning (SSL) setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown to be effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy tailored to video data, Tube TokenMix, in which video clips are mixed via a mask whose masked tokens are consistent along the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets, Kinetics-400, UCF-101, and HMDB-51, verify the advantage of SVFormer. In particular, SVFormer outperforms the state of the art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. We hope our method can serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
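The three components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names, the momentum value, the mask ratio, and the way warped durations are sampled are all assumptions chosen for illustration. The sketch shows an exponential-moving-average teacher update, a "tube" mask that repeats one spatial token mask across all frames before mixing two clips, and a temporal warp that repeats a few selected frames for random durations.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.99):
    """EMA teacher update (assumed form): teacher <- m * teacher + (1 - m) * student."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k] for k in teacher}


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Sample one spatial token mask and repeat it over every frame.

    Repeating the same mask along time is what makes the masked tokens
    consistent over the temporal axis. The uniform random choice of
    masked positions is an assumption.
    """
    num_tokens = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_tokens))
    flat = np.zeros(num_tokens, dtype=bool)
    flat[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    spatial = flat.reshape(grid_h, grid_w)
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()


def tube_token_mix(tokens_a, tokens_b, mask):
    """Mix two clips of shape (T, H, W, C): clip B where mask is True, clip A elsewhere."""
    return np.where(mask[..., None], tokens_b, tokens_a)


def temporal_warp_indices(num_frames, num_selected, rng):
    """Return frame indices for a warped clip of the original length.

    A few frames are selected and each is repeated for a random duration
    (at least one step) so the durations sum to num_frames. This sampling
    scheme is hypothetical.
    """
    selected = np.sort(rng.choice(num_frames, size=num_selected, replace=False))
    cuts = np.sort(rng.choice(np.arange(1, num_frames), size=num_selected - 1, replace=False))
    durations = np.diff(np.concatenate(([0], cuts, [num_frames])))
    return np.repeat(selected, durations)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_a = rng.normal(size=(8, 14, 14, 16))  # (frames, token rows, token cols, channels)
    clip_b = rng.normal(size=(8, 14, 14, 16))
    mask = tube_mask(8, 14, 14, mask_ratio=0.5, rng=rng)
    mixed = tube_token_mix(clip_a, clip_b, mask)
    warped = clip_a[temporal_warp_indices(8, 3, rng)]
    print(mixed.shape, warped.shape)
```

In a pseudo-labeling setup of this kind, the teacher would typically label the unlabeled clips and the student would be trained on the mixed or warped versions; how the targets of two mixed clips are combined is not stated in the abstract and is left out here.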

updated: Wed Nov 23 2022 18:58:42 GMT+0000 (UTC)

published: Wed Nov 23 2022 18:58:42 GMT+0000 (UTC)

arXiv
