M&M Mix: A Multimodal Multiview Transformer Ensemble

Xuehan Xiong; Anurag Arnab; Arsha Nagrani; Cordelia Schmid

This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds on our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission is an ensemble of Multimodal MTV (M&M) models that vary in backbone size and input modalities. It achieved 52.8% Top-1 accuracy for action classes on the test set, 4.1% higher than last year's winning entry.

updated: Mon Jun 20 2022 15:31:13 GMT+0000 (UTC)

published: Mon Jun 20 2022 15:31:13 GMT+0000 (UTC)

arXiv
