PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Valerii Likhosherstov; Anurag Arnab; Krzysztof Choromanski; Mario Lucic; Yi Tay; Adrian Weller; Mostafa Dehghani

Can we train a single transformer model capable of processing multiple modalities and datasets, whilst sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio and video, which answers this question. By co-training on different tasks within a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on five standard video- and audio-classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient and learns representations that generalize across multiple domains. Moreover, we show that co-training is simple and practical to implement: we do not need to tune hyperparameters for each combination of datasets, but can simply adapt those from standard, single-task training.
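To make the idea of sharing almost all learnable parameters concrete, here is a minimal Python sketch of one plausible co-training setup. It is not taken from the abstract: the linear per-task heads, the rule of sampling one task per step in proportion to a step budget, and every name and number (`Task`, `shared_fraction`, `cotraining_schedule`, the dataset names, the encoder size) are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Task:
    name: str         # hypothetical dataset name
    modality: str     # "image", "audio" or "video"
    num_classes: int  # size of the task's label space
    steps: int        # assumed single-task training budget, in steps


def head_params(hidden_dim: int, num_classes: int) -> int:
    """Parameter count of a linear classification head (weights plus biases)."""
    return hidden_dim * num_classes + num_classes


def shared_fraction(encoder_params: int, hidden_dim: int, tasks: list[Task]) -> float:
    """Fraction of all parameters that sit in the shared encoder."""
    heads = sum(head_params(hidden_dim, t.num_classes) for t in tasks)
    return encoder_params / (encoder_params + heads)


def cotraining_schedule(tasks: list[Task], total_steps: int, seed: int = 0) -> list[str]:
    """Choose one task per training step, weighted by its step budget (an assumption)."""
    rng = random.Random(seed)
    names = [t.name for t in tasks]
    weights = [t.steps for t in tasks]
    return rng.choices(names, weights=weights, k=total_steps)


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    tasks = [
        Task("image_dataset", "image", num_classes=1000, steps=300),
        Task("audio_dataset", "audio", num_classes=50, steps=100),
        Task("video_dataset", "video", num_classes=400, steps=200),
    ]
    print(shared_fraction(86_000_000, 768, tasks))
    print(cotraining_schedule(tasks, total_steps=10))
```

Under these assumptions, each step would update the shared encoder together with only the head of the sampled task, so the task-specific parameters stay a small fraction of the total.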

updated: Thu Nov 25 2021 10:01:05 GMT+0000 (UTC)

published: Thu Nov 25 2021 10:01:05 GMT+0000 (UTC)

arXiv
