Parameter Efficient Multimodal Transformers for Video Representation Learning

Sangho Lee; Youngjae Yu; Gunhee Kim; Thomas Breuel; Jan Kautz; Yale Song

The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the weights of Transformers across layers and modalities: we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and we propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on instance similarity measured in the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
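To make the idea of low-rank parameter sharing concrete, the following is a minimal sketch of how factoring a weight matrix into a shared part and a modality-specific part can shrink the parameter count. It is an illustration under stated assumptions and not the scheme from the paper: the factorization W_m = U V_m, the function names, and the sizes (d = 768, rank 128, 6 layers, 2 modalities) are all hypothetical choices made for this example.

```python
# Illustrative sketch only: the factorization W_m = U @ V_m and every name and
# size below are assumptions for this example, not the paper's actual scheme.

def dense_param_count(d, num_layers, num_modalities):
    """Parameters when every layer of every modality owns its own d x d matrix."""
    return d * d * num_layers * num_modalities


def shared_low_rank_param_count(d, rank, num_modalities):
    """Parameters when one d x rank factor U is shared across all layers and
    modalities, and each modality keeps its own rank x d factor V_m."""
    shared = d * rank
    specific = num_modalities * rank * d
    return shared + specific


def compose_weight(shared_u, specific_v):
    """Build the effective weight U @ V_m from nested lists (plain Python)."""
    rows, inner, cols = len(shared_u), len(specific_v), len(specific_v[0])
    return [
        [sum(shared_u[i][k] * specific_v[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


def reduction_ratio(d, rank, num_layers, num_modalities):
    """Fraction of parameters removed relative to the unshared dense baseline."""
    dense = dense_param_count(d, num_layers, num_modalities)
    shared = shared_low_rank_param_count(d, rank, num_modalities)
    return 1.0 - shared / dense


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the arithmetic.
    print(dense_param_count(768, 6, 2))
    print(shared_low_rank_param_count(768, 128, 2))
    print(reduction_ratio(768, 128, 6, 2))
```

The saving in this toy setup comes from two sources: the shared factor is stored once regardless of depth or modality, and the rank is much smaller than the model dimension. The reduction it yields depends entirely on the assumed sizes and should not be read as reproducing the figure reported in the abstract.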

updated: Tue Dec 08 2020 00:16:13 GMT+0000 (UTC)

published: Tue Dec 08 2020 00:16:13 GMT+0000 (UTC)

arXiv
