Parameter Efficient Multimodal Transformers for Video Representation Learning

Sangho Lee; Youngjae Yu; Gunhee Kim; Thomas Breuel; Jan Kautz; Yale Song

The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities: we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and we propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces the parameters of the Transformers by up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on instance similarity measured in the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.
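
To give a rough feel for why sharing weights across layers and modalities combined with a low-rank factorization can shrink a Transformer so much, the following is a minimal parameter-counting sketch. It is not the authors' scheme: the factorization (one shared `d x rank` factor plus one `rank x d` factor per modality), the function names, and all sizes are assumptions chosen for illustration, and the resulting ratio is not meant to reproduce the 97% figure reported in the abstract.

```python
def full_params(d, num_layers, num_modalities):
    """Weights when every layer of every modality owns its own d x d matrix."""
    return d * d * num_layers * num_modalities


def shared_low_rank_params(d, rank, num_modalities):
    """Weights under a hypothetical sharing scheme (an assumption, not the paper's):
    one d x rank factor reused by all layers and modalities, plus one
    rank x d factor per modality that is reused across that modality's layers."""
    shared_factor = d * rank
    modality_factors = rank * d * num_modalities
    return shared_factor + modality_factors


# Hypothetical sizes: hidden width, factorization rank, depth, and
# two modalities (audio and visual).
d, rank, layers, modalities = 768, 64, 12, 2

full = full_params(d, layers, modalities)
small = shared_low_rank_params(d, rank, modalities)
reduction = 1 - small / full

print(f"unshared: {full} weights, shared low-rank: {small} weights")
print(f"fraction of weights removed: {reduction:.3f}")
```

In this toy setting the savings come from two sources: sharing removes the dependence on the number of layers, and the rank controls how much capacity each remaining factor keeps. A smaller rank saves more parameters at the cost of a coarser approximation of each weight matrix.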

updated: Wed Sep 22 2021 16:19:39 GMT+0000 (UTC)

published: Tue Dec 08 2020 00:16:13 GMT+0000 (UTC)

arXiv
