Cross-Modal Discrete Representation Learning

Alexander H. Liu; SouYoung Jin; Cheng-I Jeff Lai; Andrew Rouditchenko; Aude Oliva; James Glass

Recent advances in representation learning have demonstrated an ability to represent information from different modalities, such as video, text, and audio, in a single high-level embedding vector. In this work, we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across different modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, such that cross-modal object/action localization can be performed without direct supervision. In our experiments, we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
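To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python of a codebook shared across modalities and a loss that pulls two modalities toward similar distributions over its codes. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names, the softmax over negative squared distances, the symmetric cross-entropy used as a stand-in for the Cross-Modal Code Matching objective, and the toy 2-D data.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(features, codebook):
    """Vector quantization: index of the nearest shared code for each feature."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]

def code_distribution(features, codebook):
    """Soft assignment over the shared codes, averaged over fine-grained units.

    Assumption: softmax of negative squared distance; the paper may differ.
    """
    totals = [0.0] * len(codebook)
    for f in features:
        logits = [-sq_dist(f, c) for c in codebook]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        for k, w in enumerate(weights):
            totals[k] += w / z
    return [t / len(features) for t in totals]

def code_matching_loss(p, q, eps=1e-8):
    """Symmetric cross-entropy between two code distributions.

    Illustrative stand-in: small when both modalities favor the same codes.
    """
    ce_pq = -sum(a * math.log(b + eps) for a, b in zip(p, q))
    ce_qp = -sum(b * math.log(a + eps) for a, b in zip(p, q))
    return ce_pq + ce_qp

# Hypothetical toy data: one codebook shared by all modalities.
codebook = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
video_frames = [[0.1, 0.2], [3.9, 0.1]]    # fine-grained units of a video
spoken_words = [[0.0, -0.1], [4.2, 0.0]]   # units of the paired audio
unrelated = [[0.1, 3.8], [-0.2, 4.1]]      # units of an unpaired sample

p_video = code_distribution(video_frames, codebook)
matched = code_matching_loss(p_video, code_distribution(spoken_words, codebook))
mismatched = code_matching_loss(p_video, code_distribution(unrelated, codebook))
```

In this toy setup the paired video and audio units fall on the same codes, so `matched` is smaller than `mismatched`. A real system would learn both the encoders and the codebook by gradient descent rather than fixing them by hand.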

updated: Thu Jun 10 2021 00:23:33 GMT+0000 (UTC)

published: Thu Jun 10 2021 00:23:33 GMT+0000 (UTC)

arXiv
