Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning

Ricardo Guerrero; Hai Xuan Pham; Vladimir Pavlovic

Computational food analysis (CFA) naturally requires multi-modal evidence about a particular food, e.g., images and recipe text. A key enabler of CFA is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work, we propose a method for food-domain cross-modal shared representation learning that preserves the vast semantic richness present in food data. The proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. We propose using imperfect multilingual translations to regularize the model effectively while also adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.
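
As a rough illustration of what retrieval in a shared representation space involves, the sketch below ranks recipe embeddings against image embeddings and reports recall at K. It is not the authors' implementation: the abstract does not state the similarity measure or the evaluation metrics, so cosine similarity and recall at K are assumptions, the helper names (`cosine`, `retrieval_ranks`, `recall_at_k`) are hypothetical, and the two-dimensional vectors are invented toy data standing in for encoder outputs.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_ranks(queries: Sequence[Sequence[float]],
                    gallery: Sequence[Sequence[float]]) -> List[int]:
    """For query i, return the 1-based rank of gallery item i (its true pair)."""
    ranks = []
    for i, query in enumerate(queries):
        sims = [cosine(query, item) for item in gallery]
        # Rank is 1 plus the number of gallery items scoring strictly higher
        # than the true pair.
        ranks.append(1 + sum(s > sims[i] for s in sims))
    return ranks


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose true pair appears within the top k results."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy data (assumption): row i of each list is a matched image/recipe pair.
image_emb = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0]]
recipe_emb = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.4]]

# Image-to-recipe retrieval: each image embedding queries the recipe gallery.
ranks = retrieval_ranks(image_emb, recipe_emb)
print("ranks:", ranks)
print("recall@1:", recall_at_k(ranks, 1))
print("recall@2:", recall_at_k(ranks, 2))
```

Swapping the two arguments of `retrieval_ranks` gives the recipe-to-image direction. In a real setting the embeddings would come from the trained image and recipe encoders rather than hand-written lists.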

updated: Thu Sep 30 2021 10:53:33 GMT+0000 (UTC)

published: Wed Dec 02 2020 17:27:00 GMT+0000 (UTC)

arXiv
