Learning Joint Embedding with Modality Alignments for Cross-Modal Retrieval of Recipes and Food Images

Zhongwei Xie; Ling Liu; Lin Li; Luo Zhong

This paper presents JEMA, a three-tier modality alignment approach to learning a text-image joint embedding for cross-modal retrieval of cooking recipes and food images. The first tier improves the recipe text embedding by optimizing the LSTM networks with sequence patterns enhanced by term extraction and ranking, and improves the image embedding by combining the ResNeXt-101 image encoder with a category embedding built using wideResNet-50 and word2vec. The second tier optimizes the textual-visual joint embedding loss function using a double batch-hard triplet loss with soft-margin optimization. The third tier incorporates two types of cross-modality alignment as auxiliary loss regularizations to further reduce alignment errors in the joint learning of the two modality-specific embedding functions. The category-based cross-modal alignment aligns the image category with the recipe category as a loss regularization on the joint embedding. The cross-modal discriminator-based alignment adds alignment of the visual and textual embedding distributions to further regularize the joint embedding loss. Extensive experiments on Recipe1M, a benchmark dataset of one million recipes, show that the proposed JEMA approach outperforms state-of-the-art cross-modal embedding methods on both image-to-recipe and recipe-to-image retrieval.
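
As an illustration of the second tier, the following is a minimal sketch of a batch-hard triplet loss with a soft margin, applied in both retrieval directions. The abstract does not give the exact formulation, so several choices here are assumptions: cosine distance as the metric, softplus ln(1 + exp(x)) as the soft margin, "double" read as summing the image-to-recipe and recipe-to-image terms, and one matching recipe per image in each batch. All function names are hypothetical and this is not the authors' implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def softplus(x):
    """Numerically stable ln(1 + exp(x)), used as a soft margin (assumption)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_hard_direction(anchors, candidates):
    """Batch-hard triplet loss for one retrieval direction.

    Assumes anchors[i] is paired with candidates[i]; every other
    candidate in the batch is treated as a negative, and only the
    closest (hardest) negative contributes to the loss.
    """
    n = len(anchors)
    if n < 2 or n != len(candidates):
        raise ValueError("need two equal-length batches with at least 2 pairs")
    total = 0.0
    for i in range(n):
        d_pos = cosine_distance(anchors[i], candidates[i])
        d_neg = min(
            cosine_distance(anchors[i], candidates[j])
            for j in range(n)
            if j != i
        )
        total += softplus(d_pos - d_neg)
    return total / n


def double_batch_hard_triplet_loss(image_emb, recipe_emb):
    """Sum of the image-to-recipe and recipe-to-image terms (assumed reading of 'double')."""
    return (batch_hard_direction(image_emb, recipe_emb)
            + batch_hard_direction(recipe_emb, image_emb))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_recipes = [[1.0, 0.0], [0.0, 1.0]]
    swapped_recipes = [[0.0, 1.0], [1.0, 0.0]]
    print(double_batch_hard_triplet_loss(images, aligned_recipes))
    print(double_batch_hard_triplet_loss(images, swapped_recipes))
```

In this toy batch the aligned pairing yields a lower loss than the swapped pairing, which is the behavior a joint embedding objective of this kind is meant to reward. Unlike a fixed-margin hinge, the softplus term never reaches exactly zero, so well-separated pairs still contribute a small gradient.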

updated: Mon Aug 09 2021 03:11:54 GMT+0000 (UTC)

published: Mon Aug 09 2021 03:11:54 GMT+0000 (UTC)

arXiv
