Learning TFIDF Enhanced Joint Embedding for Recipe-Image Cross-Modal Retrieval Service

Zhongwei Xie; Ling Liu; Yanzhao Wu; Lin Li; Luo Zhong

Learning joint embeddings of recipes and images is widely acknowledged to be challenging because ingredients vary in composition and are deformed during cooking. We present a Multi-modal Semantics enhanced Joint Embedding approach (MSJE) that learns a common feature space between the two modalities (text and image), with the ultimate goal of providing high-performance cross-modal retrieval services. MSJE has three distinguishing features. First, we extract the TFIDF feature from the title, ingredients, and cooking instructions of each recipe. By combining LSTM-learned features with their TFIDF features to determine the significance of word sequences, we encode a recipe into a TFIDF-weighted vector that captures significant key terms and how those terms are used in the corresponding cooking instructions. Second, we combine the recipe TFIDF feature with the recipe sequence feature extracted through two-stage LSTM networks, which is effective in capturing the unique relationship between a recipe and its associated image(s). Third, we incorporate TFIDF-enhanced category semantics to improve the mapping of the image modality and to regulate the similarity loss function during the iterative learning of the cross-modal joint embedding. Experiments on the benchmark dataset Recipe1M show that the proposed approach outperforms state-of-the-art approaches.
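The first step in the abstract, extracting a TFIDF feature from a recipe's title, ingredients, and instructions, can be illustrated with a minimal sketch. It assumes a common smoothed TF-IDF weighting with L2 normalisation and whitespace tokenisation; the abstract does not state the exact weighting or preprocessing, so the function name `tfidf_vectors` and the toy recipes below are hypothetical and only show the general idea.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one L2-normalised TF-IDF vector per document.

    Assumption: smoothed IDF, idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    This is one common variant, not necessarily the one used in the paper.
    """
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(tokenised)

    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Guard against an empty document producing a zero norm.
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors


# Hypothetical toy recipes: title, ingredients and instructions concatenated.
recipes = [
    "tomato soup tomato onion simmer tomato and onion",
    "onion rings onion flour fry onion in oil",
]
vocab, vecs = tfidf_vectors(recipes)
```

In this toy corpus, "tomato" appears only in the first recipe, so it receives a larger weight there than "onion", which appears in both. A vector of this kind is what the abstract describes combining with the LSTM sequence feature; how the two are fused is not specified in the abstract and is not shown here.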

updated: Mon Aug 02 2021 08:49:30 GMT+0000 (UTC)

published: Mon Aug 02 2021 08:49:30 GMT+0000 (UTC)

arXiv
