Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval

Mustafa Shukor; Nicolas Thome; Matthieu Cord

Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with more structured input data, such as cooking applications, remains largely unexplored. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs into image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then to finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g., CLIP) to provide local and global textual context. VLPCook outperforms the current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. We conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. Finally, we validate the generalization of the approach to other tasks (e.g., Food Recognition) and to other domains with structured text, such as the Medical domain on the ROCO dataset. The code is available here: https://github.com/mshukor/VLPCook
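
For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@K is commonly computed for cross-modal retrieval. It is not taken from the VLPCook repository; the function name `recall_at_k` and the toy similarity matrix are hypothetical, and the sketch assumes that query i (an image) has exactly one correct match, candidate i (its recipe). An "absolute improvement" of +3.3 Recall@1 then means a gain of 3.3 percentage points in this score.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose correct match ranks in the top k.

    similarity[i][j] is the score between query i (e.g. an image) and
    candidate j (e.g. a recipe). Assumption for this sketch: the correct
    candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the correct candidate = number of candidates scoring
        # strictly higher than it (0 means it is ranked first).
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 image-to-recipe similarity matrix (made-up values).
toy_similarity = [
    [0.9, 0.1, 0.2],  # query 0: correct recipe ranked first
    [0.3, 0.2, 0.8],  # query 1: correct recipe ranked last
    [0.1, 0.4, 0.7],  # query 2: correct recipe ranked first
]

print(recall_at_k(toy_similarity, k=1))
print(recall_at_k(toy_similarity, k=3))
```

In the toy matrix, two of the three queries retrieve their own recipe first, so Recall@1 is two thirds (as a percentage), while Recall@3 is 100 because every correct recipe falls within the top three.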

updated: Wed Mar 15 2023 19:48:24 GMT+0000 (UTC)

published: Thu Dec 08 2022 13:37:17 GMT+0000 (UTC)

arXiv
