ILLUME: Rationalizing Vision-Language Models by Interacting with their Jabber

Manuel Brack; Patrick Schramowski; Björn Deiseroth; Kristian Kersting

Bootstrapping from pre-trained language models has proven to be an efficient approach for building foundation vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, it is difficult, if not impossible, to use this approach to make the model conform to users' rationales for specific answers. To elicit and reinforce commonsense reasons, we propose an iterative sampling and tuning paradigm, called ILLUME, that executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
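The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the names `illume_loop`, `ToyVLM`, `sample`, `critic`, and `fine_tune` are hypothetical, the human critic's preference selection is simplified to an accept/reject decision per candidate, and the "model" is a toy stand-in whose chance of producing an acceptable rationale is assumed to rise after each tuning round.

```python
import random
from typing import Callable, List, Tuple

Prompt = Tuple[str, str, str]  # (image_id, question, answer); hypothetical layout
Example = Tuple[Prompt, str]   # a prompt paired with an accepted rationale


def illume_loop(
    prompts: List[Prompt],
    sample: Callable[[Prompt, int], List[str]],
    critic: Callable[[Prompt, str], bool],
    fine_tune: Callable[[List[Example]], None],
    iterations: int = 3,
    num_candidates: int = 5,
) -> List[Example]:
    """Sample rationales, keep the ones the critic accepts, then fine-tune."""
    training_data: List[Example] = []
    for _ in range(iterations):
        for prompt in prompts:
            # Step 1: the model samples several candidate rationales.
            for rationale in sample(prompt, num_candidates):
                example = (prompt, rationale)
                # Step 2: the critic selects acceptable candidates
                # (assumption: a yes/no decision per candidate).
                if critic(prompt, rationale) and example not in training_data:
                    training_data.append(example)
        # Step 3: fine-tune on everything accepted so far.
        if training_data:
            fine_tune(training_data)
    return training_data


class ToyVLM:
    """Toy stand-in for a VLM; it does not model a real network."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.p_good = 0.2       # assumed chance of a good rationale
        self.tuning_rounds = 0

    def sample(self, prompt: Prompt, n: int) -> List[str]:
        _, _, answer = prompt
        candidates = []
        for _ in range(n):
            quality = "good" if self.rng.random() < self.p_good else "bad"
            candidates.append(f"{quality} rationale for '{answer}'")
        return candidates

    def fine_tune(self, data: List[Example]) -> None:
        # Assumption: each tuning round makes good rationales more likely.
        self.tuning_rounds += 1
        self.p_good = min(1.0, self.p_good + 0.2)


if __name__ == "__main__":
    model = ToyVLM()
    prompts = [("img_1", "Why is the street wet?", "It rained")]
    accepted = illume_loop(
        prompts,
        sample=model.sample,
        critic=lambda prompt, rationale: rationale.startswith("good"),
        fine_tune=model.fine_tune,
    )
    print(len(accepted), "accepted examples;", model.tuning_rounds, "tuning rounds")
```

In this sketch the critic only filters candidates and never writes rationales, which mirrors the "minimal feedback" idea. The accepted set grows across iterations, which corresponds to the abstract's remark that the loop increases the training data.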

updated: Wed Aug 17 2022 11:41:43 GMT+0000 (UTC)

published: Wed Aug 17 2022 11:41:43 GMT+0000 (UTC)

arXiv
