Task Bias in Vision-Language Models

Sachit Menon; Ishaan Preetam Chandratreya; Carl Vondrick

Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to the task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism to steer visual representations towards the desired task.
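The idea of an input-independent visual prompt can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the prompt's form or the training setup, so everything below is an assumption made for illustration. A frozen random linear map stands in for the CLIP image encoder, random vectors stand in for the text embeddings of one task, and the prompt is a single additive vector `delta` shared by every input. The names `learn_prompt`, `task_loss`, and all sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def task_loss(A, X, y, delta):
    """Mean cross-entropy of the frozen scorer A on prompted inputs X + delta."""
    p = softmax((X + delta) @ A.T)
    return float(-np.log(p[np.arange(len(y)), y] + 1e-12).mean())

def learn_prompt(A, X, y, steps=500, lr=0.02):
    """Gradient descent on one prompt vector shared by every input.

    Only delta is updated; the scorer A stays frozen throughout.
    """
    delta = np.zeros(X.shape[1])
    onehot = np.eye(A.shape[0])[y]
    for _ in range(steps):
        p = softmax((X + delta) @ A.T)
        grad = ((p - onehot) @ A).mean(axis=0)  # d(loss)/d(delta)
        delta -= lr * grad
    return delta

rng = np.random.default_rng(0)
n, d, k, c = 64, 16, 8, 3                  # hypothetical toy sizes
W = rng.normal(size=(k, d)) / np.sqrt(d)   # frozen stand-in for an image encoder
T = rng.normal(size=(c, k))                # frozen stand-in for one task's text embeddings
A = T @ W                                  # frozen end-to-end scorer
X = rng.normal(size=(n, d))                # stand-in "images"

# Labels follow the scorer plus a fixed offset, so the unprompted
# scorer is systematically biased away from the task of interest.
offset = np.array([2.0, 0.0, -2.0])
y = np.argmax(X @ A.T + offset, axis=1)

delta = learn_prompt(A, X, y)
loss_before = task_loss(A, X, y, np.zeros(d))
loss_after = task_loss(A, X, y, delta)
print(f"loss without prompt: {loss_before:.3f}, with prompt: {loss_after:.3f}")
```

In this toy setting the same `delta` is added to every input, which mirrors the abstract's claim that a prompt can be independent of the image while still steering the frozen representation towards the desired task. A real experiment would replace the linear stand-ins with the actual CLIP encoders and an image-space prompt.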

updated: Thu Dec 08 2022 17:10:31 GMT+0000 (UTC)

published: Thu Dec 08 2022 17:10:31 GMT+0000 (UTC)

arXiv
