Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

William Berrios; Gautam Mittal; Tristan Thrush; Douwe Kiela; Amanpreet Singh

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.
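As a rough illustration of the modular idea described in the abstract, the sketch below shows how text outputs from independent vision modules could be flattened into one prompt for a text-only LLM. It is not the implementation from the LENS repository: the module names ("Tags", "Captions"), the prompt layout, and the functions `build_prompt`, `answer`, `stub_tagger`, `stub_captioner`, and `stub_llm` are all hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not the LENS API):
# a vision module maps an image path to a list of text descriptions,
# and a language model maps a prompt string to a completion string.
VisionModule = Callable[[str], List[str]]
LanguageModel = Callable[[str], str]


def build_prompt(descriptions: Dict[str, List[str]], question: str) -> str:
    """Flatten per-module text outputs into a single text-only prompt."""
    lines = [f"{name}: {', '.join(items)}" for name, items in descriptions.items()]
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


def answer(
    image_path: str,
    modules: Dict[str, VisionModule],
    llm: LanguageModel,
    question: str,
) -> str:
    """Run each vision module independently, then let the LLM reason over the text."""
    descriptions = {name: module(image_path) for name, module in modules.items()}
    return llm(build_prompt(descriptions, question))


# Stubs standing in for real vision models and a real LLM.
def stub_tagger(image_path: str) -> List[str]:
    return ["dog", "frisbee"]


def stub_captioner(image_path: str) -> List[str]:
    return ["a dog catching a frisbee in a park"]


def stub_llm(prompt: str) -> str:
    return "dog" if "dog" in prompt else "unknown"


if __name__ == "__main__":
    modules = {"Tags": stub_tagger, "Captions": stub_captioner}
    print(answer("example.jpg", modules, stub_llm, "What animal is in the image?"))
```

Because the language model only ever sees text in this sketch, swapping in a different LLM or adding another vision module would not require any multimodal training, which is the property the abstract emphasizes.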

updated: Wed Jun 28 2023 17:57:10 GMT+0000 (UTC)

published: Wed Jun 28 2023 17:57:10 GMT+0000 (UTC)

arXiv
