Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Xinlong Wang; Wen Wang; Yue Cao; Chunhua Shen; Tiejun Huang

イメージはイメージで語る: 状況に応じた視覚学習のためのジェネラリストペインター

NLP の新しいパラダイムであるコンテキスト内学習により、モデルはほんの一握りのプロンプトと例でさまざまなタスクに迅速に適応できます。しかし、コンピュータービジョンでは、タスクの出力表現が大きく異なるという点で、コンテキスト内学習の難しさがあります。したがって、ビジョンモデルが理解してドメイン外に転送できる汎用タスクプロンプトを定義する方法が不明です。タスク。この作業では、「イメージ」中心のソリューションでこれらの障害に対処するジェネラリストモデルである Painter を紹介します。つまり、コアビジョンタスクの出力をイメージとして再定義し、タスクプロンプトもイメージとして指定します。このアイデアにより、トレーニングプロセスは非常にシンプルになり、入力画像と出力画像のペアのステッチに対して標準のマスクされた画像モデリングを実行します。これにより、モデルは可視画像パッチで条件付けされたタスクを実行できるようになります。したがって、推論中に、入力条件と同じタスクからの入力画像と出力画像のペアを採用して、どのタスクを実行するかを示すことができます。ベルとホイッスルがなければ、当社のジェネラリスト Painter は、高レベルの視覚的理解から低レベルの画像処理までの 7 つの代表的な視覚タスクで、確立されたタスク固有のモデルと比較して競争力のあるパフォーマンスを達成できます。 Painter は、いくつかの困難なタスクにおいて、最近のジェネラリストモデルよりも大幅に優れています。驚いたことに、私たちのモデルは、オープンカテゴリのキーポイント検出やオブジェクトのセグメンテーションなど、トレーニングデータには存在しないドメイン外のタスクを完了する機能を示しており、コンテキスト内学習の強力なタスク転送可能性を検証しています。

In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulties for in-context learning lie in that tasks vary significantly in the output representations, thus it is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and specify task prompts as also images. With this idea, our training process is extremely simple, which performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter can achieve competitive performance compared to well-established task-specific models, on seven representative vision tasks ranging from high-level visual understanding to low-level image processing. Painter significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows capabilities of completing out-of-domain tasks, which do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.

updated: Mon Dec 05 2022 18:59:50 GMT+0000 (UTC)

published: Mon Dec 05 2022 18:59:50 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト