Im-Promptu: In-Context Composition from Image Prompts

Bhishma Dedhia; Michael Chang; Jake C. Snell; Thomas L. Griffiths; Niraj K. Jha

Large language models are few-shot learners that can solve diverse tasks from a handful of demonstrations. This implicit understanding of tasks suggests that the attention mechanisms over word tokens may play a role in analogical reasoning. In this work, we investigate whether analogical reasoning can enable in-context composition over composable elements of visual stimuli. First, we introduce a suite of three benchmarks to test the generalization properties of a visual in-context learner. We formalize the notion of an analogy-based in-context learner and use it to design a meta-learning framework called Im-Promptu. Whereas the requisite token granularity for language is well established, the appropriate compositional granularity for enabling in-context generalization in visual stimuli is usually unspecified. To this end, we use Im-Promptu to train multiple agents with different levels of compositionality, including vector representations, patch representations, and object slots. Our experiments reveal tradeoffs between extrapolation abilities and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks. Patch-based representations require patches to contain entire objects for robust extrapolation. At the same time, object-centric tokenizers coupled with a cross-attention module generate consistent and high-fidelity solutions, with these inductive biases being particularly crucial for compositional generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive programming interface for image generation.
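As a toy illustration of what an analogy-based learner over vector representations might look like, the sketch below completes an analogy of the assumed form A : B :: C : D by inferring an offset from the demonstration pair (A, B) and applying it to the query C. This is not the Im-Promptu model: the framework described above learns its components by meta-learning, whereas this sketch substitutes fixed vector arithmetic. The function names `infer_rule`, `apply_rule`, and `complete_analogy`, and the two-dimensional attribute vectors, are hypothetical.

```python
from typing import List

Vector = List[float]


def infer_rule(a: Vector, b: Vector) -> Vector:
    """Hypothetical rule inference: the offset that turns A into B."""
    return [bi - ai for ai, bi in zip(a, b)]


def apply_rule(rule: Vector, c: Vector) -> Vector:
    """Hypothetical rule execution: add the inferred offset to the query C."""
    return [ci + ri for ci, ri in zip(c, rule)]


def complete_analogy(a: Vector, b: Vector, c: Vector) -> Vector:
    """Complete A : B :: C : ? using fixed vector arithmetic (an assumption,
    standing in for learned inference and execution components)."""
    return apply_rule(infer_rule(a, b), c)


# Toy attribute vectors, assumed layout: [hue, size].
a = [0.0, 1.0]  # demonstration input
b = [1.0, 1.0]  # demonstration output: hue changed, size unchanged
c = [0.0, 2.0]  # query with a different size
d = complete_analogy(a, b, c)
print(d)  # [1.0, 2.0]
```

A single unstructured vector like this corresponds to the non-compositional end of the spectrum discussed above; patch-based and object-slot representations would replace the one vector with a set of tokens.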

updated: Fri May 26 2023 21:10:11 GMT+0000 (UTC)

published: Fri May 26 2023 21:10:11 GMT+0000 (UTC)

arXiv
