SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Yi-Syuan Chen; Yun-Zhu Song; Cheng Yu Yeo; Bei Liu; Jianlong Fu; Hong-Han Shuai

Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues from the language domain, such as template sensitivity and hallucination. In addition, the scale of these language models raises a significant demand for computation, making them resource-intensive to train and operate. We therefore raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?" To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), that introduces a meta-model to learn on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings. Furthermore, the design of SINC helps us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the components essential for the emergence of in-context learning in the vision-language domain.
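As a rough illustration of what "demonstrations presented in the inputs" can look like, the sketch below flattens a few labeled image-text examples and one unlabeled query into a single input sequence, which a model would then complete without any gradient update. Every name here (`Demonstration`, `build_prompt`, the `<predict>` placeholder, the file names) is a hypothetical chosen for illustration; the abstract does not specify SINC's actual prompt format, so this should be read as a generic few-shot prompt layout rather than the authors' method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Demonstration:
    """One labeled example (hypothetical structure, not from the paper)."""
    image_ref: str  # stand-in for an image or its extracted features
    text: str       # accompanying text, e.g. a question
    label: str      # the known answer for this example


def build_prompt(
    demos: List[Demonstration], query_image_ref: str, query_text: str
) -> List[Tuple[str, str]]:
    """Flatten demonstrations and a query into one (kind, value) sequence."""
    prompt: List[Tuple[str, str]] = []
    for demo in demos:
        prompt.append(("image", demo.image_ref))
        prompt.append(("text", demo.text))
        prompt.append(("label", demo.label))
    # The query reuses the same layout, but its label slot is left open.
    prompt.append(("image", query_image_ref))
    prompt.append(("text", query_text))
    prompt.append(("label", "<predict>"))
    return prompt


demos = [
    Demonstration("img_001.jpg", "What animal is shown?", "cat"),
    Demonstration("img_002.jpg", "What animal is shown?", "dog"),
]
prompt = build_prompt(demos, "img_003.jpg", "What animal is shown?")
```

In this layout the number of demonstrations is the "shot" count of the few-shot setting: adding or removing entries in `demos` changes the context the model conditions on, while the model's parameters stay fixed.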

updated: Sat Aug 19 2023 08:27:16 GMT+0000 (UTC)

published: Sat Jul 15 2023 08:33:08 GMT+0000 (UTC)

arXiv
