Exploring Effective Factors for Improving Visual In-Context Learning

Yanpeng Sun; Qiang Chen; Jian Wang; Jingdong Wang; Zechao Li

In-context learning (ICL) aims to understand a new task from a few demonstrations (i.e., prompts) and to make predictions on new inputs without tuning the model. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that prompt selection and prompt fusion are two major factors that directly affect the inference performance of visual in-context learning. Prompt selection is the process of identifying the most appropriate prompt or example to help the model understand a new task. This matters because relevant prompts help the model learn more effectively and efficiently. Prompt fusion combines knowledge from different positions within the large-scale visual model, so that the model can leverage the diverse knowledge stored in its different parts to improve performance on new tasks. Based on these findings, we propose prompt-SelF, a simple framework for visual in-context learning. Specifically, we first use a pixel-level retrieval method to select a suitable prompt, then use different prompt fusion methods to activate all the knowledge stored in the large-scale model, and finally ensemble the predictions obtained from the different prompt fusion methods to obtain the final prediction. We conduct extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF outperforms OSLSM-based meta-learning in 1-shot segmentation for the first time, which indicates the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.
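
The three stages named in the abstract (pixel-level prompt retrieval, prediction under several prompt fusion arrangements, and ensembling) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the similarity measure (cosine similarity over flattened, spatially aligned feature maps), the averaging ensemble, and every name below (`pixel_level_similarity`, `select_prompt`, `ensemble_masks`, `predict_with_fusion`, and the `predict_fn` and `layouts` arguments) are assumptions chosen for illustration only.

```python
import numpy as np


def pixel_level_similarity(query_feat, cand_feat):
    """Cosine similarity between two (H, W, C) feature maps.

    Assumption: flattening without spatial pooling keeps the comparison
    position-by-position, which is what "pixel-level" is taken to mean here.
    """
    q = np.asarray(query_feat, dtype=float).reshape(-1)
    c = np.asarray(cand_feat, dtype=float).reshape(-1)
    return float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-8))


def select_prompt(query_feat, candidate_feats):
    """Return the index of the candidate prompt most similar to the query."""
    scores = [pixel_level_similarity(query_feat, f) for f in candidate_feats]
    return int(np.argmax(scores))


def ensemble_masks(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities, then threshold.

    Assumption: a plain mean stands in for whatever ensembling the paper uses.
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob >= threshold


def predict_with_fusion(query_feat, candidate_feats, predict_fn, layouts):
    """Select one prompt, predict once per fusion layout, ensemble the results.

    `predict_fn(prompt_index, layout)` is a hypothetical stand-in for the
    large-scale vision model; it must return an (H, W) probability map.
    """
    idx = select_prompt(query_feat, candidate_feats)
    prob_maps = [predict_fn(idx, layout) for layout in layouts]
    return idx, ensemble_masks(prob_maps)
```

In this sketch the model call is injected as `predict_fn`, so the selection and ensembling logic can be exercised without committing to any particular vision backbone or fusion arrangement.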

updated: Mon Apr 10 2023 17:59:04 GMT+0000 (UTC)

published: Mon Apr 10 2023 17:59:04 GMT+0000 (UTC)

arXiv
