Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Shuhuai Ren; Aston Zhang; Yi Zhu; Shuai Zhang; Shuai Zheng; Mu Li; Alex Smola; Xu Sun

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt has strong transferability and can be plugged directly into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).

updated: Mon Apr 10 2023 16:45:30 GMT+0000 (UTC)

published: Mon Apr 10 2023 16:45:30 GMT+0000 (UTC)

arXiv
