A Simple Long-Tailed Recognition Baseline via Vision-Language Model

Teli Ma; Shijie Geng; Mengmeng Wang; Jing Shao; Jiasen Lu; Hongsheng Li; Peng Gao; Yu Qiao

The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems. Existing approaches either apply class re-balancing strategies or directly improve network modules to address the problem. However, they still train models with a finite set of predefined labels, which limits their supervision information and restricts their transferability to novel instances. Recent advances in large-scale contrastive vision-language pretraining shed light on a new pathway for visual recognition. With open-vocabulary supervision, pretrained contrastive vision-language models learn powerful multimodal representations that show promise for handling data deficiency and unseen concepts. By calculating the semantic similarity between visual and text inputs, visual recognition is converted into a vision-language matching problem. Inspired by this, we propose BALLAD to leverage contrastive vision-language models for long-tailed recognition. We first continue pretraining the vision-language backbone through contrastive learning on a specific long-tailed target dataset. Afterward, we freeze the backbone and employ an additional adapter layer to enhance the representations of tail classes on balanced training samples built with re-sampling strategies. Extensive experiments have been conducted on three popular long-tailed recognition benchmarks. Our simple and effective approach sets a new state of the art and outperforms competitive baselines by a large margin. Code is released at https://github.com/gaopengcuhk/BALLAD.
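
As a rough illustration of two ideas in the abstract, the sketch below shows (1) recognition as vision-language matching, where an image embedding is scored against one text embedding per class by cosine similarity, and (2) one common re-sampling scheme, class-balanced sampling, which gives every class the same total sampling probability. This is not the BALLAD implementation. The function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all assumptions made for illustration, and the abstract does not say which re-sampling strategy the authors actually use.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_probabilities(image_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities.

    The temperature value is an assumed placeholder, not taken from the paper.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_balanced_weights(labels):
    """Per-sample weights under which each class is drawn equally often."""
    counts = Counter(labels)
    num_classes = len(counts)
    return [1.0 / (num_classes * counts[y]) for y in labels]


# Hypothetical embeddings: one text embedding per class, one image embedding.
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_emb = [0.1, 0.9]
probs = match_probabilities(image_emb, text_embs)
predicted = max(range(len(probs)), key=probs.__getitem__)

# Hypothetical long-tailed label list: 8 head samples, 2 tail samples.
labels = ["head"] * 8 + ["tail"] * 2
weights = class_balanced_weights(labels)
```

With these weights, each of the two tail samples is drawn four times as often as each head sample, so both classes contribute equally to a training batch in expectation. In a real pipeline the embeddings would come from the pretrained image and text encoders rather than hand-written lists.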

updated: Mon Nov 29 2021 17:49:24 GMT+0000 (UTC)

published: Mon Nov 29 2021 17:49:24 GMT+0000 (UTC)

arXiv
