iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

Yixuan Wei; Yue Cao; Zheng Zhang; Zhuliang Yao; Zhenda Xie; Han Hu; Baining Guo

Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade. Visual learning through image-text alignment, however, has emerged with promising performance, especially for zero-shot recognition. We believe that these two learning tasks are complementary, and suggest combining them for better visual learning. Rather than a shallow fusion through naive multi-task learning, we propose a deep fusion method with three adaptations that effectively bridge the two learning tasks. First, we replace the previous common practice in image classification, a linear classifier, with a cosine classifier, which shows comparable performance. Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder that acts as a meta network generating the category classifier weights. The learnt text encoder is shared between image classification and image-text alignment. Third, we enrich each class name with a description to avoid confusion between classes and to bring the classification method closer to image-text alignment. We demonstrate that this deep fusion approach performs better than the individual learning or shallow fusion approaches on a variety of visual recognition tasks and setups, from zero-shot/few-shot image classification, such as the Kornblith 12-dataset benchmark, to downstream tasks of action recognition, semantic segmentation, and object detection in fine-tuning and open-vocabulary settings. The code will be available at https://github.com/weiyx16/iCAR.
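
The three adaptations described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_text_encoder` is a hypothetical stand-in for a learnt text encoder, the `temperature` parameter and its default value are assumptions, and the class descriptions are made up for illustration. It only shows the general shape of the idea: class weights are produced by encoding "class name plus description" text, and logits are cosine similarities between the image feature and those weights.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(image_feature, class_weights, temperature=0.07):
    """Cosine classifier: logits are cosine similarities divided by a temperature.

    The temperature value is an assumption made for this sketch.
    """
    feature = l2_normalize(image_feature)
    logits = []
    for weight in class_weights:
        unit_weight = l2_normalize(weight)
        similarity = sum(a * b for a, b in zip(feature, unit_weight))
        logits.append(similarity / temperature)
    return logits


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_text_encoder(text, dim=8):
    """Hypothetical stand-in for a learnt text encoder (character-count hashing)."""
    vec = [0.0] * dim
    for position, char in enumerate(text.lower()):
        vec[(ord(char) + position) % dim] += 1.0
    return vec


def classifier_weights_from_text(class_names, descriptions, encoder):
    """Generate one classifier weight vector per class from its name and description."""
    return [encoder(f"{name}, {descriptions[name]}") for name in class_names]


if __name__ == "__main__":
    names = ["crane (bird)", "crane (machine)"]
    descriptions = {
        "crane (bird)": "a large long-legged wading bird",
        "crane (machine)": "a tall machine for lifting heavy loads",
    }
    weights = classifier_weights_from_text(names, descriptions, toy_text_encoder)
    image_feature = weights[0]  # pretend the image encoder produced this feature
    print(softmax(cosine_logits(image_feature, weights)))
```

Because the classifier weights come from a text encoder rather than a fixed parameter matrix, the same scoring function can in principle be applied to class names that were never seen during training, which is what makes the classification branch compatible with image-text alignment.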

updated: Fri Apr 22 2022 15:27:21 GMT+0000 (UTC)

published: Fri Apr 22 2022 15:27:21 GMT+0000 (UTC)

arXiv
