OvarNet: Towards Open-vocabulary Object Attribute Recognition

Keyan Chen; Xiaolong Jiang; Yao Hu; Xu Tang; Yan Gao; Jianqi Chen; Weidi Xie

In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr, in which candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes; additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type model end-to-end with knowledge distillation, which performs class-agnostic object proposal and classifies semantic categories and attributes with classifiers generated from a text encoder; finally, (iv) we conduct extensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic category and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attribute prediction largely outperforms existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories.
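The idea of classifying a proposed region with "classifiers generated from a text encoder" can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors below are hand-written stand-ins for CLIP text and region embeddings, and the temperature, the softmax over categories, the sigmoid over attributes, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_scores(region, classifiers):
    """Score one region embedding against each text-derived classifier vector."""
    r = normalize(region)
    return {
        name: sum(a * b for a, b in zip(r, normalize(w)))
        for name, w in classifiers.items()
    }


def softmax(scores, temperature=0.1):
    """Single-label distribution over categories (temperature is an assumed value)."""
    m = max(scores.values())
    exps = {k: math.exp((s - m) / temperature) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def sigmoid(scores, temperature=0.1):
    """Independent per-attribute probabilities, since attributes are multi-label."""
    return {k: 1.0 / (1.0 + math.exp(-s / temperature)) for k, s in scores.items()}


# Hypothetical embeddings: in a real system these would come from a text encoder
# applied to prompts for each category or attribute name.
category_classifiers = {"dog": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
attribute_classifiers = {"furry": [0.9, 0.0, 0.4], "metallic": [0.0, 0.9, -0.4]}

# Hypothetical visual embedding of one class-agnostic region proposal.
region_embedding = [0.8, 0.1, 0.3]

category_probs = softmax(cosine_scores(region_embedding, category_classifiers))
attribute_probs = sigmoid(cosine_scores(region_embedding, attribute_classifiers))

best_category = max(category_probs, key=category_probs.get)
present_attributes = sorted(a for a, p in attribute_probs.items() if p > 0.5)

print(best_category, present_attributes)
```

Because the classifier weights are just embeddings of names, adding a new category or attribute in this sketch only requires adding another entry to the corresponding dictionary, which is the property an open-vocabulary setup relies on.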

updated: Mon Jan 23 2023 15:59:29 GMT+0000 (UTC)

published: Mon Jan 23 2023 15:59:29 GMT+0000 (UTC)

arXiv
