Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

Hexiang Hu; Yi Luan; Yang Chen; Urvashi Khandelwal; Mandar Joshi; Kenton Lee; Kristina Toutanova; Ming-Wei Chang

Large-scale multi-modal pre-trained models such as CLIP and PaLI exhibit strong generalization across various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundation models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets, with all labels grounded to a single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study of state-of-the-art pre-trained models reveals large headroom in generalizing to this massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that were never seen during fine-tuning. We also find that existing pre-trained models have different strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.

updated: Wed Feb 22 2023 05:31:26 GMT+0000 (UTC)

published: Wed Feb 22 2023 05:31:26 GMT+0000 (UTC)

arXiv
