Language Models as Zero-shot Visual Semantic Learners

Yue Jiao; Jonathon Hare; Adam Prügel-Bennett

Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE rely heavily on static word embedding techniques. In this work, we propose a Visual Semantic Embedding Probe (VSEP) designed to probe the semantic information of contextualized word embeddings in visual semantic understanding tasks. We show that the knowledge encoded in transformer language models can be exploited for tasks requiring visual semantic understanding. The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner. We further introduce a zero-shot setting with VSEPs to evaluate a model's ability to associate a novel word with a novel visual category. We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short. We also observe that current visual semantic embedding models lack a mutual exclusivity bias, which limits their performance.

updated: Mon Jul 26 2021 08:22:55 GMT+0000 (UTC)

published: Mon Jul 26 2021 08:22:55 GMT+0000 (UTC)

arXiv
