GlideNet: Global, Local and Intrinsic based Dense Embedding NETwork for Multi-category Attributes Prediction

Kareem Metwaly; Aerin Kim; Elliot Branson; Vishal Monga

Attaching attributes (such as color, shape, state, action) to object categories is an important computer vision problem. Attribute prediction has seen exciting recent progress and is often formulated as a multi-label classification problem. Yet significant challenges remain in: 1) predicting diverse attributes over multiple categories, 2) modeling attribute-category dependencies, 3) capturing both global and local scene context, and 4) predicting attributes of objects with low pixel counts. To address these issues, we propose a novel deep architecture for multi-category attribute prediction named GlideNet, which contains three distinct feature extractors. A global feature extractor recognizes what objects are present in a scene, whereas a local one focuses on the area surrounding the object of interest. Meanwhile, an intrinsic feature extractor uses an extension of standard convolution, dubbed Informed Convolution, to retrieve features of objects with low pixel counts. GlideNet uses gating mechanisms with binary masks and its self-learned category embedding to combine the dense embeddings. Collectively, the Global-Local-Intrinsic blocks comprehend the scene's global context while attending to the characteristics of the local object of interest. Finally, using the combined features, an interpreter predicts the attributes; the length of its output is determined by the category, thereby removing unnecessary attributes. GlideNet achieves compelling results on two recent and challenging datasets for large-scale attribute prediction, VAW and CAR. For instance, it obtains a gain of more than 5% over the state of the art in the mean recall (mR) metric. GlideNet's advantages are especially apparent when predicting attributes of objects with low pixel counts as well as attributes that demand global context understanding. Finally, we show that GlideNet excels in training-starved real-world scenarios.
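As a rough illustration of the mean recall (mR) metric mentioned in the abstract, the sketch below computes recall separately for each attribute and then averages across attributes. It assumes binary ground-truth and binary predicted labels, and it assumes that attributes with no positive examples are skipped; the abstract does not specify these details, so the exact protocol used in the paper may differ. The function name `mean_recall` and the toy data are hypothetical.

```python
from typing import List, Optional


def mean_recall(y_true: List[List[int]], y_pred: List[List[int]]) -> Optional[float]:
    """Average per-attribute recall over attributes with at least one positive.

    y_true and y_pred are lists of rows (one row per object instance); each row
    holds one 0/1 label per attribute. This layout is an assumption made for
    illustration only.
    """
    if not y_true:
        return None
    num_attributes = len(y_true[0])
    recalls = []
    for a in range(num_attributes):
        # True positives and false negatives for attribute a.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[a] == 1 and p[a] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[a] == 1 and p[a] == 0)
        if tp + fn == 0:
            continue  # no positive ground truth: recall is undefined, so skip
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls) if recalls else None


# Toy example: 3 instances, 3 attributes.
truth = [[1, 0, 1], [1, 1, 0], [0, 1, 0]]
preds = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
score = mean_recall(truth, preds)
```

Because every attribute contributes equally to the average regardless of how often it occurs, a metric of this kind rewards models that do well on rare attributes and not only on frequent ones.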

updated: Mon Mar 14 2022 19:10:22 GMT+0000 (UTC)

published: Mon Mar 07 2022 00:32:37 GMT+0000 (UTC)

arXiv
