KE-RCNN: Unifying Knowledge based Reasoning into Part-level Attribute Parsing

Xuanhan Wang; Jingkuan Song; Xiaojia Chen; Lechao Cheng; Lianli Gao; Heng Tao Shen

Part-level attribute parsing is a fundamental but challenging task that requires region-level visual understanding to provide explainable details of body parts. Most existing approaches address this problem by adding a regional convolutional neural network (RCNN) with an attribute prediction head to a two-stage detector, in which attributes of body parts are identified from local part boxes. However, local part boxes carry limited visual cues (i.e., part appearance only) and lead to unsatisfactory parsing results, since the attributes of body parts depend heavily on the relations among them. In this article, we propose a Knowledge Embedded RCNN (KE-RCNN) that identifies attributes by leveraging rich knowledge, including implicit knowledge (e.g., the attribute "above-the-hip" for a shirt requires the visual/geometric relation between shirt and hip) and explicit knowledge (e.g., the part "shorts" cannot have the attribute "hoodie" or "lining"). Specifically, the KE-RCNN consists of two novel components: an Implicit Knowledge based Encoder (IK-En) and an Explicit Knowledge based Decoder (EK-De). The former is designed to enhance part-level representation by encoding part-part relational contexts into part boxes, and the latter is proposed to decode attributes under the guidance of prior knowledge about part-attribute relations. Designed this way, the KE-RCNN is plug-and-play and can be integrated into any two-stage detector, e.g., Attribute-RCNN, Cascade-RCNN, HRNet based RCNN and SwinTransformer based RCNN. Extensive experiments conducted on two challenging benchmarks, Fashionpedia and Kinetics-TPS, demonstrate the effectiveness and generalizability of the KE-RCNN. In particular, it improves over all existing methods, by around 3% AP on Fashionpedia and around 4% Acc on Kinetics-TPS.
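
As a rough illustration of the explicit-knowledge idea only, the sketch below applies a hard prior over part-attribute pairs when decoding multi-label attribute scores. It is not the paper's EK-De, which the abstract describes as a decoder guided by prior knowledge rather than a fixed mask. The names `ATTRIBUTES`, `FORBIDDEN` and `decode_attributes`, the 0.5 threshold, and the logit values are all hypothetical; only the "shorts" entry is taken from the abstract's own example.

```python
import math

# Toy attribute vocabulary (hypothetical; drawn from the abstract's examples).
ATTRIBUTES = ["above-the-hip", "hoodie", "lining"]

# Part -> attributes ruled out by prior knowledge.
# The "shorts" entry mirrors the abstract's example; everything else is assumed allowed.
FORBIDDEN = {"shorts": {"hoodie", "lining"}}


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_attributes(part: str, logits: list[float], threshold: float = 0.5) -> list[str]:
    """Keep attributes that pass the threshold and are not forbidden for this part."""
    banned = FORBIDDEN.get(part, set())
    return [
        name
        for name, logit in zip(ATTRIBUTES, logits)
        if name not in banned and sigmoid(logit) >= threshold
    ]


if __name__ == "__main__":
    scores = [0.2, 3.0, 1.5]  # made-up logits from a hypothetical attribute head
    print(decode_attributes("shorts", scores))
    print(decode_attributes("shirt", scores))
```

With the same scores, the "shorts" box keeps only the attribute the prior allows, while the "shirt" box keeps every attribute above the threshold. This shows how part-attribute knowledge can veto predictions that appearance alone would accept.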

updated: Tue Jun 21 2022 07:05:14 GMT+0000 (UTC)

published: Tue Jun 21 2022 07:05:14 GMT+0000 (UTC)

arXiv
