PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding

Zihan Ding; Zi-han Ding; Tianrui Hui; Junshi Huang; Xiaoming Wei; Xiaolin Wei; Si Liu

Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to segment the visual objects, of both things and stuff categories, that are described by the dense narrative caption of a still image. The previous two-stage approach first extracts segmentation region proposals with an off-the-shelf panoptic segmentation model, then conducts coarse region-phrase matching to ground the candidate regions for each noun phrase. However, the two-stage pipeline usually suffers from three drawbacks: performance is limited by low-quality proposals from the first stage, spatial details are lost through region feature pooling, and complicated strategies have to be designed separately for things and stuff categories. To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination. Thus, our model can exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. We also propose a Language-Compatible Pixel Aggregation (LCPA) module that further enhances the discriminative ability of phrase features through multi-round refinement: it selects the most compatible pixels for each phrase and adaptively aggregates the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark, with an absolute gain of 4.0 in Average Recall.
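To make the idea of matching a phrase directly to pixels more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that a phrase and every pixel are already embedded as vectors of the same length, scores each pixel by a dot product followed by a sigmoid, and refines the phrase feature over a few rounds. The top-k selection, the softmax weighting, the residual update, and all function and parameter names (`match_pixels`, `aggregate_top_k`, `ground_phrase`, `rounds`, `k`, `threshold`) are illustrative assumptions; the abstract only states that the most compatible pixels are selected and their visual context is aggregated.

```python
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def match_pixels(phrase, pixels):
    """Score every pixel against one phrase: sigmoid of the dot product."""
    return [1.0 / (1.0 + math.exp(-dot(phrase, p))) for p in pixels]


def aggregate_top_k(phrase, pixels, k):
    """One refinement round (assumed form): add the softmax-weighted mean
    of the k most compatible pixel features to the phrase feature."""
    scores = [dot(phrase, p) for p in pixels]
    top = sorted(range(len(pixels)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    context = [
        sum(w * pixels[i][d] for w, i in zip(weights, top)) / total
        for d in range(len(phrase))
    ]
    return [a + c for a, c in zip(phrase, context)]


def ground_phrase(phrase, pixels, rounds=2, k=2, threshold=0.5):
    """Refine the phrase feature, then threshold the pixel scores into a mask."""
    for _ in range(rounds):
        phrase = aggregate_top_k(phrase, pixels, k)
    return [score > threshold for score in match_pixels(phrase, pixels)]


# Four flattened "pixels" with 2-dimensional features (toy values).
pixels = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
mask = ground_phrase([1.0, 0.0], pixels)
print(mask)
```

In this toy setting the mask is produced per phrase over all pixels, so no region proposals are involved; combining the per-phrase masks into a panoptic output is left out of the sketch.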

updated: Thu Aug 11 2022 05:42:12 GMT+0000 (UTC)

published: Thu Aug 11 2022 05:42:12 GMT+0000 (UTC)

arXiv
