Parallel Vertex Diffusion for Unified Visual Grounding

Zesen Cheng; Kehan Li; Peng Jin; Xiangyang Ji; Li Yuan; Chang Liu; Jie Chen

Unified visual grounding pursues a simple and generic technical route to leverage multi-task data with less task-specific design. The most advanced methods typically represent boxes and masks as vertex sequences, modeling referring detection and segmentation as autoregressive sequential vertex generation. However, generating high-dimensional vertex sequences sequentially is error-prone: the upstream part of the sequence remains static and cannot be refined using downstream vertex information, even if there is a significant location gap. In addition, with a limited number of vertices, poor fitting of objects with complex contours restricts the performance upper bound. To deal with this dilemma, we propose a parallel vertex generation paradigm that uses a diffusion model and scales to higher dimensions simply by modifying the noise dimension. An intuitive materialization of our paradigm is Parallel Vertex Diffusion (PVD), which directly sets vertex coordinates as the generation target and uses a diffusion model for training and inference. We claim that this direct form has two flaws: (1) unnormalized coordinates cause a high variance in the loss value; (2) the original training objective of PVD considers only point consistency and ignores geometry consistency. To solve the first flaw, the Center Anchor Mechanism (CAM) is designed to convert coordinates into normalized offset values, which stabilizes the training loss. For the second flaw, the Angle Summation Loss (ASL) is designed to constrain the geometric difference between predicted and ground-truth vertices for geometry-level consistency. Empirical results show that our PVD achieves state-of-the-art results in both referring detection and segmentation, and that our paradigm is more scalable and efficient than sequential vertex generation on high-dimensional data.
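The abstract gives no formulas for CAM or ASL, so the following is only a rough sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes the anchor is the centroid of the vertices, that offsets are scaled by image width and height, and that geometric difference can be probed with the classic angle-summation test for point-in-polygon (the signed angles subtended by a polygon's edges sum to about ±2π at an interior point and to about 0 at an exterior point). All function names and the probe-point scheme are hypothetical, and the diffusion model itself is not shown.

```python
import math


def to_center_offsets(vertices, width, height):
    """Hypothetical CAM-style encoding: centroid anchor plus normalized offsets.

    Assumption: the anchor is the vertex centroid and offsets are divided by
    the image size. The abstract only says offsets are "normalized".
    """
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    offsets = [((x - cx) / width, (y - cy) / height) for x, y in vertices]
    return (cx / width, cy / height), offsets


def from_center_offsets(center, offsets, width, height):
    """Invert to_center_offsets back to pixel coordinates."""
    cx, cy = center
    return [((cx + dx) * width, (cy + dy) * height) for dx, dy in offsets]


def angle_sum(point, vertices):
    """Sum of signed angles subtended at `point` by each polygon edge."""
    px, py = point
    total = 0.0
    n = len(vertices)
    for i in range(n):
        ax, ay = vertices[i][0] - px, vertices[i][1] - py
        bx, by = vertices[(i + 1) % n][0] - px, vertices[(i + 1) % n][1] - py
        # atan2(cross, dot) is the signed angle from vector a to vector b.
        total += math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    return total


def angle_sum_gap(pred, target, probes):
    """Hypothetical geometry penalty: mean |angle-sum difference| over probes.

    Assumption: this stands in for a geometry-level term; the actual ASL
    definition is not given in the abstract.
    """
    gaps = [abs(angle_sum(p, pred) - angle_sum(p, target)) for p in probes]
    return sum(gaps) / len(gaps)
```

With this encoding, every regression target lies in a small range around zero regardless of where the object sits in the image, which is one plausible reading of why normalized offsets would reduce loss variance. The angle-sum gap is zero when both polygons agree on which probe points are inside, and grows when they disagree, so it reacts to shape rather than to individual vertex positions.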

updated: Mon Mar 13 2023 15:51:38 GMT+0000 (UTC)

published: Mon Mar 13 2023 15:51:38 GMT+0000 (UTC)

arXiv
