Single-Stage Visual Relationship Learning using Conditional Queries

Alakh Desai; Tz-Ying Wu; Subarna Tripathi; Nuno Vasconcelos

Research in scene graph generation (SGG) usually considers two-stage models, which first detect a set of entities and then combine them and label all possible relationships. While this pipeline structure shows promising results, it induces a large parameter and computation overhead and typically hinders end-to-end optimization. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set-based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely TraCQ, with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods but also beats many state-of-the-art two-stage methods on the Visual Genome dataset, while remaining capable of end-to-end training and faster inference.
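
To make the "combinatorial entity pair" cost of a two-stage pipeline concrete, the sketch below counts the ordered (subject, object) candidates that a pairwise labeling stage would have to consider. It is an illustration only and not code from the paper: the function names and the example entity labels are hypothetical, and it assumes every detected entity may be paired with every other one.

```python
from itertools import permutations
from typing import List, Tuple


def num_candidate_pairs(num_entities: int) -> int:
    """Ordered (subject, object) pairs with distinct entities: N * (N - 1)."""
    return num_entities * (num_entities - 1)


def candidate_pairs(entities: List[str]) -> List[Tuple[str, str]]:
    """Enumerate every ordered pair of distinct detections (by position)."""
    return list(permutations(entities, 2))


# Hypothetical detections for a single image.
detections = ["person", "horse", "hat"]
pairs = candidate_pairs(detections)
```

Under this assumption the candidate count grows quadratically with the number of detections, which is the overhead that single-shot triplet prediction tries to avoid.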

updated: Fri Jun 09 2023 06:02:01 GMT+0000 (UTC)

published: Fri Jun 09 2023 06:02:01 GMT+0000 (UTC)

arXiv
