Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

Kaifeng Gao; Long Chen; Yulei Niu; Jian Shao; Jun Xiao

Today's video scene graph generation (VidSGG) models are all proposal-based: they first generate numerous paired subject-object snippets as proposals, and then conduct predicate classification for each proposal. In this paper, we argue that this prevalent proposal-based framework has three inherent drawbacks: 1) The ground-truth predicate labels for proposals are only partially correct. 2) Proposals break the high-order relations among different predicate instances of the same subject-object pair. 3) VidSGG performance is upper-bounded by the quality of the proposals. To address these issues, we propose a new classification-then-grounding framework for VidSGG, which avoids all three overlooked drawbacks. Under this framework, we reformulate video scene graphs as temporal bipartite graphs, where entities and predicates are two types of nodes with time slots, and the edges denote different semantic roles between these nodes. This formulation takes full advantage of our new framework. Accordingly, we further propose a novel BIpartite Graph based SGG model: BIG. Specifically, BIG consists of two parts: a classification stage, which classifies the categories of all the nodes and edges, and a grounding stage, which localizes the temporal location of each relation instance. Extensive ablations on two VidSGG datasets attest to the effectiveness of our framework and BIG.
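
The temporal bipartite graph described in the abstract can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions, not the BIG model or the authors' code: the class names, the (start_frame, end_frame) encoding of a time slot, the choice to let one predicate node hold several time slots, and the "subject"/"object" role labels are all hypothetical. The helper reads relation instances back out of the graph by intersecting each predicate time slot with the time slots of its two connected entities.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed encoding: (start_frame, end_frame), both inclusive.
TimeSlot = Tuple[int, int]


@dataclass
class EntityNode:
    category: str
    time_slot: TimeSlot  # when the entity is visible (assumption: one slot)


@dataclass
class PredicateNode:
    category: str
    time_slots: List[TimeSlot]  # assumption: one node may cover several instances


@dataclass
class RoleEdge:
    entity: int     # index into TemporalBipartiteGraph.entities
    predicate: int  # index into TemporalBipartiteGraph.predicates
    role: str       # hypothetical role labels: "subject" or "object"


@dataclass
class TemporalBipartiteGraph:
    entities: List[EntityNode] = field(default_factory=list)
    predicates: List[PredicateNode] = field(default_factory=list)
    edges: List[RoleEdge] = field(default_factory=list)


def intersect(a: TimeSlot, b: TimeSlot) -> Optional[TimeSlot]:
    """Return the overlap of two inclusive time slots, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def relation_instances(graph: TemporalBipartiteGraph):
    """List (subject, predicate, object, time_slot) tuples encoded by the graph."""
    results = []
    for p_idx, pred in enumerate(graph.predicates):
        # Collect the entities attached to this predicate, keyed by role.
        roles = {e.role: graph.entities[e.entity]
                 for e in graph.edges if e.predicate == p_idx}
        if "subject" not in roles or "object" not in roles:
            continue
        subj, obj = roles["subject"], roles["object"]
        for slot in pred.time_slots:
            # A relation can only hold while both entities are present.
            clipped = intersect(slot, subj.time_slot)
            if clipped is not None:
                clipped = intersect(clipped, obj.time_slot)
            if clipped is not None:
                results.append((subj.category, pred.category, obj.category, clipped))
    return results


# Toy example: one predicate node carrying two instances for the same pair.
graph = TemporalBipartiteGraph(
    entities=[EntityNode("dog", (0, 90)), EntityNode("frisbee", (10, 80))],
    predicates=[PredicateNode("chase", [(12, 30), (55, 85)])],
    edges=[RoleEdge(0, 0, "subject"), RoleEdge(1, 0, "object")],
)
instances = relation_instances(graph)
```

In this toy graph both "chase" instances hang off a single predicate node, so the link between instances of the same subject-object pair is kept in one place rather than scattered across separate proposals.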

updated: Wed Dec 08 2021 10:49:09 GMT+0000 (UTC)

published: Wed Dec 08 2021 10:49:09 GMT+0000 (UTC)

arXiv
