Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

Qianji Di; Wenxi Ma; Zhongang Qi; Tianxiang Hou; Ying Shan; Hanzi Wang

Scene Graph Generation (SGG) aims to represent the objects in an image and the connections between them in a structured, comprehensive way, which can significantly benefit scene understanding and related downstream tasks. Existing SGG models often struggle with the long-tailed problem caused by biased datasets. However, even when these models fit a specific dataset better, they may still have difficulty with unseen triples, that is, triples that do not appear in the training set. Most methods feed in a whole triple and learn its overall features through statistical machine learning. Such models have difficulty predicting unseen triples because the objects and predicates seen in the training set are recombined into novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to handle unseen triples and improve the generalization capability of SGG models. We propose a Joint Feature Learning (JFL) module and a Factual Knowledge based Refinement (FKR) module that learn object and predicate categories separately at the feature level and align them with the corresponding visual features, so that the model is no longer limited to triple matching. In addition, since we observe that the long-tailed problem also affects generalization ability, we design a novel balanced learning strategy, consisting of a Character Guided Sampling (CGS) module and an Informative Re-weighting (IR) module, that provides a tailor-made learning method for each predicate according to its characteristics. Extensive experiments show that our model achieves state-of-the-art performance. Specifically, TISGG improves zR@20 (zero-shot recall) by 11.7% on the PredCls sub-task of the Visual Genome dataset.
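
To make the notions of an unseen triple and zero-shot recall concrete, here is a minimal sketch. It is not the authors' evaluation code. It assumes triples are plain (subject, predicate, object) label tuples, ignores bounding-box matching and per-image averaging, and all names (`unseen_triples`, `zero_shot_recall_at_k`, the toy data) are hypothetical.

```python
from typing import List, Set, Tuple

# Assumption: a triple is just a (subject, predicate, object) label tuple.
Triple = Tuple[str, str, str]


def unseen_triples(train: Set[Triple], test: List[Triple]) -> List[Triple]:
    """Return the test triples whose exact combination never occurs in training."""
    return [t for t in test if t not in train]


def zero_shot_recall_at_k(
    ranked_predictions: List[Triple],
    ground_truth: List[Triple],
    train: Set[Triple],
    k: int,
) -> float:
    """Fraction of unseen ground-truth triples found among the top-k predictions.

    Simplification: matching is by labels only, with no box overlap check.
    """
    targets = set(unseen_triples(train, ground_truth))
    if not targets:
        return float("nan")  # no unseen triples to recall
    hits = targets & set(ranked_predictions[:k])
    return len(hits) / len(targets)


# Toy data: every object and predicate below appears in training,
# but two of the ground-truth combinations do not.
train = {("man", "riding", "horse"), ("dog", "on", "beach")}
ground_truth = [
    ("man", "riding", "horse"),  # seen in training
    ("dog", "riding", "horse"),  # unseen combination
    ("man", "on", "beach"),      # unseen combination
]
ranked_predictions = [
    ("man", "riding", "horse"),
    ("dog", "riding", "horse"),
    ("dog", "on", "horse"),
]

zr_at_2 = zero_shot_recall_at_k(ranked_predictions, ground_truth, train, k=2)
print(zr_at_2)
```

In this toy case the model recovers one of the two unseen triples within its top two predictions. A model that only memorizes whole training triples would miss both, which is the failure mode the abstract describes.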

updated: Fri Jun 23 2023 10:17:56 GMT+0000 (UTC)

published: Fri Jun 23 2023 10:17:56 GMT+0000 (UTC)

arXiv
