Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

Qifan Yu; Juncheng Li; Yu Wu; Siliang Tang; Wei Ji; Yueting Zhuang

Scene Graph Generation (SGG) aims to extract relationships in images for visual understanding. Although recent work has made steady progress on SGG, it still suffers from a long-tailed distribution problem: tail predicates are more costly to train and harder to distinguish because they have less annotated data than frequent predicates. Existing re-balancing strategies try to handle this with prior rules but remain confined to pre-defined conditions, so they do not scale across models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, in which a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tail problem. Building on this, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), with which models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, Epic achieves competitive performance on open-world predicate prediction. The data and code for this paper are publicly available.

updated: Sat Aug 19 2023 14:41:36 GMT+0000 (UTC)

published: Thu Mar 23 2023 13:06:38 GMT+0000 (UTC)

arXiv
