Context-aware Mixture-of-Experts for Unbiased Scene Graph Generation

Liguang Zhou; Yuhongze Zhou; Tin Lun Lam; Yangsheng Xu

Scene graph generation has made tremendous progress in recent years. However, the intrinsic long-tailed distribution of predicate classes remains a challenging problem. Almost all existing scene graph generation (SGG) methods follow the same framework: they use a similar backbone network for object detection and a customized network for scene graph generation. These methods often design a sophisticated context encoder to extract the inherent relevance of the scene context w.r.t. the intrinsic predicates, along with complicated networks to improve the learning capability of the model on highly imbalanced data distributions. To address the unbiased SGG problem, we present a simple yet effective method called Context-Aware Mixture-of-Experts (CAME), which improves model diversity and alleviates biased SGG without a sophisticated design. Specifically, we propose to use a mixture of experts to remedy the heavily long-tailed distribution of predicate classes, an approach that is suitable for most unbiased scene graph generators. With a mixture of relation experts, the long-tailed distribution of predicates is addressed in a divide-and-ensemble manner. As a result, biased SGG is mitigated and the model tends to make more balanced predicate predictions. However, experts with the same weight are not sufficiently diverse to discriminate between the different levels of the predicate distribution. Hence, we simply use the built-in context-aware encoder to help the network dynamically leverage the rich scene characteristics and further increase the diversity of the model. By utilizing the context information of the image, the importance of each expert w.r.t. the scene context is dynamically assigned. We have conducted extensive experiments on three tasks on the Visual Genome dataset to show that CAME achieves superior performance over previous methods.
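
The idea of weighting relation experts by scene context can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the expert weights are computed, so the softmax gate over a linear projection of the context feature is an assumption, and every name and dimension below (`context_gated_moe`, `NUM_EXPERTS`, `FEAT_DIM`, and so on) is hypothetical. Random matrices stand in for trained parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Random stand-in for a trained weight matrix (illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def context_gated_moe(relation_feat, context_feat, experts, gate):
    """Mix per-expert predicate logits using context-dependent weights."""
    # Assumed gate: scene context -> one importance weight per expert.
    expert_weights = softmax(linear(gate, context_feat))
    # Each expert scores every predicate class from the relation feature.
    expert_logits = [linear(expert, relation_feat) for expert in experts]
    num_classes = len(expert_logits[0])
    # Ensemble step: weighted sum of the experts' logits per class.
    mixed = [
        sum(w * logits[c] for w, logits in zip(expert_weights, expert_logits))
        for c in range(num_classes)
    ]
    return softmax(mixed), expert_weights


# Hypothetical sizes chosen only to keep the example small.
NUM_EXPERTS, NUM_PREDICATES, FEAT_DIM, CTX_DIM = 3, 5, 4, 6

rng = random.Random(0)
experts = [make_matrix(NUM_PREDICATES, FEAT_DIM, rng) for _ in range(NUM_EXPERTS)]
gate = make_matrix(NUM_EXPERTS, CTX_DIM, rng)
relation_feat = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]
context_feat = [rng.uniform(-1.0, 1.0) for _ in range(CTX_DIM)]

probs, expert_weights = context_gated_moe(relation_feat, context_feat, experts, gate)
print("expert weights:", expert_weights)
print("predicate probabilities:", probs)
```

In this sketch, changing `context_feat` changes `expert_weights` while the experts themselves stay fixed, which is the sense in which the importance of each expert depends on the scene context.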

updated: Sun Aug 21 2022 06:36:41 GMT+0000 (UTC)

published: Mon Aug 15 2022 10:39:55 GMT+0000 (UTC)

arXiv
