Topic Scene Graph Generation by Attention Distillation from Caption

W. Wang; R. Wang; X. Chen

If an image tells a story, the image caption is its briefest narrator. A scene graph tends to be an omniscient generalist, whereas an image caption is more of a specialist that outlines the gist. Many previous studies have found that a scene graph is not as practical as expected unless it can filter out trivial content and noise. In this respect, the image caption is a good tutor. We therefore let the scene graph borrow this ability from the image caption so that it can become a specialist while remaining all-around, resulting in the so-called Topic Scene Graph. What an image caption pays attention to is distilled and passed to the scene graph to estimate the importance of partial objects, relationships, and events. Specifically, during caption generation, the attention over individual objects at each time step is collected, pooled, and assembled to obtain attention over relationships, which serves as weak supervision for regularizing the estimated importance scores of relationships. In addition, because this attention distillation process provides an opportunity to combine image caption generation with scene graph generation, we further transform the scene graph into linguistic form with rich, free-form expressions by sharing a single generation model with image captioning. Experiments show that attention distillation brings significant improvements in mining important relationships without strong supervision, and that the topic scene graph shows great potential in subsequent applications.
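
The collect, pool, and assemble steps described in the abstract can be illustrated with a small sketch. The abstract does not specify the pooling operator, how object attention is combined into relationship attention, or the form of the regularizer, so everything below is an assumption made for illustration: max-pooling over time steps, a product of subject and object attention, and a KL-divergence loss against softmax-normalized importance scores. All function names are hypothetical and are not taken from the paper.

```python
import math


def pool_object_attention(step_attention):
    """Max-pool per-object attention over caption time steps (assumed operator).

    step_attention: list of T lists, each holding N attention weights,
    one per detected object, collected at one decoding step.
    """
    num_objects = len(step_attention[0])
    return [max(step[i] for step in step_attention) for i in range(num_objects)]


def relation_attention(object_attention, pairs):
    """Assemble attention for each (subject, object) index pair.

    Assumption: a relationship matters when both of its endpoints were
    attended to, so the two pooled weights are multiplied, then the
    results are normalized to sum to 1.
    """
    raw = [object_attention[s] * object_attention[o] for s, o in pairs]
    total = sum(raw)
    if total == 0:
        return [1.0 / len(pairs)] * len(pairs)
    return [r / total for r in raw]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def distillation_loss(target, scores, eps=1e-12):
    """KL(target || softmax(scores)), used here as a weak-supervision term.

    Assumption: the scene graph model emits one raw importance score per
    candidate relationship; the loss pulls their softmax toward the
    attention distilled from the captioner.
    """
    pred = softmax(scores)
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, pred)
        if t > 0
    )


# Toy input: 3 caption steps attending over 3 objects.
step_attention = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
pairs = [(0, 1), (0, 2), (1, 2)]  # candidate (subject, object) pairs

pooled = pool_object_attention(step_attention)
target = relation_attention(pooled, pairs)
loss = distillation_loss(target, [0.0, 0.0, 0.0])  # uniform predicted scores
```

In this toy input, the pair (0, 1) receives the largest distilled weight because both of its objects were strongly attended to at some step, and the loss falls to zero once the predicted importance distribution matches the distilled one. Only the captioner's attention is needed as a target, which is what makes the supervision weak: no human importance labels for relationships are involved.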

updated: Tue Oct 12 2021 04:26:12 GMT+0000 (UTC)

published: Tue Oct 12 2021 04:26:12 GMT+0000 (UTC)

arXiv
