Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning

Xian Zhong; Zipeng Li; Shuqin Chen; Kui Jiang; Chen Chen; Mang Ye

Video captioning aims to generate natural language sentences that accurately describe a given video. Existing methods obtain favorable generation by exploring richer visual representations in the encoding phase or by improving the decoding ability. However, the long-tailed problem hinders these attempts on low-frequency tokens, which rarely occur but carry critical semantics and play a vital role in detailed generation. In this paper, we introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of infrequent tokens. Concretely, a Frequency-Aware Diffusion (FAD) module is proposed to comprehend the semantics of low-frequency tokens and break through generation limitations. In this way, the caption is refined by promoting the absorption of tokens with insufficient occurrence. Based on FAD, we design a Divergent Semantic Supervisor (DSS) module to compensate for the information loss of high-frequency tokens brought by the diffusion process, where the semantics of low-frequency tokens are further emphasized to alleviate the long-tailed problem. Extensive experiments indicate that RSFD outperforms the state-of-the-art methods on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrating that enhancing the semantics of low-frequency tokens can obtain a competitive generation effect. Code is available at https://github.com/lzp870/RSFD.
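To make the long-tailed problem concrete, the following minimal sketch counts token occurrences in a toy caption corpus and separates frequent from infrequent tokens. It is an illustration only and is not the RSFD, FAD, or DSS implementation: the function name `split_by_frequency`, the `min_count` threshold, and the example captions are all hypothetical and do not come from the paper or its repository.

```python
from collections import Counter


def split_by_frequency(captions, min_count=2):
    """Count whitespace-separated tokens and split them by frequency.

    `min_count` is an assumed threshold chosen for this toy example;
    tokens seen fewer than `min_count` times are treated as low-frequency.
    """
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    low = {tok for tok, n in counts.items() if n < min_count}
    high = set(counts) - low
    return counts, high, low


# Hypothetical captions, not taken from MSR-VTT or MSVD.
captions = [
    "a man is playing a guitar",
    "a man is cooking",
    "a woman is playing a ukulele",
]

counts, high, low = split_by_frequency(captions)
print(sorted(high))
print(sorted(low))
```

Even in this tiny corpus, function words such as "a" and "is" dominate the counts, while the tokens that distinguish one caption from another ("guitar", "ukulele", "cooking") each appear only once. This is the kind of imbalance the abstract refers to as the long-tailed problem.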

updated: Sun Dec 18 2022 03:21:40 GMT+0000 (UTC)

published: Mon Nov 28 2022 05:45:17 GMT+0000 (UTC)

arXiv
