How to Describe Images in a More Funny Way? Towards a Modular Approach to Cross-Modal Sarcasm Generation

Jie Ruan; Yue Wu; Xiaojun Wan; Yuesheng Zhu

Previous studies have treated sarcasm generation as a text-to-text generation problem, i.e., generating a sarcastic sentence for an input sentence. In this paper, we study a new problem, cross-modal sarcasm generation (CMSG), i.e., generating a sarcastic description for a given image. CMSG is challenging because models must satisfy the characteristics of sarcasm as well as the correlation between different modalities. In addition, there should be some inconsistency between the two modalities, which requires imagination. Moreover, high-quality training data is scarce. To address these problems, we take a step toward generating sarcastic descriptions from images without paired training data and propose an Extraction-Generation-Ranking based Modular method (EGRM) for cross-modal sarcasm generation. Specifically, EGRM first extracts diverse information from an image at different levels and uses the obtained image tags, sentimental descriptive caption, and commonsense-based consequence to generate candidate sarcastic texts. Then, a comprehensive ranking algorithm, which considers image-text relation, sarcasticness, and grammaticality, is proposed to select a final text from the candidate texts. Human evaluation on five criteria over a total of 1200 generated image-text pairs from eight systems, together with auxiliary automatic evaluation, shows the superiority of our method.
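
The ranking stage described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say how the three criteria are combined, so the weighted sum, the `Candidate` class, the function names, and the score ranges below are hypothetical and are not the authors' implementation. The per-criterion scores are taken as given inputs; how EGRM computes them is not stated in the abstract.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """A candidate sarcastic text with hypothetical per-criterion scores in [0, 1]."""
    text: str
    relation: float  # image-text relation (assumed to be precomputed)
    sarcasm: float   # sarcasticness (assumed to be precomputed)
    grammar: float   # grammaticality (assumed to be precomputed)


def rank_candidates(
    candidates: Sequence[Candidate],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> list:
    """Sort candidates by a weighted sum of the three scores, best first.

    The weighted sum is an assumption; the abstract only names the criteria.
    """
    w_rel, w_sar, w_gra = weights

    def score(c: Candidate) -> float:
        return w_rel * c.relation + w_sar * c.sarcasm + w_gra * c.grammar

    return sorted(candidates, key=score, reverse=True)


def select_final(
    candidates: Sequence[Candidate],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Candidate:
    """Return the top-ranked candidate, mirroring the 'select a final text' step."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return rank_candidates(candidates, weights)[0]
```

Changing `weights` shifts which criterion dominates the selection; equal weights are only a placeholder default.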

updated: Sun Nov 20 2022 14:38:24 GMT+0000 (UTC)

published: Sun Nov 20 2022 14:38:24 GMT+0000 (UTC)

arXiv
