ViTA: Visual-Linguistic Translation by Aligning Object Tags

Kshitij Gupta; Devansh Gautam; Radhika Mamidi

Multimodal Machine Translation (MMT) enriches the source text with visual information for translation. It has gained popularity in recent years, and several pipelines have been proposed for it. Yet the task lacks quality datasets that illustrate the contribution of the visual modality to translation systems. In this paper, we describe our system, submitted under the team name Volta, for the English-to-Hindi Multimodal Translation Task of WAT 2021. We also participate in the text-only subtask for the same language pair, for which we use mBART, a pretrained multilingual sequence-to-sequence model. For multimodal translation, we propose to enhance the textual input by bringing the visual information into the textual domain: we extract object tags from the image. We also explore the robustness of our system by systematically degrading the source text. Finally, we achieve BLEU scores of 44.6 and 51.6 on the test set and challenge set of the multimodal task, respectively.
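The two ideas in the abstract, appending object tags to the source sentence and degrading the source text, can be illustrated with a minimal sketch. The abstract does not specify the input format or the degradation scheme, so everything below is an assumption made for illustration: the function names `augment_with_tags` and `degrade`, the ` ## ` separator, the `<mask>` placeholder, and the per-token masking probability are all hypothetical. The object tags are assumed to come from some external object detector, which is not shown.

```python
import random


def augment_with_tags(source, tags, sep=" ## "):
    """Append de-duplicated, lowercased object tags to the source sentence.

    The separator and tag formatting are illustrative assumptions,
    not the format used by the paper.
    """
    cleaned = (t.strip().lower() for t in tags if t.strip())
    unique = list(dict.fromkeys(cleaned))  # keeps first-seen order
    if not unique:
        return source
    return source + sep + ", ".join(unique)


def degrade(source, mask_prob, mask_token="<mask>", seed=0):
    """Replace each whitespace token with mask_token with probability mask_prob.

    This is one hypothetical way to degrade the source text; the abstract
    only says the degradation is systematic.
    """
    rng = random.Random(seed)
    tokens = source.split()
    return " ".join(
        mask_token if rng.random() < mask_prob else tok for tok in tokens
    )


if __name__ == "__main__":
    sentence = "a man riding a horse"
    detected = ["Horse", "man", "horse"]  # assumed detector output
    print(augment_with_tags(sentence, detected))
    print(augment_with_tags(degrade(sentence, 0.5), detected))
```

In a setup like this, the augmented string would be fed to the text-only translation model in place of the plain source sentence, so that a degraded sentence can still be partly recovered from the tags.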

updated: Tue Jun 08 2021 12:26:04 GMT+0000 (UTC)

published: Tue Jun 01 2021 06:19:29 GMT+0000 (UTC)

arXiv
