MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish

Begum Citamak; Ozan Caglayan; Menekse Kuyu; Erkut Erdem; Aykut Erdem; Pranava Madhyastha; Lucia Specia

Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of a video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper, we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large-scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English-Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphologically rich and agglutinative languages.

updated: Sun Dec 13 2020 16:51:35 GMT+0000 (UTC)

published: Sun Dec 13 2020 16:51:35 GMT+0000 (UTC)

arXiv
