Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning

Hao Zhang; Nianwen Si; Yaqi Chen; Wenlin Zhang; Xukui Yang; Dan Qu; Wei-Qiang Zhang

The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm because of its low latency and reduced error propagation. However, training such a model well is non-trivial because of task complexity and data scarcity. Because of the differences between the speech and text modalities, E2E-ST models usually perform worse than the corresponding machine translation (MT) models. Based on this observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than an MT model trained alone, which means that the knowledge transfer ability of these methods is also limited. To deal with these problems, we propose the FCCL (Fine- and Coarse-Granularity Contrastive Learning) approach for E2E-ST, which performs explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both the sentence and frame levels to provide comprehensive guidance for extracting speech representations that contain rich semantic information. In addition, we adopt a simple whitening method to alleviate the representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms the state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up model capacity from learning grammatical structure information and force more layers to learn semantic information.
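
The abstract does not state the exact loss used by FCCL. As an illustration only, cross-modal contrastive learning is often implemented with an InfoNCE-style objective, in which each speech representation is pulled toward its paired text representation and pushed away from the other text representations in the batch. The sketch below assumes that formulation; the function names (`cosine`, `info_nce`, `mean_pool`), the temperature value, and the use of mean pooling for the sentence level are hypothetical choices, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(speech_vecs, text_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed formulation, not the paper's).

    speech_vecs[i] and text_vecs[i] form a positive pair; every other
    text vector in the batch acts as a negative for speech_vecs[i].
    """
    losses = []
    for i, s in enumerate(speech_vecs):
        logits = [cosine(s, t) / temperature for t in text_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive pair.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def mean_pool(frames):
    """Average a sequence of frame vectors into one sentence vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


# Sentence level (coarse): pool each utterance, then contrast across the batch.
speech_batch = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0], [0.2, 0.8]]]
text_batch = [[[0.9, 0.1]], [[0.1, 0.9]]]
sentence_loss = info_nce(
    [mean_pool(f) for f in speech_batch],
    [mean_pool(t) for t in text_batch],
)
print(sentence_loss)
```

Under the same assumption, a frame-level (fine-grained) variant would call `info_nce` on individual frame and token vectors rather than on pooled sentence vectors. How speech frames are aligned with text tokens is not described in the abstract and is left out of this sketch.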

updated: Thu Apr 20 2023 13:41:56 GMT+0000 (UTC)

published: Thu Apr 20 2023 13:41:56 GMT+0000 (UTC)

arXiv
