Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Yabing Wang; Jianfeng Dong; Tianxiang Liang; Minsong Zhang; Rui Cai; Xun Wang

Despite recent progress in cross-modal retrieval, little research has focused on low-resource languages, owing to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, because MT is imperfect, it tends to introduce noise during translation, which corrupts the textual embeddings and thereby compromises retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations; it employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from both a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancies between original sentences and their back-translated counterparts to further improve the noise robustness of the textual encoder. Extensive experiments on three video-text and image-text cross-modal retrieval benchmarks across different languages demonstrate that our method significantly improves overall performance without using extra human-labeled data. Moreover, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
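The two noise-robustness ideas in the abstract can be illustrated with a small sketch. The abstract does not give the exact loss functions, so the choices here are assumptions made only for illustration: cosine distance for the back-translation consistency term, and a KL divergence between softmax-normalized similarity scores for the similarity-based distillation term. All function names are hypothetical and are not taken from the authors' repository. In the paper the soft pseudo-targets come from a cross-attention module; here the teacher scores are simply passed in as plain numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def back_translation_consistency(orig_emb, back_emb):
    """Assumed form: cosine distance between the embedding of an original
    sentence and the embedding of its back-translated version."""
    return 1.0 - cosine_similarity(orig_emb, back_emb)


def similarity_distillation(teacher_scores, student_scores, temperature=1.0):
    """Assumed form: KL(teacher || student) over text-to-visual similarity
    distributions. The teacher scores play the role of soft pseudo-targets."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Toy embeddings: a clean sentence and a slightly noisy back-translation.
    print(back_translation_consistency([1.0, 0.0, 0.5], [0.9, 0.1, 0.5]))
    # Toy similarity scores of one sentence against three candidate videos.
    print(similarity_distillation([2.0, 0.5, 0.1], [1.0, 0.8, 0.3]))
```

Both terms are zero when the noisy branch agrees exactly with the clean one and grow as they diverge, which is the behavior a noise-robust text encoder would be trained to minimize under these assumptions.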

updated: Fri Aug 26 2022 09:32:24 GMT+0000 (UTC)

published: Fri Aug 26 2022 09:32:24 GMT+0000 (UTC)

arXiv
