Adversarial synthesis based data-augmentation for code-switched spoken language identification

Parth Shastri; Chirag Patil; Poorval Wanere; Dr. Shrinivas Mahajan; Dr. Abhishek Bhatt; Dr. Hardik Sailor

Spoken Language Identification (LID) is an important sub-task of Automatic Speech Recognition (ASR) that is used to classify the language(s) in an audio segment. Automatic LID plays a useful role in multilingual countries. In many countries, identifying a language becomes hard because two or more languages are mixed together within a conversation. This phenomenon is called code-mixing or code-switching. It occurs not only in India but also in many other Asian countries. Such code-mixed data is hard to find, which further limits the capabilities of spoken LID systems. Hence, this work addresses the data scarcity of the code-switched class using data augmentation. This study focuses on an Indic language code-mixed with English: spoken LID is performed on Hindi code-mixed with English. This research proposes a Generative Adversarial Network (GAN) based data augmentation technique that operates on Mel spectrograms of the audio data. GANs have already been shown to represent the real data distribution accurately in the image domain. The proposed research exploits these capabilities of GANs in speech-domain tasks such as speech classification and automatic speech recognition. GANs are trained to generate Mel spectrograms of the minority code-mixed class, which are then used to augment the data for the classifier. Utilizing GANs gives an overall improvement of 3.5% in Unweighted Average Recall compared to a Convolutional Recurrent Neural Network (CRNN) classifier used as the baseline reference.
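
The abstract reports its result in Unweighted Average Recall (UAR), the mean of the per-class recalls with every class counted equally, which is why it suits a setting with a scarce code-mixed class. The following is a minimal sketch of that metric, not the authors' evaluation code; the class names and the toy labels are hypothetical and chosen only to show how UAR differs from plain accuracy.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls, weighting each class equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall of a class = correct predictions / samples of that class.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: 8 majority samples, 2 minority samples.
y_true = ["hindi"] * 8 + ["code_mixed"] * 2
# The classifier gets every majority sample right but misses one minority sample.
y_pred = ["hindi"] * 8 + ["hindi", "code_mixed"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.2f}, accuracy = {accuracy:.2f}")
```

In this toy case accuracy is dominated by the majority class, while UAR is pulled down by the single missed minority sample, so gains on the minority class show up more clearly in UAR.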

updated: Wed Jun 01 2022 18:17:51 GMT+0000 (UTC)

published: Mon May 30 2022 06:41:13 GMT+0000 (UTC)

arXiv
