Late multimodal fusion for image and audio music transcription

María Alfaro-Contreras; Jose J. Valero-Mas; José M. Iñesta; Jorge Calvo-Zaragoza

Music transcription, the conversion of music sources into a structured digital format, is a key problem in Music Information Retrieval (MIR). The MIR community addresses this challenge computationally along two lines of research, depending on the input: music documents, which is the case of Optical Music Recognition (OMR), or audio recordings, which is the case of Automatic Music Transcription (AMT). The different nature of these input data has led the two fields to develop modality-specific frameworks. However, their recent formulation as sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription poses the challenge of effectively combining the information conveyed by the image and audio modalities. In this work, we explore this question at a late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses of end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios, in which the corresponding single-modality models yield different error rates, show interesting benefits of these approaches. In addition, two of the four strategies considered significantly improve on the corresponding unimodal standard recognition frameworks.

updated: Fri Aug 26 2022 10:09:51 GMT+0000 (UTC)

published: Wed Apr 06 2022 20:00:33 GMT+0000 (UTC)

arXiv
