Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

Paul Pu Liang; Amir Zadeh; Louis-Philippe Morency

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning through integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper is designed to provide an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining two key principles, modality heterogeneity and interconnections, that have driven subsequent innovations, and propose a taxonomy of six core technical challenges (representation, alignment, reasoning, generation, transference, and quantification) that covers historical and recent trends. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.

updated: Wed Sep 07 2022 19:21:19 GMT+0000 (UTC)

published: Wed Sep 07 2022 19:21:19 GMT+0000 (UTC)

arXiv
