HyperLearn: A Distributed Approach for Representation Learning in Datasets With Many Modalities

Devanshu Arya; Stevan Rudinac; Marcel Worring

Multimodal datasets contain an enormous amount of relational information, which grows exponentially with the introduction of new modalities. Learning representations in such a scenario is inherently complex because of the multiple heterogeneous information channels involved. These channels can encode both (a) inter-relations between items of different modalities and (b) intra-relations between items of the same modality. Encoding multimedia items into a continuous low-dimensional semantic space such that both types of relations are captured and preserved is extremely challenging, especially if the goal is a unified end-to-end learning framework. Two key challenges need to be addressed: 1) the framework must be able to merge complex intra- and inter-relations without losing valuable information, and 2) the learning model should be invariant to the addition of new and potentially very different modalities. In this paper, we propose a flexible framework that can scale to data streams from many modalities. To that end, we introduce a hypergraph-based model for data representation and deploy Graph Convolutional Networks to fuse relational information within and across modalities. Our approach provides an efficient solution for distributing otherwise extremely computationally expensive or even infeasible training processes across multiple GPUs, without sacrificing accuracy. Moreover, adding a new modality to our model requires only an additional GPU unit while keeping the computational time unchanged, which brings representation learning to truly multimodal datasets. We demonstrate the feasibility of our approach in experiments on multimedia datasets featuring second-, third-, and fourth-order relations.
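
To make the hypergraph representation mentioned in the abstract more concrete, the following is a minimal sketch of how a third-order relation (one item from each of three modalities) might be stored as a hyperedge, and how features could be smoothed over it. It is not the authors' implementation: the toy items, the function names `incidence_matrix` and `propagate`, and the plain averaging rule are all assumptions chosen for illustration, and the learned weights, nonlinearity, and multi-GPU distribution described in the paper are omitted.

```python
# Hypothetical toy data: six items from three modalities, indexed 0..5.
nodes = ["user:a", "user:b", "image:1", "image:2", "tag:cat", "tag:dog"]

# Each hyperedge joins one item per modality (a third-order relation).
hyperedges = [
    (0, 2, 4),  # user a, image 1, tag "cat"
    (1, 3, 5),  # user b, image 2, tag "dog"
    (0, 3, 4),  # user a, image 2, tag "cat"
]


def incidence_matrix(num_nodes, edges):
    """Return H with H[v][e] = 1.0 if node v belongs to hyperedge e, else 0.0."""
    H = [[0.0] * len(edges) for _ in range(num_nodes)]
    for e, members in enumerate(edges):
        for v in members:
            H[v][e] = 1.0
    return H


def propagate(H, X):
    """One parameter-free smoothing step, X' = Dv^-1 H De^-1 H^T X.

    Features are first averaged inside each hyperedge; every node then
    averages the features of the hyperedges it belongs to. This assumes
    every hyperedge has at least one member. A graph-convolution layer
    would additionally apply a learned weight matrix and a nonlinearity,
    which this sketch leaves out.
    """
    n, m, d = len(H), len(H[0]), len(X[0])
    edge_deg = [sum(H[v][e] for v in range(n)) for e in range(m)]
    node_deg = [sum(H[v]) for v in range(n)]

    # Mean feature vector of each hyperedge.
    E = [
        [sum(H[v][e] * X[v][k] for v in range(n)) / edge_deg[e] for k in range(d)]
        for e in range(m)
    ]

    out = []
    for v in range(n):
        if node_deg[v] == 0:
            out.append(list(X[v]))  # isolated node: keep its own features
        else:
            out.append(
                [sum(H[v][e] * E[e][k] for e in range(m)) / node_deg[v] for k in range(d)]
            )
    return out


H = incidence_matrix(len(nodes), hyperedges)
# One-hot input features, so each output row shows where a node's signal came from.
X = [[float(i == j) for j in range(len(nodes))] for i in range(len(nodes))]
X1 = propagate(H, X)
```

After one step, `X1[0]` (user a) carries weight from image 1, image 2, and the "cat" tag, which illustrates how a single hyperedge mixes information across modalities rather than across pairs only.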

updated: Thu Sep 19 2019 22:45:21 GMT+0000 (UTC)

published: Thu Sep 19 2019 22:45:21 GMT+0000 (UTC)

arXiv
