Semi-Supervised Multi-Modal Multi-Instance Multi-Label Deep Network with Optimal Transport

Yang Yang; Zhao-Yang Fu; De-Chuan Zhan; Zhi-Bin Liu; Yuan Jiang

Complex objects usually carry multiple labels and can be represented in multiple modalities; for example, a complex article contains text and image information as well as multiple annotations. Previous methods assume that homogeneous multi-modal data are consistent, whereas in real applications the raw data are disordered; for example, an article consists of a variable number of inconsistent text and image instances. Multi-modal Multi-instance Multi-label (M3) learning provides a framework for handling such tasks and has exhibited excellent performance. However, M3 learning faces two main challenges: 1) how to effectively utilize label correlation; 2) how to take advantage of multi-modal learning to process unlabeled instances. To solve these problems, we first propose a novel Multi-modal Multi-instance Multi-label Deep Network (M3DN), which performs M3 learning in an end-to-end multi-modal deep network and exploits the consistency principle among the bag-level predictions of different modalities. Based on M3DN, we learn the latent ground label metric with optimal transport. Moreover, we introduce extrinsic unlabeled multi-modal multi-instance data and propose M3DNS, which uses an instance-level auto-encoder for each single modality and a modified bag-level optimal transport to strengthen the consistency among modalities. M3DNS can thereby predict labels more accurately and exploit label correlation at the same time. Experiments on benchmark datasets and the real-world WKG Game-Hub dataset validate the effectiveness of the proposed methods.
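As a rough illustration of the optimal transport idea mentioned in the abstract, the following Python sketch computes an entropy-regularized transport cost between a predicted and a target label distribution using Sinkhorn iterations. It is not the authors' M3DN implementation: the function name `sinkhorn_cost`, the cost matrix `cost`, and the regularization strength `reg` are assumptions made for illustration, and the ground metric is fixed here, whereas the paper learns it.

```python
import numpy as np


def sinkhorn_cost(pred, target, cost, reg=0.5, n_iters=500):
    """Entropy-regularized OT cost between two label distributions.

    pred, target: 1-D arrays of strictly positive label scores.
    cost: (L, L) ground metric between labels (hypothetical, fixed here).
    Returns the transport cost and the transport plan.
    """
    pred = pred / pred.sum()          # normalize to probability vectors
    target = target / target.sum()
    kernel = np.exp(-cost / reg)      # Gibbs kernel of the ground metric
    u = np.ones_like(pred)
    for _ in range(n_iters):
        # Alternately rescale to match the column and row marginals.
        v = target / (kernel.T @ u)
        u = pred / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


# Hypothetical ground metric over three labels: nearby labels are cheap to confuse.
cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
pred = np.array([0.7, 0.2, 0.1])
target = np.array([0.1, 0.2, 0.7])

distance, plan = sinkhorn_cost(pred, target, cost)
```

Because the cost matrix encodes how related two labels are, a prediction that confuses strongly correlated labels is penalized less than one that confuses unrelated labels, which is the sense in which a transport-based loss can exploit label correlation.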

updated: Sat Apr 17 2021 09:18:28 GMT+0000 (UTC)

published: Sat Apr 17 2021 09:18:28 GMT+0000 (UTC)

arXiv
