TriCoLo: Trimodal Contrastive Loss for Fine-grained Text to Shape Retrieval

Yue Ruan; Han-Hung Lee; Ke Zhang; Angel X. Chang

Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at downstream tasks such as retrieval and classification. On the other hand, work on joint representation learning for 3D shapes and text has thus far mostly focused on improving embeddings through modeling of complex attention between representations or through multi-task learning. We show that with large-batch contrastive learning we achieve state-of-the-art (SoTA) results on text-shape retrieval without complex attention mechanisms or losses. Prior work on 3D and text representations has also focused on bimodal representation learning using either voxels or multi-view images with text. To this end, we propose a trimodal learning scheme to achieve even higher performance and better representations for all modalities.
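The abstract does not spell out the form of the loss, so the following is only a minimal sketch of what a trimodal contrastive objective over text, multi-view image, and voxel embeddings might look like. It assumes a symmetric InfoNCE-style loss for each pair of modalities, equal weighting of the three pairs, and a temperature of 0.1. All function names and these choices are hypothetical and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.1):
    """One-directional InfoNCE: row i of `anchors` should match row i of `targets`.

    Every other row in the batch acts as a negative, which is why larger
    batches supply more negatives per anchor.
    """
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every target, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax at the matching index
    return total / len(a)


def pair_loss(x, y, temperature=0.1):
    """Symmetric contrastive loss between two modalities."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def trimodal_loss(text, image, voxel, temperature=0.1):
    """Sum of the three pairwise losses (equal weighting is an assumption)."""
    return (pair_loss(text, image, temperature)
            + pair_loss(text, voxel, temperature)
            + pair_loss(image, voxel, temperature))
```

In this sketch, a batch whose three embedding lists are aligned row by row yields a loss near zero, while permuting the rows of one modality raises it, which is the signal a retrieval model would be trained to minimize.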

updated: Wed Jan 19 2022 00:15:15 GMT+0000 (UTC)

published: Wed Jan 19 2022 00:15:15 GMT+0000 (UTC)

arXiv
