Cross-Modal Retrieval for Motion and Text via MildTriple Loss

Sheng Yan; Haoqiang Wang; Xin Du; Mengyuan Liu; Hong Liu

Cross-modal retrieval has become a prominent research topic in computer vision and natural language processing, driven by advances in image-text and video-text retrieval. However, cross-modal retrieval between human motion sequences and text has received little attention despite its broad application value, such as helping virtual reality applications better understand users' actions and language. The task presents several challenges: jointly modeling the two modalities, understanding person-centered information in text, and learning behavior features from 3D human motion sequences. Previous work on motion data modeling relied mainly on autoregressive feature extractors, which may forget earlier information. We instead propose a model built on simple yet powerful transformer-based motion and text encoders, which learn representations from the two modalities and capture long-term dependencies. Furthermore, different human motions can share the same atomic actions, and this overlap can cause semantic conflicts. This leads us to explore a new triplet loss function, MildTriple Loss. It leverages the similarity between samples in intra-modal space to guide soft-hard negative sample mining in the joint embedding space, training the triplet loss while reducing the violations caused by false negative samples. We evaluated our model and method on the latest HumanML3D and KIT Motion-Language datasets, achieving 62.9% recall for motion retrieval and 71.5% recall for text retrieval (both at R@10) on the HumanML3D dataset. Our code is available at https://github.com/eanson023/rehamot.
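
The abstract names two ideas that a small example may make more concrete: the R@K retrieval metric, and a triplet loss whose negative mining avoids likely false negatives. The sketch below is illustrative only and is not the paper's formulation, which the abstract does not give. It assumes a one-to-one pairing (query i matches candidate i) and swaps in a simple thresholded mask on intra-modal similarity in place of the paper's soft-hard mining. It covers one retrieval direction only. The function names and the `margin` and `threshold` values are hypothetical.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def recall_at_k(sim: Matrix, k: int) -> float:
    """Fraction of queries whose paired candidate ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match of query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


def masked_hard_triplet_loss(
    cross_sim: Matrix,
    intra_sim: Matrix,
    margin: float = 0.2,      # hypothetical value
    threshold: float = 0.8,   # hypothetical value
) -> float:
    """Hinge triplet loss on the hardest negative, skipping likely false negatives.

    cross_sim[i][j]: similarity of motion i and text j in the joint space.
    intra_sim[i][j]: similarity of samples i and j within a single modality.
    A candidate j with intra_sim[i][j] >= threshold is treated as a possible
    false negative for anchor i and is excluded from mining (an assumed,
    simplified stand-in for soft-hard negative mining).
    """
    total = 0.0
    for i, row in enumerate(cross_sim):
        negatives = [
            row[j]
            for j in range(len(row))
            if j != i and intra_sim[i][j] < threshold
        ]
        if not negatives:
            continue  # every candidate was masked; this anchor contributes 0
        hardest = max(negatives)
        total += max(0.0, margin + hardest - row[i])
    return total / len(cross_sim)


if __name__ == "__main__":
    # Toy data: samples 0 and 1 are near-duplicates within their modality.
    cross = [[0.9, 0.95, 0.1], [0.2, 0.8, 0.3], [0.1, 0.4, 0.7]]
    intra = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
    print("R@1:", recall_at_k(cross, 1))
    print("masked loss:", masked_hard_triplet_loss(cross, intra))
    print("unmasked loss:", masked_hard_triplet_loss(cross, intra, threshold=1.1))
```

In the toy data, text 1 outscores the true match for motion 0, so plain hardest-negative mining would penalize that pair. Because samples 0 and 1 are highly similar within their modality, the mask excludes that candidate and the penalty disappears, which is the kind of false-negative violation the abstract describes.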

updated: Mon Jul 17 2023 08:38:53 GMT+0000 (UTC)

published: Sun May 07 2023 05:40:48 GMT+0000 (UTC)

arXiv
