Person Text-Image Matching via Text-Feature Interpretability Embedding and External Attack Node Implantation

Fan Li; Hang Zhou; Huafeng Li; Yafei Zhang; Zhengtao Yu

Person text-image matching, also known as text-based person search, aims to retrieve images of specific pedestrians using text descriptions. Although the field has made considerable progress, existing methods still face two challenges. First, the lack of interpretability of text features makes it difficult to align them effectively with their corresponding image features. Second, the same pedestrian image often corresponds to multiple different text descriptions, and a single text description can correspond to multiple different images of the same identity. This diversity of text descriptions and images makes it difficult for a network to extract robust features that match the two modalities. To address these problems, we propose a person text-image matching method that embeds text-feature interpretability and implants an external attack node. Specifically, we improve the interpretability of text features by giving them semantic information consistent with image features, so that text descriptions can be aligned with image region features. To address the challenges posed by the diversity of text and the corresponding person images, we treat the feature variation caused by this diversity as the effect of perturbation information and propose a novel adversarial attack and defense method to handle it. In the model design, graph convolution is used as the basic framework for feature representation, and the adversarial attacks that text and image diversity impose on feature extraction are simulated by implanting an additional attack node in the graph convolution layer, which improves the robustness of the model against text and image diversity. Extensive experiments demonstrate the effectiveness of the proposed method for text-pedestrian image matching and its superiority over existing methods. The source code of the method is published at
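
The attack-node idea described in the abstract can be pictured roughly as follows. This is only a minimal sketch and not the authors' implementation: the function names (`normalize_adjacency`, `graph_conv`, `implant_attack_node`), the row-normalized adjacency with self-loops, the full connection of the extra node to every original node, and the random perturbation standing in for a learned attack are all assumptions made for illustration. The defense side (training the model so that clean and attacked features stay close) is not shown.

```python
import random


def normalize_adjacency(adj):
    """Add self-loops, then row-normalize (an assumed normalization)."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    return [[v / sum(row) for v in row] for row in looped]


def graph_conv(features, adj, weight):
    """One graph convolution step: ReLU(A_norm @ X @ W)."""
    a = normalize_adjacency(adj)
    n, d_in, d_out = len(features), len(weight), len(weight[0])
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(a[i][k] * features[k][c] for k in range(n))
            for c in range(d_in)] for i in range(n)]
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(n)]


def implant_attack_node(features, adj, attack_feature):
    """Append one extra node linked to every original node (hypothetical wiring)."""
    n = len(adj)
    new_adj = [row + [1.0] for row in adj] + [[1.0] * n + [0.0]]
    return features + [attack_feature], new_adj


# Toy graph: three region/word nodes with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the example readable

clean = graph_conv(features, adj, weight)

# A random vector stands in for the perturbation carried by the attack node.
rng = random.Random(0)
attack_feature = [rng.uniform(-1.0, 1.0) for _ in range(2)]
att_features, att_adj = implant_attack_node(features, adj, attack_feature)

# Keep only the original nodes' outputs; the attack node itself is discarded.
attacked = graph_conv(att_features, att_adj, weight)[:len(features)]

# Largest change the implanted node causes in any original node feature.
gap = max(abs(c - a) for cr, ar in zip(clean, attacked) for c, a in zip(cr, ar))
print("max feature shift caused by the attack node:", gap)
```

In a robustness-oriented training loop, a quantity like `gap` would typically be driven down so that the original nodes' features change little when the attack node is present; how the paper actually formulates the attack and defense objectives is not stated in the abstract.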

updated: Sat Nov 19 2022 03:55:55 GMT+0000 (UTC)

published: Wed Nov 16 2022 04:15:37 GMT+0000 (UTC)

arXiv
