Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition

Shilin Xu; Xiangtai Li; Jingbo Wang; Guangliang Cheng; Yunhai Tong; Dacheng Tao

Human fashion understanding is an important computer vision task because it yields comprehensive information that can be used in real-world applications. In this work, we focus on joint human fashion segmentation and attribute recognition. In contrast to previous works that model each task separately as a multi-head prediction problem, our insight is to bridge the two tasks in one unified model via vision transformer modeling, to the benefit of each task. In particular, we introduce an object query for segmentation and an attribute query for attribute prediction. Both queries and their corresponding features can be linked via mask prediction. We then adopt a two-stream query learning framework to learn decoupled query representations. For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features. The decoder design shares the same spirit as DETR, thus we name the proposed method Fashionformer. Extensive experiments on three human fashion datasets, Fashionpedia, ModaNet and Deepfashion, illustrate the effectiveness of our approach. In particular, with the same backbone, our method achieves a relative improvement of 10% over previous works on a joint metric ($AP^{mask}_{IoU+F_1}$) for both segmentation and attribute recognition. To the best of our knowledge, this is the first unified end-to-end vision transformer framework for human fashion analysis. We hope this simple yet effective method can serve as a new flexible baseline for fashion analysis. Code will be available at https://github.com/xushilin1/FashionFormer.
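The joint metric named in the abstract combines a mask-overlap criterion (IoU) with an attribute-quality criterion (F_1). The abstract does not spell out the matching rule, so the following is only a minimal sketch under a stated assumption: a prediction counts as a match when its category agrees with the ground truth and both its mask IoU and its attribute F_1 reach a threshold. The function names, the dictionary layout and the 0.5 thresholds are hypothetical and are not taken from the paper or its code.

```python
def mask_iou(pred_pixels, gt_pixels):
    """IoU between two masks, each given as a set of (row, col) pixels."""
    union = pred_pixels | gt_pixels
    if not union:
        return 0.0
    return len(pred_pixels & gt_pixels) / len(union)


def attribute_f1(pred_attrs, gt_attrs):
    """F1 between predicted and ground-truth attribute label sets."""
    if not pred_attrs and not gt_attrs:
        # Assumption: no attributes expected and none predicted is a perfect score.
        return 1.0
    true_pos = len(pred_attrs & gt_attrs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_attrs)
    recall = true_pos / len(gt_attrs)
    return 2 * precision * recall / (precision + recall)


def is_joint_match(pred, gt, iou_thr=0.5, f1_thr=0.5):
    """Hypothetical joint test: same category, IoU and F1 both at or above threshold."""
    return (
        pred["category"] == gt["category"]
        and mask_iou(pred["mask"], gt["mask"]) >= iou_thr
        and attribute_f1(pred["attributes"], gt["attributes"]) >= f1_thr
    )


# Toy example: a 3-pixel prediction against a 4-pixel ground-truth mask.
prediction = {
    "category": "shirt",
    "mask": {(0, 0), (0, 1), (1, 0)},
    "attributes": {"short sleeve", "plain"},
}
ground_truth = {
    "category": "shirt",
    "mask": {(0, 0), (0, 1), (1, 0), (1, 1)},
    "attributes": {"short sleeve", "striped"},
}

print(mask_iou(prediction["mask"], ground_truth["mask"]))
print(attribute_f1(prediction["attributes"], ground_truth["attributes"]))
print(is_joint_match(prediction, ground_truth))
```

Under this assumed rule, an average precision would then be computed over matched predictions in the usual way; a segmentation that is accurate but carries wrong attributes would fail the joint test, which is why a single model that improves both outputs can gain on this metric.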

updated: Sun Apr 10 2022 11:11:10 GMT+0000 (UTC)

published: Sun Apr 10 2022 11:11:10 GMT+0000 (UTC)

arXiv
