RoI Tanh-polar Transformer Network for Face Parsing in the Wild

Yiming Lin; Jie Shen; Yujiang Wang; Maja Pantic

Face parsing aims to predict pixel-wise labels for the facial components of a target face in an image. Existing approaches usually crop the target face from the input image according to a bounding box calculated during pre-processing, and thus can only parse inner facial Regions of Interest (RoIs). Peripheral regions such as hair are ignored, and nearby faces that are partially included in the bounding box can cause distractions. Moreover, these methods are trained and evaluated only on near-frontal portrait images, so their performance on in-the-wild cases has been unexplored. To address these issues, this paper makes three contributions. First, we introduce the iBugMask dataset for face parsing in the wild, which consists of 21,866 training images and 1,000 testing images. The training images are obtained by augmenting an existing dataset with large face poses. The testing images are manually annotated with 11 facial regions and show large variations in size, pose, expression and background. Second, we propose the RoI Tanh-polar transform, which warps the whole image to a Tanh-polar representation with a fixed ratio between the face area and the context, guided by the target bounding box. The new representation contains all information in the original image and allows for rotation equivariance in convolutional neural networks (CNNs). Third, we propose a hybrid residual representation learning block, coined HybridBlock, that contains convolutional layers in both the Tanh-polar space and the Tanh-Cartesian space, allowing for receptive fields of different shapes in CNNs. Through extensive experiments, we show that the proposed method improves the state of the art for face parsing in the wild and does not require facial landmarks for alignment.
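
The abstract does not give the transform's formula, so the following is only a minimal sketch of one way such a warp could be built, not the authors' implementation. It assumes that the face is approximated by the ellipse inscribed in the bounding box, that the normalised radius is the tanh of the elliptical radius, and that the inverse map therefore uses atanh. The function names `tanh_polar_to_image` and `tanh_polar_grid` are hypothetical.

```python
import numpy as np


def tanh_polar_to_image(box, rho, theta):
    """Map Tanh-polar coordinates (rho, theta) to image coordinates (x, y).

    box:   (x1, y1, x2, y2) face bounding box (assumed format).
    rho:   normalised radius in [0, 1); rho = tanh(1) lies on the ellipse
           inscribed in the box (an assumption of this sketch).
    theta: angle parameter of the ellipse, in radians.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0   # box centre
    a, b = (x2 - x1) / 2.0, (y2 - y1) / 2.0     # ellipse semi-axes
    # atanh undoes the tanh squashing, so rho -> 1 reaches arbitrarily far
    # from the face and the whole image plane fits into a finite grid.
    scale = np.arctanh(rho)
    x = cx + a * scale * np.cos(theta)
    y = cy + b * scale * np.sin(theta)
    return x, y


def tanh_polar_grid(box, out_h=64, out_w=64):
    """Source (x, y) coordinates for every pixel of an out_h x out_w grid.

    Rows index the normalised radius, columns index the angle.
    """
    # Sample at pixel centres so rho never equals 1 (atanh(1) is infinite).
    rho = (np.arange(out_h) + 0.5) / out_h
    theta = 2.0 * np.pi * np.arange(out_w) / out_w
    rho, theta = np.meshgrid(rho, theta, indexing="ij")
    return tanh_polar_to_image(box, rho, theta)


if __name__ == "__main__":
    xs, ys = tanh_polar_grid((10.0, 20.0, 50.0, 100.0), out_h=4, out_w=4)
    print(xs.shape, ys.shape)
```

Under these assumptions the face ellipse always ends at the same normalised radius, tanh(1), whatever the size of the box, which is one way to obtain a fixed ratio between face area and context. A rotation about the box centre becomes a circular shift along the angle axis, which is the intuition behind the rotation equivariance mentioned in the abstract. The returned coordinate arrays could be passed to an interpolation routine such as `cv2.remap` or `scipy.ndimage.map_coordinates` to resample an actual image.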

updated: Wed May 05 2021 16:27:24 GMT+0000 (UTC)

published: Thu Feb 04 2021 16:25:26 GMT+0000 (UTC)

arXiv
