PE-former: Pose Estimation Transformer

Paschalis Panteleris; Antonis Argyros

Vision transformer architectures have been demonstrated to work very effectively for image classification tasks. Efforts to solve more challenging vision tasks with transformers rely on convolutional backbones for feature extraction. In this paper, we investigate the use of a pure transformer architecture (i.e., one with no CNN backbone) for the problem of 2D body pose estimation. We evaluate two ViT architectures on the COCO dataset. We demonstrate that using an encoder-decoder transformer architecture yields state-of-the-art results on this estimation problem.

updated: Thu Dec 09 2021 15:20:23 GMT+0000 (UTC)

published: Thu Dec 09 2021 15:20:23 GMT+0000 (UTC)

arXiv

