CCNet: Criss-Cross Attention for Semantic Segmentation

Zilong Huang; Xinggang Wang; Yunchao Wei; Lichao Huang; Humphrey Shi; Wenyu Liu; Thomas S. Huang

Contextual information is vital in visual understanding problems such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information effectively and efficiently. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By applying this module recurrently, each pixel can ultimately capture full-image dependencies. In addition, a category-consistent loss is proposed to push the criss-cross attention module to produce more discriminative features. Overall, CCNet has the following merits: 1) GPU memory friendliness. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85% relative to the non-local block. 3) State-of-the-art performance. We conduct extensive experiments on semantic segmentation benchmarks including Cityscapes and ADE20K, the human parsing benchmark LIP, the instance segmentation benchmark COCO, and the video segmentation benchmark CamVid. In particular, CCNet achieves mIoU scores of 81.9%, 45.76% and 55.47% on the Cityscapes test set, the ADE20K validation set and the LIP validation set respectively, which are new state-of-the-art results. The source code is available at https://github.com/speedinghzl/CCNet.
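The criss-cross idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a toy 4x5 feature grid, uses the raw features as query, key and value with no learned projections, and builds a dense mask purely for readability, so it does not itself save memory. The names `criss_cross_mask` and `criss_cross_pass` are hypothetical. It shows two things: each pixel attends to only H + W - 1 positions instead of all H x W, and although one pass leaves some pixel pairs unconnected, a second pass links every pair.

```python
import numpy as np


def criss_cross_mask(h, w):
    """Boolean (h*w, h*w) matrix: entry [p, q] is True when pixel q shares
    a row or a column with pixel p, i.e. q lies on p's criss-cross path."""
    rows, cols = np.divmod(np.arange(h * w), w)
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    return same_row | same_col


def criss_cross_pass(x, mask):
    """One attention pass restricted to the criss-cross path.

    x has shape (h*w, c). For brevity the features act as query, key and
    value at once (no learned projections), which is a simplification.
    """
    scores = np.where(mask, x @ x.T, -np.inf)      # hide off-path pixels
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ x                         # residual aggregation


h, w, c = 4, 5, 8           # toy sizes, chosen only for illustration
mask = criss_cross_mask(h, w)
n = h * w

print(int(mask[0].sum()))   # pixels on one path: h + w - 1 = 8
print(n * n)                # dense pairwise weights: 400
print(int(mask.sum()))      # criss-cross weights: n * (h + w - 1) = 160

# Connectivity: one pass does not link every pixel pair, two passes do.
two_step = (mask.astype(int) @ mask.astype(int)) > 0
print(bool(mask.all()), bool(two_step.all()))   # False True

# Perturb the bottom-right pixel and watch the top-left pixel's output.
rng = np.random.default_rng(0)
x = 0.1 * rng.standard_normal((n, c))
x_perturbed = x.copy()
x_perturbed[-1] += 1.0

one_a, one_b = criss_cross_pass(x, mask), criss_cross_pass(x_perturbed, mask)
two_a, two_b = criss_cross_pass(one_a, mask), criss_cross_pass(one_b, mask)
print(np.allclose(one_a[0], one_b[0]))   # True: no influence after one pass
print(np.allclose(two_a[0], two_b[0]))   # False: influence after two passes
```

The second pass works because any two pixels (i, j) and (k, l) share the intermediate pixels (i, l) and (k, j), each of which lies on both criss-cross paths. The weight counts above cover only the attention map and are not meant to reproduce the 11x memory or 85% FLOPs figures, which the abstract reports for the full module.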

updated: Thu Jul 09 2020 12:17:28 GMT+0000 (UTC)

published: Wed Nov 28 2018 18:18:27 GMT+0000 (UTC)

arXiv
