D-Former: A U-shaped Dilated Transformer for 3D Medical Image Segmentation

Yixuan Wu; Kuanlun Liao; Jintai Chen; Jinhong Wang; Danny Z. Chen; Honghao Gao; Jian Wu

Computer-aided medical image segmentation is widely applied in diagnosis and treatment to obtain clinically useful information about the shapes and volumes of target organs and tissues. In the past several years, convolutional neural network (CNN) based methods (e.g., U-Net) have dominated this area, but they still struggle to capture long-range information. Hence, recent work has introduced vision Transformer variants for medical image segmentation tasks and obtained promising performance. Such Transformers model long-range dependencies by computing pair-wise patch relations. However, they incur prohibitive computational costs, especially on 3D medical images (e.g., CT and MRI). In this paper, we propose a new method called Dilated Transformer, which conducts self-attention for pair-wise patch relations captured alternately in local and global scopes. Inspired by dilated convolution kernels, we conduct the global self-attention in a dilated manner, enlarging receptive fields without increasing the number of patches involved and thus reducing computational costs. Based on this design of Dilated Transformer, we construct a U-shaped encoder-decoder hierarchical architecture called D-Former for 3D medical image segmentation. Experiments on the Synapse and ACDC datasets show that our D-Former model, trained from scratch, outperforms various competitive CNN-based and Transformer-based segmentation models at a low computational cost, without a time-consuming pre-training process.
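The alternation between local and dilated (global) attention scopes can be illustrated with a small sketch. The abstract does not specify the grouping scheme, so everything here is an assumption made for illustration: patches are laid out as a 1D sequence rather than a 3D grid, groups have a fixed size `g`, the function names `local_groups`, `dilated_groups`, and `grouped_self_attention` are hypothetical, and the attention omits the learned query/key/value projections a real Transformer would use.

```python
import numpy as np

def local_groups(n, g):
    """Assumed local scope: contiguous groups of g neighbouring patches."""
    assert n % g == 0
    return np.arange(n).reshape(n // g, g)

def dilated_groups(n, g):
    """Assumed global scope: groups of g patches sampled at stride n // g."""
    assert n % g == 0
    return np.arange(n).reshape(g, n // g).T

def grouped_self_attention(x, groups):
    """Softmax self-attention restricted to each group (no learned projections)."""
    out = np.empty_like(x)
    d = x.shape[-1]
    for idx in groups:
        tokens = x[idx]                                # (g, d)
        scores = tokens @ tokens.T / np.sqrt(d)        # (g, g) pair-wise relations
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[idx] = weights @ tokens
    return out

n, g, d = 8, 4, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))

loc = local_groups(n, g)      # [[0, 1, 2, 3], [4, 5, 6, 7]]
dil = dilated_groups(n, g)    # [[0, 2, 4, 6], [1, 3, 5, 7]]

# Alternate the two scopes, as the abstract describes at a high level.
y = grouped_self_attention(grouped_self_attention(x, loc), dil)

full_pairs = n * n                 # pair-wise relations in full self-attention
grouped_pairs = (n // g) * g * g   # pair-wise relations per grouped layer
```

In this toy setting each dilated group spans almost the whole sequence while still containing only `g` patches, so a grouped layer computes `n * g` pair-wise relations instead of `n * n`. This is the sense in which a dilated scope can enlarge the receptive field without increasing the number of patches involved.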

updated: Mon Jan 10 2022 02:57:28 GMT+0000 (UTC)

published: Mon Jan 03 2022 03:20:35 GMT+0000 (UTC)

arXiv
