TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Yilan Zhang; Fengying Xie; Jianqi Chen; Jie Liu

Multi-modal skin lesion diagnosis (MSLD) has achieved remarkable success through modern computer-aided diagnosis technology based on deep convolutions. However, aggregating information across modalities in MSLD remains challenging because of severely unaligned spatial resolutions (dermoscopic image and clinical image) and heterogeneous data (dermoscopic image and patients' meta-data). Limited by their intrinsic local attention, most recent MSLD pipelines using pure convolutions struggle to capture representative features in shallow layers. Fusion across modalities is therefore usually performed at the end of the pipeline, even at the last layer, which leads to insufficient information aggregation. To tackle this issue, we introduce a pure transformer-based method, which we refer to as "Throughout Fusion Transformer (TFormer)", for sufficient information integration in MSLD. Unlike existing convolution-based approaches, the proposed network uses a transformer as the feature extraction backbone, yielding more representative shallow features. We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across the image modalities stage by stage. Given the aggregated information of the image modalities, a multi-modal transformer post-fusion (MTP) block is designed to integrate features across image and non-image data. This strategy of fusing the image modalities first and the heterogeneous data afterwards lets us divide and conquer the two major challenges while ensuring that inter-modality dynamics are effectively modeled. Experiments conducted on the public Derm7pt dataset validate the superiority of the proposed method. Our TFormer outperforms other state-of-the-art methods. Ablation experiments also suggest the effectiveness of our designs.
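
The fusion order described in the abstract (image modalities first, stage by stage, then image features with metadata) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attend`, `fuse_images_stagewise`, `post_fuse`), the token counts, and the use of plain scaled dot-product cross-attention without learned projections are all assumptions made for illustration. The actual HMT and MTP block designs are specified in the paper, not here.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over all context tokens.

    Plain scaled dot-product attention with a residual connection and
    no learned weights (an assumption made to keep the sketch short).
    The two token lists may have different lengths.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(d)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def fuse_images_stagewise(derm, clin, num_stages=3):
    """Hypothetical stand-in for a stack of dual-branch fusion stages.

    At every stage each image branch attends to the other branch,
    so fusion happens throughout the stack, not only at the end.
    """
    for _ in range(num_stages):
        derm, clin = cross_attend(derm, clin), cross_attend(clin, derm)
    return derm + clin  # concatenate the two token lists


def post_fuse(image_tokens, meta_tokens):
    """Hypothetical stand-in for post-fusion with non-image data.

    Image tokens attend to metadata tokens, then are mean-pooled
    into a single feature vector.
    """
    fused = cross_attend(image_tokens, meta_tokens)
    d = len(fused[0])
    return [sum(tok[i] for tok in fused) / len(fused) for i in range(d)]


random.seed(0)
dim = 8


def rand_tokens(n):
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


derm_tokens = rand_tokens(4)   # dermoscopic image tokens (assumed count)
clin_tokens = rand_tokens(6)   # clinical image tokens (different count)
meta_tokens = rand_tokens(3)   # embedded patient metadata (assumed)

image_tokens = fuse_images_stagewise(derm_tokens, clin_tokens)
feature = post_fuse(image_tokens, meta_tokens)
```

Because attention operates on token sets of arbitrary length, the two image branches do not need matching token counts, which is one way to see why attention-based fusion tolerates unaligned spatial resolutions.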

updated: Mon Nov 21 2022 12:07:05 GMT+0000 (UTC)

published: Mon Nov 21 2022 12:07:05 GMT+0000 (UTC)

arXiv
