TVConv: Efficient Translation Variant Convolution for Layout-aware Visual Processing

Jierun Chen; Tianlang He; Weipeng Zhuo; Li Ma; Sangtae Ha; S.-H. Gary Chan

Convolution has empowered many smart applications, and dynamic convolution further equips it with the ability to adapt to diverse inputs. However, static and dynamic convolutions are either layout-agnostic or computation-heavy, making them inappropriate for layout-specific applications such as face recognition and medical image segmentation. We observe that these applications naturally exhibit large intra-image (spatial) variance and small cross-image variance. This observation motivates our efficient translation variant convolution (TVConv) for layout-aware visual processing. Technically, TVConv is composed of affinity maps and a weight-generating block. The affinity maps depict pixel-paired relationships gracefully, while the weight-generating block can be explicitly overparameterized for better training while maintaining efficient inference. Although conceptually simple, TVConv significantly improves the efficiency of the convolution and can be readily plugged into various network architectures. Extensive experiments on face recognition show that, compared to depthwise convolution, TVConv reduces the computational cost by up to 3.1x and improves the corresponding throughput by 2.3x while maintaining high accuracy. Moreover, for the same computational cost, we boost the mean accuracy by up to 4.21%. We also conduct experiments on the optic disc/cup segmentation task and obtain better generalization performance, which helps mitigate the critical data scarcity issue. Code is available at https://github.com/JierunChen/TVConv.
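The abstract names the two components but does not spell out how they interact, so the following is only a minimal single-channel sketch of one plausible reading, not the authors' implementation. It assumes the affinity maps are input-independent planes with one value per pixel, and that the weight-generating block is a per-pixel linear map from the affinity vector to a k x k kernel. The names `generate_weights`, `tv_conv`, and `proj` are hypothetical, and the block in the paper may be deeper or structured differently.

```python
def generate_weights(affinity, proj):
    """Hypothetical weight-generating block: a per-pixel linear map.

    affinity: list of planes, each H x W (assumed input-independent).
    proj: k*k rows, each with one coefficient per affinity plane.
    Returns an H x W grid holding one flattened k*k kernel per pixel.
    """
    channels = len(affinity)
    height, width = len(affinity[0]), len(affinity[0][0])
    weights = []
    for y in range(height):
        row = []
        for x in range(width):
            vec = [affinity[c][y][x] for c in range(channels)]
            row.append([sum(p * v for p, v in zip(taps, vec)) for taps in proj])
        weights.append(row)
    return weights


def tv_conv(image, weights, k):
    """Apply a different k x k kernel at every pixel, with zero padding."""
    height, width = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        tap = (dy + r) * k + (dx + r)
                        acc += weights[y][x][tap] * image[yy][xx]
            out[y][x] = acc
    return out


# Toy setup (all values are illustrative assumptions):
# one affinity plane that is 1.0 on the left half and 2.0 on the right half.
K, H, W = 3, 4, 4
affinity = [[[1.0 if x < W // 2 else 2.0 for x in range(W)] for _ in range(H)]]
# The projection routes the affinity value to the centre tap only.
proj = [[1.0] if i == (K * K) // 2 else [0.0] for i in range(K * K)]

weights = generate_weights(affinity, proj)  # depends on layout, not on the image
image = [[float(y * W + x) for x in range(W)] for y in range(H)]
out = tv_conv(image, weights, K)  # left half unchanged, right half doubled
```

In this sketch the kernels depend only on the affinity planes, so they can be generated once and reused for every input image. That is one way to read the abstract's pairing of layout awareness with efficient inference, and it should be treated as an assumption rather than a description of the released code.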

updated: Tue Mar 22 2022 07:28:02 GMT+0000 (UTC)

published: Sun Mar 20 2022 08:29:06 GMT+0000 (UTC)

arXiv

