Improved Image Classification with Token Fusion

Keong Hun Choi; Jin Woo Kim; Yao Wang; Jong Eun Ha

In this paper, we propose a method that fuses CNN and transformer structures to improve image classification performance. A CNN extracts local information from an image well but is limited in its ability to extract global information. A transformer, by contrast, has an advantage in extracting relatively global information but requires a large amount of memory to extract local features. In the proposed method, the image is converted into a feature map by a CNN, and each pixel of the feature map is treated as a token. At the same time, the image is divided into patches, each of which is also treated as a token, as in the transformer approach, and the two kinds of tokens are fused. To fuse tokens with these two different characteristics, we propose three methods: (1) late token fusion with a parallel structure, (2) early token fusion, and (3) layer-by-layer token fusion. In experiments on ImageNet 1k, the proposed method shows the best classification performance.
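The abstract does not give implementation details, so the following is only a minimal NumPy sketch of how the two token types might be built and combined, not the authors' implementation. The function names (`patch_tokens`, `feature_map_tokens`, `early_fusion`, `late_fusion`), the token dimension, the concatenation-based fusion, and the mean pooling used for late fusion are all assumptions made for illustration. The CNN and the transformer encoders are left out; a random array stands in for the CNN feature map.

```python
import numpy as np


def patch_tokens(image, patch, w_proj):
    """Split an (H, W, C) image into non-overlapping patches and
    linearly project each flattened patch to a D-dimensional token."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch]
    x = x.reshape(gh, patch, gw, patch, c)
    # Bring the two grid axes together, then flatten each patch.
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return x @ w_proj  # shape: (gh * gw, D)


def feature_map_tokens(feature_map):
    """Treat every spatial position ("pixel") of an (h, w, D)
    CNN feature map as one D-dimensional token."""
    h, w, d = feature_map.shape
    return feature_map.reshape(h * w, d)


def early_fusion(cnn_tok, patch_tok):
    """Assumed early fusion: concatenate both token sets along the
    sequence axis so a single encoder would see all of them."""
    return np.concatenate([cnn_tok, patch_tok], axis=0)


def late_fusion(cnn_out, patch_out):
    """Assumed late fusion: pool each parallel stream separately,
    then concatenate the pooled feature vectors."""
    return np.concatenate([cnn_out.mean(axis=0), patch_out.mean(axis=0)])


rng = np.random.default_rng(0)
dim = 64                                        # hypothetical token width
image = rng.standard_normal((224, 224, 3))
w_proj = rng.standard_normal((16 * 16 * 3, dim))
fmap = rng.standard_normal((7, 7, dim))         # stand-in for a CNN output

p_tok = patch_tokens(image, 16, w_proj)         # 14 x 14 = 196 patch tokens
c_tok = feature_map_tokens(fmap)                # 7 x 7 = 49 feature-map tokens
fused_early = early_fusion(c_tok, p_tok)        # 245 tokens of width 64
fused_late = late_fusion(c_tok, p_tok)          # one vector of width 128
```

In this sketch, early fusion yields a single longer token sequence, while late fusion keeps the two streams separate until a final pooled representation. Layer-by-layer fusion, which would exchange tokens between the streams at every encoder layer, is not shown.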

updated: Fri Aug 19 2022 07:02:50 GMT+0000 (UTC)

published: Fri Aug 19 2022 07:02:50 GMT+0000 (UTC)

arXiv
