TFormer: A Transmission-Friendly ViT Model for IoT Devices

Zhichao Lu; Chuntao Ding; Felix Juefei-Xu; Vishnu Naresh Boddeti; Shangguang Wang; Yun Yang

Deploying high-performance vision transformer (ViT) models on ubiquitous Internet of Things (IoT) devices to provide high-quality vision services will revolutionize the way we live, work, and interact with the world. Because IoT devices have limited resources while ViT models are resource-intensive, using cloud servers to assist ViT model training has become mainstream. However, existing ViT models have large numbers of parameters and floating-point operations (FLOPs), so the models are costly for cloud servers to transmit and difficult to run on resource-constrained IoT devices. To address this, the paper proposes TFormer, a transmission-friendly ViT model for deployment on resource-constrained IoT devices with the assistance of a cloud server. TFormer's high performance and small number of parameters and FLOPs are attributed to two proposed components: a hybrid layer and a partially connected feed-forward network (PCS-FFN). The hybrid layer consists of nonlearnable modules and a pointwise convolution, and obtains multitype and multiscale features with only a few parameters and FLOPs, which improves TFormer's performance. The PCS-FFN adopts group convolution to reduce the number of parameters. The key idea is to keep TFormer's parameter and FLOP counts low so that applications running on resource-constrained IoT devices can benefit from the high performance of ViT models. Experimental results on the ImageNet-1K, MS COCO, and ADE20K datasets for image classification, object detection, and semantic segmentation demonstrate that the proposed model outperforms other state-of-the-art models. Specifically, TFormer-S achieves 5% higher accuracy on ImageNet-1K than ResNet18 with 1.4× fewer parameters and FLOPs.
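As a rough illustration of why group convolution reduces the parameter count, the sketch below counts the weights of a feed-forward block built from two pointwise (1×1) convolutions, with and without grouping. It is not the paper's PCS-FFN: the helper names `conv_params` and `ffn_params`, the channel width of 64, the expansion ratio of 4, and the group count of 4 are all assumptions chosen for the arithmetic, and the abstract does not state the actual values.

```python
def conv_params(c_in, c_out, kernel_size=1, groups=1, bias=False):
    """Count the weights of a 2D convolution with the given grouping.

    Each output channel only sees c_in / groups input channels, so the
    weight count shrinks by a factor of `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * kernel_size * kernel_size
    return weights + (c_out if bias else 0)


def ffn_params(dim, expansion=4, groups=1):
    """Weights of a hypothetical FFN: dim -> dim * expansion -> dim,
    both layers as pointwise convolutions without bias."""
    hidden = dim * expansion
    expand = conv_params(dim, hidden, groups=groups)
    project = conv_params(hidden, dim, groups=groups)
    return expand + project


# Illustrative sizes only; not taken from the paper.
dense = ffn_params(64)               # fully connected pointwise FFN
grouped = ffn_params(64, groups=4)   # same shapes, 4 groups
print(dense, grouped, dense / grouped)
```

With these assumed sizes the grouped block holds one quarter of the weights of the dense one, since every convolution's weight count scales as 1/groups. Fewer weights also means fewer bytes for the cloud server to transmit, which is the motivation the abstract gives.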

updated: Wed Feb 15 2023 15:36:10 GMT+0000 (UTC)

published: Wed Feb 15 2023 15:36:10 GMT+0000 (UTC)

arXiv
