Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead

Arian Bakhtiarnia; Qi Zhang; Alexandros Iosifidis

Deploying deep learning models in time-critical applications with limited computational resources, such as edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy, that together significantly increase the accuracy of early exit branches compared to conventional approaches while introducing less overhead. Through extensive experiments on image and audio classification as well as audiovisual crowd counting, we show that our method works for both classification and regression problems, and in both single- and multi-modal settings. Additionally, we introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis, which can lead to more fine-grained dynamic inference.
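
For readers unfamiliar with early exiting, the general idea can be sketched as follows. This is a generic confidence-threshold illustration, not the single-layer vision transformer branches or the fine-tuning strategy proposed in the paper. The abstract does not state an exit criterion, so the softmax-confidence rule, the `threshold` value, and the names `early_exit_predict`, `stages`, and `heads` are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    x: Vector,
    stages: Sequence[Layer],
    heads: Sequence[Layer],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Run backbone stages in order and stop at the first confident exit.

    stages[i] is a hypothetical backbone block; heads[i] is the exit branch
    attached after it (the last head plays the role of the final classifier).
    Returns (predicted_class, index_of_exit_used).
    """
    if not stages or len(stages) != len(heads):
        raise ValueError("need one exit head per stage, and at least one stage")

    h = x
    last = len(stages) - 1
    for i, (stage, head) in enumerate(zip(stages, heads)):
        h = stage(h)                      # compute features up to this depth
        probs = softmax(head(h))          # exit branch prediction
        confidence = max(probs)
        # Assumed criterion: exit once the branch is confident enough,
        # or unconditionally at the final classifier.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i
    raise AssertionError("unreachable")
```

With toy stages, an easy input leaves at the first exit and skips the remaining computation, while a harder input runs through to the final classifier. The trade-off that the abstract refers to is visible here: every exit branch adds its own cost, so a branch is only worthwhile if it is both cheap and accurate.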

updated: Wed May 19 2021 13:30:34 GMT+0000 (UTC)

published: Wed May 19 2021 13:30:34 GMT+0000 (UTC)

arXiv
