FocusFormer: Focusing on What We Need via Architecture Sampler

Jing Liu; Jianfei Cai; Bohan Zhuang

Vision Transformers (ViTs) have underpinned the recent breakthroughs in computer vision. However, designing ViT architectures is laborious and relies heavily on expert knowledge. To automate the design process and incorporate deployment flexibility, one-shot neural architecture search decouples supernet training from architecture specialization for diverse deployment scenarios. To cope with the enormous number of sub-networks in the supernet, existing methods treat all architectures as equally important and randomly sample some of them in each update step during training. During architecture search, however, these methods focus on finding architectures on the Pareto frontier of performance and resource consumption. This mismatch creates a gap between training and deployment. In this paper, we devise a simple yet effective method, called FocusFormer, to bridge this gap. To this end, we propose to learn an architecture sampler that assigns higher sampling probabilities to architectures on the Pareto frontier under different resource constraints during supernet training, so that they are sufficiently optimized and their performance improves. During specialization, we can directly use the well-trained architecture sampler to obtain accurate architectures that satisfy the given resource constraint, which significantly improves search efficiency. Extensive experiments on CIFAR-100 and ImageNet show that FocusFormer improves the performance of the searched architectures while significantly reducing the search cost. For example, on ImageNet, our FocusFormer-Ti with 1.4G FLOPs outperforms AutoFormer-Ti by 0.5% in terms of Top-1 accuracy.
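To make the notion of a Pareto frontier of performance and resource consumption concrete, the following is a minimal sketch, not the authors' method. The abstract describes a learned architecture sampler; this sketch instead substitutes a fixed hand-set boost factor to bias sampling toward frontier architectures. The candidate names, FLOPs, accuracy values, and the `boost` parameter are all hypothetical and chosen only for illustration.

```python
import random


def pareto_frontier(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is a (name, flops, accuracy) tuple. Lower FLOPs and
    higher accuracy are better. A candidate is dominated if another one
    is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for name, flops, acc in candidates:
        dominated = any(
            (f <= flops and a >= acc) and (f < flops or a > acc)
            for _, f, a in candidates
        )
        if not dominated:
            frontier.append((name, flops, acc))
    return frontier


def sampling_weights(candidates, boost=4.0):
    """Assign normalized sampling probabilities, favoring the frontier.

    `boost` is a hypothetical fixed factor standing in for what a learned
    sampler would produce; it is an assumption of this sketch.
    """
    front = {name for name, _, _ in pareto_frontier(candidates)}
    raw = [boost if name in front else 1.0 for name, _, _ in candidates]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical sub-networks: (name, GFLOPs, top-1 accuracy in %).
candidates = [
    ("A", 1.0, 70.0),
    ("B", 1.5, 74.0),
    ("C", 1.5, 72.0),  # dominated by B (same FLOPs, lower accuracy)
    ("D", 2.0, 73.0),  # dominated by B (more FLOPs, lower accuracy)
    ("E", 2.5, 76.0),
]

frontier_names = [name for name, _, _ in pareto_frontier(candidates)]
weights = sampling_weights(candidates)

# Draw one architecture for a training step, biased toward the frontier.
sampled = random.choices(candidates, weights=weights, k=1)[0]
```

With uniform sampling, the dominated candidates C and D would receive as many updates as A, B, and E. The weighted draw shifts training effort toward the architectures that a later search would actually select, which is the gap the abstract sets out to close.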

updated: Tue Aug 23 2022 10:42:56 GMT+0000 (UTC)

published: Tue Aug 23 2022 10:42:56 GMT+0000 (UTC)

arXiv
