Learning Rich Representations For Structured Visual Prediction Tasks

構造化された視覚的予測タスクのための豊富な表現の学習

画像の豊富な表現を学習するアプローチについて説明します。これにより、空間的に構造化されたマップを含むさまざまな視覚タスクで、シンプルで効果的な予測が可能になります。私たちの重要なアイデアは、小さな画像要素を、空間的範囲が増加するネスト領域のシーケンスから抽出された特徴表現にマッピングすることです。これらの領域は、ピクセル/スーパーピクセルからシーンレベルの解像度まで「ズームアウト」することで取得されるため、これらのズームアウト機能と呼ばれます。セマンティックセグメンテーションおよびその他の構造化予測タスクに適用されるこのアプローチは、明示的な構造化予測メカニズムを設定せずに、画像およびラベル空間の統計構造を活用し、複雑で高価な推論を回避します。代わりに、画像要素は、ズームアウトレベルにまたがるスキップレイヤー接続を備えたフィードフォワード多層ネットワークによって分類されます。 ResNet、DenseNet、NASNetなどの最新のニューラルアーキテクチャと組み合わせて使用すると（補完的）、セグメンテーションベンチマークで競争力のある精度を実現します。さらに、純粋に画像レベルの分類タグからカテゴリレベルのセマンティックセグメンテーションを学習するためのアプローチを提案します。分類タスク用に調整された変更されたズームアウトアーキテクチャのトレーニングから生じるローカライズキューを活用して、対象クラスに属する可能性があるまばらで多様なトレーニングセットを自動的にラベル付けする弱監視プロセスを駆動します。最後に、CNNの教師付きトレーニング用のデータ駆動型の正則化関数を紹介します。私たちの革新は、一連の注釈についてオートエンコーダーを学習することで派生したレギュラーの形を取ります。このアプローチは、ラベル空間の改善された表現を活用して、画像からの特徴の抽出を通知します

We describe an approach to learning rich representations for images, that enables simple and effective predictors in a range of vision tasks involving spatially structured maps. Our key idea is to map small image elements to feature representations extracted from a sequence of nested regions of increasing spatial extent. These regions are obtained by "zooming out" from the pixel/superpixel all the way to scene-level resolution, and hence we call these zoom-out features. Applied to semantic segmentation and other structured prediction tasks, our approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead image elements are classified by a feedforward multilayer network with skip-layer connections spanning the zoom-out levels. When used in conjunction with modern neural architectures such as ResNet, DenseNet and NASNet (to which it is complementary) our approach achieves competitive accuracy on segmentation benchmarks. In addition, we propose an approach for learning category-level semantic segmentation purely from image-level classification tag. It exploits localization cues that emerge from training a modified zoom-out architecture tailored for classification tasks, to drive a weakly supervised process that automatically labels a sparse, diverse training set of points likely to belong to classes of interest. Finally, we introduce data-driven regularization functions for the supervised training of CNNs. Our innovation takes the form of a regularizer derived by learning an autoencoder over the set of annotations. This approach leverages an improved representation of label space to inform extraction of features from images

updated: Fri Aug 30 2019 16:18:26 GMT+0000 (UTC)

published: Fri Aug 30 2019 16:18:26 GMT+0000 (UTC)

arXiv

参考文献 (このサイトで利用可能なもの) / References (only if available on this site)

被参照文献 (このサイトで利用可能なものを新しい順に) / Citations (only if available on this site, in order of most recent)

Amazon.co.jpアソシエイト