HindSight: A Graph-Based Vision Model Architecture For Representing Part-Whole Hierarchies

Muhammad AbdurRafae

This paper presents a model architecture for encoding part-whole hierarchies in images in the form of a graph. The idea is to divide the image into patches at several levels and treat all of these patches as nodes of a fully connected graph. A dynamic feature extraction module extracts feature representations from these patches in each graph iteration. This makes it possible to learn a rich graph representation of the image that captures its inherent part-whole hierarchical information. With suitable self-supervised training techniques, such a model can be trained as a general-purpose vision encoder and then used for various downstream vision tasks (e.g., image classification, object detection, image captioning).
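As a rough illustration of the graph construction described in the abstract, the sketch below splits an image into grids of patches at several levels and connects every patch to every other patch. It is only a minimal sketch under stated assumptions: the abstract does not specify the patch sizes, the number of levels, or the form of the dynamic feature extraction module, so the `levels` grid sizes and the function names `multilevel_patches` and `fully_connected_adjacency` are hypothetical, and feature extraction is omitted entirely.

```python
import numpy as np


def multilevel_patches(image, levels=(1, 2, 4)):
    """Split an H x W x C image into a g x g grid of patches for each g in `levels`.

    Returns a list of (level_index, row, col, patch) tuples; each tuple is one
    graph node. The grid sizes in `levels` are an assumption for illustration.
    """
    h, w = image.shape[:2]
    nodes = []
    for level, g in enumerate(levels):
        # Patch size at this level; leftover pixels at the border are dropped.
        ph, pw = h // g, w // g
        for r in range(g):
            for c in range(g):
                patch = image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
                nodes.append((level, r, c, patch))
    return nodes


def fully_connected_adjacency(num_nodes):
    """Adjacency matrix of a fully connected graph without self-loops."""
    return np.ones((num_nodes, num_nodes), dtype=np.int8) - np.eye(num_nodes, dtype=np.int8)


# Placeholder 32 x 32 RGB image; any array of shape (H, W, C) works.
image = np.zeros((32, 32, 3), dtype=np.float32)
nodes = multilevel_patches(image)
adj = fully_connected_adjacency(len(nodes))
```

With the assumed grids of 1, 2, and 4 patches per side, the image yields 1 + 4 + 16 = 21 nodes, so coarse patches (wholes) and fine patches (parts) sit in the same graph and can exchange information across levels.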

updated: Thu Apr 08 2021 12:17:54 GMT+0000 (UTC)

published: Thu Apr 08 2021 12:17:54 GMT+0000 (UTC)

arXiv
