Disentangling Neuron Representations with Concept Vectors

Laura O'Mahony; Vincent Andrearczyk; Henning Müller; Mara Graziani

Mechanistic interpretability aims to understand how models store representations by breaking down neural networks into interpretable units. However, the occurrence of polysemantic neurons, or neurons that respond to multiple unrelated features, makes interpreting individual neurons challenging. This has led to the search for meaningful vectors, known as concept vectors, in activation space instead of individual neurons. The main contribution of this paper is a method to disentangle polysemantic neurons into concept vectors encapsulating distinct features. Our method can search for fine-grained concepts according to the user's desired level of concept separation. The analysis shows that polysemantic neurons can be disentangled into directions consisting of linear combinations of neurons. Our evaluations show that the concept vectors found encode coherent, human-understandable features.
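The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that the inputs which most strongly activate a chosen neuron are clustered in the layer's activation space, and that each cluster mean is taken as a candidate concept vector, i.e. a direction that is a linear combination of neurons. The function name `concept_vectors` and the parameters `top_k` and `threshold` are hypothetical; `threshold` stands in for the user's desired level of concept separation.

```python
import numpy as np

def concept_vectors(acts, neuron, top_k=20, threshold=0.3):
    """Split one neuron into candidate concept directions (illustrative only).

    acts:      (n_inputs, n_neurons) array of layer activations (assumed given).
    neuron:    index of the possibly polysemantic neuron.
    top_k:     number of top-activating inputs to cluster (hypothetical knob).
    threshold: cosine distance above which clusters are kept separate
               (hypothetical stand-in for the level of concept separation).
    """
    acts = np.asarray(acts, dtype=float)
    # 1. Keep the inputs that activate the chosen neuron most strongly.
    top = np.argsort(acts[:, neuron])[-top_k:]
    x = acts[top]
    # 2. Normalise rows so clustering compares directions, not magnitudes.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    # 3. Greedy agglomerative clustering on centroid cosine distance:
    #    repeatedly merge the closest pair until none is within `threshold`.
    clusters = [[i] for i in range(len(x))]
    while len(clusters) > 1:
        cents = np.stack([x[c].mean(axis=0) for c in clusters])
        cents /= np.linalg.norm(cents, axis=1, keepdims=True)
        dist = 1.0 - cents @ cents.T
        np.fill_diagonal(dist, np.inf)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        if dist[i, j] > threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    # 4. Each concept vector is the unit-norm mean of one cluster.
    vecs = np.stack([x[c].mean(axis=0) for c in clusters])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
```

In this sketch a smaller `threshold` keeps more clusters apart and so returns finer-grained directions, while a larger one merges them into fewer, coarser directions.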

updated: Wed Apr 19 2023 14:55:31 GMT+0000 (UTC)

published: Wed Apr 19 2023 14:55:31 GMT+0000 (UTC)

arXiv
