CV4Code: Sourcecode Understanding via Visual Code Representations

Ruibo Shi; Lili Tao; Rohan Saphal; Fran Silavong; Sean J. Moran

We present CV4Code, a compact and effective computer vision method for source code understanding. Our method exploits the contextual and structural information in a code snippet by treating each snippet as a two-dimensional image, which naturally encodes context and retains the underlying structure through an explicit spatial representation. To encode snippets as images, we propose an ASCII codepoint-based image representation that enables fast generation of source code images and eliminates the encoding redundancy that would arise from an RGB pixel representation. Furthermore, because source code is treated as an image, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the method agnostic to programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code, which is not possible with methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by training Convolutional and Transformer networks to predict the functional task of the source code (i.e. the problem it solves) directly from its two-dimensional representation, and by using an embedding from the latent space to derive a similarity score between two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance compared with other methods under the same task and data configurations. For the first time, we show the benefits of treating source code understanding as a form of image processing task.
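
To make the ASCII codepoint-based representation concrete, the following is a minimal Python sketch rather than the authors' implementation. It maps each character of a snippet to its ASCII code point at the matching (row, column) position in a fixed-size grid, so no tokeniser or parser is involved and syntactically invalid code is handled like any other text. The function names `code_to_image` and `cosine_similarity`, the grid size, the zero padding, the cropping of long lines, the tab expansion, and the use of cosine similarity as the similarity score are all assumptions for illustration; the abstract does not specify them.

```python
import math
from typing import List


def code_to_image(snippet: str, height: int = 8, width: int = 16,
                  pad: int = 0) -> List[List[int]]:
    """Map a snippet to a height x width grid of ASCII code points.

    Assumptions (not taken from the paper): lines and columns beyond the
    grid are cropped, empty cells and non-ASCII characters become `pad`,
    and tabs are expanded to 4 spaces.
    """
    grid = [[pad] * width for _ in range(height)]
    for row, line in enumerate(snippet.splitlines()[:height]):
        for col, ch in enumerate(line.expandtabs(4)[:width]):
            code = ord(ch)
            grid[row][col] = code if code < 128 else pad
    return grid


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two embedding vectors (assumed scoring choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Syntactically invalid Python still yields a valid single-channel image.
    image = code_to_image("def f(:\n\treturn 1", height=3, width=6)
    for row in image:
        print(row)
```

In a retrieval setup of the kind the abstract describes, a trained network would map each such grid to an embedding vector, and a score such as `cosine_similarity` would then rank candidate snippets against a query.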

updated: Wed May 11 2022 13:02:35 GMT+0000 (UTC)

published: Wed May 11 2022 13:02:35 GMT+0000 (UTC)

arXiv
