Weaponizing Unicodes with Deep Learning -- Identifying Homoglyphs with Weakly Labeled Data

Perry Deng; Cooper Linsky; Matthew Wright

Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand an attacker's ability to identify homoglyphs -- particularly ones that have not been previously spotted -- and to leverage them in attacks. We investigate a deep-learning model that uses embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of the weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners who need to quickly look up homoglyphs or normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) that builds on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find early indications that many of these may be true positives. Source code and the list of predicted homoglyphs are available on GitHub: https://github.com/PerryXDeng/weaponizing_unicode
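As a rough illustration of two building blocks the abstract names, the sketch below computes the classic Intersection-Over-Union between two character sets and the standard Normalized Compression Distance between two byte strings. It is not the authors' code: the function names `iou` and `ncd`, the use of `zlib` as the compressor, and the example character sets are assumptions made here for illustration, and the mBIOU metric itself is not reproduced because the abstract does not define it.

```python
import zlib


def iou(a: set, b: set) -> float:
    """Classic Intersection-Over-Union (Jaccard index) of two sets."""
    union = a | b
    if not union:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with zlib as an assumed compressor.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length in bytes.
    """
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical clusters: Latin "O", digit "0", Greek Omicron (U+039F),
# Cyrillic O (U+041E), and Latin "o".
predicted = {"O", "0", "\u039f", "\u041e"}
reference = {"O", "\u039f", "\u041e", "o"}

# Three shared characters out of five distinct ones.
print(iou(predicted, reference))
```

For NCD, the byte strings stand in for whatever serialized glyph representation is being compared; values near 0 suggest similar inputs and values near 1 suggest dissimilar ones, though the exact numbers depend on the compressor.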

updated: Tue Dec 22 2020 18:11:46 GMT+0000 (UTC)

published: Fri Oct 09 2020 06:03:18 GMT+0000 (UTC)

arXiv
