X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics

Yehao Li; Yingwei Pan; Jingwen Chen; Ting Yao; Tao Mei

With the rise of deep learning over the past decade, a steady stream of innovations and breakthroughs has pushed the state of the art in cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has been no open-source codebase that supports training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that encapsulates state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage covers a series of modules widely adopted in state-of-the-art methods and allows seamless switching among them. This design naturally enables flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, with the aim of facilitating the rapid development of the research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be readily extended to power startup prototypes for other cross-modal analytics tasks, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source code, sample projects, and pre-trained models are available online: https://github.com/YehLi/xmodaler.
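The stage-based design described in the abstract can be illustrated with a toy registry in which each stage holds interchangeable modules selected by name. This is only a minimal sketch of the general idea, not X-modaler's actual API: the names `REGISTRY`, `register`, `build_pipeline`, and the example modules are all hypothetical, and the "modules" are plain functions on a dictionary rather than neural networks.

```python
from typing import Callable, Dict

# Hypothetical stage names, loosely following the stages listed in the abstract.
STAGES = ["preprocess", "encoder", "interaction", "decoder", "decode_strategy"]

# Hypothetical registry: stage name -> module name -> module function.
Module = Callable[[dict], dict]
REGISTRY: Dict[str, Dict[str, Module]] = {}


def register(stage: str, name: str) -> Callable[[Module], Module]:
    """Decorator that files a module under a stage so it can be picked by name."""
    def wrap(fn: Module) -> Module:
        REGISTRY.setdefault(stage, {})[name] = fn
        return fn
    return wrap


@register("preprocess", "lowercase")
def lowercase(state: dict) -> dict:
    # Toy pre-processing: lowercase the text and split on whitespace.
    return {**state, "tokens": state["text"].lower().split()}


@register("encoder", "length")
def length_encoder(state: dict) -> dict:
    # Toy encoder: represent each token by its character count.
    return {**state, "features": [len(t) for t in state["tokens"]]}


@register("encoder", "identity")
def identity_encoder(state: dict) -> dict:
    # Alternative toy encoder: pass tokens through unchanged.
    return {**state, "features": list(state["tokens"])}


def build_pipeline(config: Dict[str, str]) -> Module:
    """Chain the configured module of each stage, in the fixed stage order."""
    steps = [REGISTRY[s][config[s]] for s in STAGES if s in config]

    def run(state: dict) -> dict:
        for step in steps:
            state = step(state)
        return state

    return run


# Swapping the encoder only requires changing one config entry.
run_a = build_pipeline({"preprocess": "lowercase", "encoder": "length"})
run_b = build_pipeline({"preprocess": "lowercase", "encoder": "identity"})
```

In this sketch, switching between modules within a stage is a configuration change rather than a code change, which is the property the abstract attributes to the modular design.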

updated: Wed Aug 18 2021 16:05:30 GMT+0000 (UTC)

published: Wed Aug 18 2021 16:05:30 GMT+0000 (UTC)

arXiv
