WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Conghui He; Zhenjiang Jin; Chao Xu; Jiantao Qiu; Bin Wang; Wei Li; Hang Yan; JiaQi Wang; Dahua Lin

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "Wan Juan", a large-scale multimodal dataset of Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2 TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations compared with models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.

updated: Mon Aug 21 2023 14:40:48 GMT+0000 (UTC)

published: Mon Aug 21 2023 14:40:48 GMT+0000 (UTC)

arXiv
