Scaling Laws for Autoregressive Generative Modeling

Tom Henighan; Jared Kaplan; Mor Katz; Mark Chen; Christopher Hesse; Jacob Jackson; Heewoo Jun; Tom B. Brown; Prafulla Dhariwal; Scott Gray; Chris Hallacy; Benjamin Mann; Alec Radford; Aditya Ramesh; Nick Ryder; Daniel M. Ziegler; John Schulman; Dario Amodei; Sam McCandlish

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image↔text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information-theoretic interpretation as S(True) + D_KL(True||Model), and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8×8 resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e., D_KL) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we fine-tune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
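As a rough illustration of two ideas in the abstract, the sketch below checks the decomposition of cross-entropy into S(True) + D_KL(True||Model) on a toy discrete distribution, and shows how a power-law plus constant curve could be inverted to forecast the model size needed for a target reducible loss. The parametrization L(x) = L_inf + (x0 / x)^alpha, the function names, and every numeric value are assumptions made for illustration; none of them are fitted values or code from the paper.

```python
import math


def entropy(p):
    """Shannon entropy S(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy of model q under true distribution p, in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def scaling_law(x, l_inf, x0, alpha):
    """Assumed power-law plus constant form: l_inf + (x0 / x) ** alpha.

    l_inf plays the role of the irreducible loss (the entropy of the data),
    and the power-law term plays the role of the reducible loss (the KL term).
    """
    return l_inf + (x0 / x) ** alpha


def size_for_reducible_loss(target, x0, alpha):
    """Solve (x0 / x) ** alpha = target for x."""
    return x0 * target ** (-1.0 / alpha)


# Toy distributions (hypothetical, chosen only to exercise the identity).
p_true = [0.5, 0.3, 0.2]
q_model = [0.4, 0.4, 0.2]
gap = cross_entropy(p_true, q_model) - (
    entropy(p_true) + kl_divergence(p_true, q_model)
)
print(f"decomposition gap: {gap:.2e}")

# Hypothetical scaling parameters, not taken from the paper.
L_INF, X0, ALPHA = 2.0, 1e4, 0.2
needed = size_for_reducible_loss(0.01, X0, ALPHA)
print(f"model size for reducible loss 0.01: {needed:.3g}")
```

Under this assumed form, the constant term is read as an estimate of the true distribution's entropy and the power-law term as the KL divergence, which is why a target reducible loss can be turned into a required model size by inverting only the power-law term.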

updated: Wed Oct 28 2020 02:17:24 GMT+0000 (UTC)

published: Wed Oct 28 2020 02:17:24 GMT+0000 (UTC)

arXiv
