In recent years, there has been an increased interest in foundation models for geoscience due to the vast amount of Earth observing satellite imagery. Existing remote sensing foundation models make use of the various sources of spectral imagery to create large models pretrained on the task of masked reconstruction. In this paper, we present a foundation model framework, where the pretraining task captures the causal relationship between multiple modalities. Our framework leverages the knowledge guided principles that the spectral imagery captures the impact of the physical drivers on the environmental system, and that the relationship between them is governed by the characteristics of the system. Specifically, our method, called MultiModal Variable Step Forecasting (MM-VSF), uses forecasting of satellite imagery as a pretraining task and is able to capture the causal relationship between spectral imagery and weather. In our evaluation we show that the forecasting of satellite imagery using weather can be used as an effective pretraining task for foundation models. We further show the effectiveness of the embeddings produced by MM-VSF on the downstream tasks of pixel wise crop mapping and missing image prediction of spectral imagery, when compared with embeddings created by models trained in alternative pretraining settings including the traditional single modality input masked reconstruction.