This paper proposes a method for unsupervised whole-image clustering of a target dataset of remote sensing scenes with no labels. The method consists of three main steps: (1) finetuning a pretrained deep neural network (DINOv2) on a labelled source remote sensing imagery dataset and using it to extract a feature vector from each image in the target dataset, (2) reducing the dimension of these deep features via manifold projection into a low-dimensional Euclidean space, and (3) clustering the embedded features using a Bayesian nonparametric technique to infer the number and membership of clusters simultaneously. The method takes advantage of heterogeneous transfer learning to cluster unseen data with different feature and label distributions. We demonstrate the performance of this approach outperforming state-of-the-art zero-shot classification methods on several remote sensing scene classification datasets.