This study introduces a new framework for artificial intelligence-assisted characterization of Gram-stained whole-slide images (WSIs). As a test for diagnosing bloodstream infections, the Gram stain provides critical early data to inform patient treatment. Rapid and reliable Gram stain analysis has been associated with better clinical outcomes, underscoring the need for improved tools that automate it. In this work, we developed a novel transformer-based model for Gram-stained WSI classification. Because it does not require manual patch-level annotations, the model scales to large datasets more readily than previous convolutional neural network (CNN)-based methods. We also introduce a large Gram stain dataset from Dartmouth-Hitchcock Medical Center (Lebanon, New Hampshire, USA) to evaluate our model on the classification of five major categories of Gram-stained WSIs: Gram-positive cocci in clusters, Gram-positive cocci in pairs/chains, Gram-positive rods, Gram-negative rods, and slides with no bacteria. Our model achieves a classification accuracy of 0.858 (95% CI: 0.805, 0.905) and an AUC of 0.952 (95% CI: 0.922, 0.976) using five-fold nested cross-validation on our 475-slide dataset, demonstrating the potential of large-scale transformer models for Gram stain classification. We further demonstrate the generalizability of our trained model, which achieves strong performance on external datasets without additional fine-tuning.
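
The five-fold nested cross-validation mentioned above can be illustrated with a minimal sketch of how slide indices might be partitioned. This is not the study's implementation: the function names `k_fold_indices` and `nested_cv_splits`, the choice of five inner folds, the random seeds, and the absence of class stratification are all assumptions made for illustration. Only the slide count (475) and the five outer folds come from the text.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle range(n_items) and deal it into k folds of near-equal size."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def nested_cv_splits(n_items, outer_k=5, inner_k=5, seed=0):
    """Yield (outer, inner, train, val, test) for every outer/inner fold pair.

    Hypothetical helper: inner_k=5 is an assumption, not taken from the study.
    """
    outer_folds = k_fold_indices(n_items, outer_k, seed)
    for o, test in enumerate(outer_folds):
        # Slides outside the held-out outer fold are used for model selection.
        remaining = [i for j, fold in enumerate(outer_folds) if j != o for i in fold]
        inner_folds = k_fold_indices(len(remaining), inner_k, seed + o + 1)
        for v, val_positions in enumerate(inner_folds):
            # Inner folds hold positions into `remaining`; map them back to slide indices.
            val = [remaining[p] for p in val_positions]
            val_set = set(val)
            train = [i for i in remaining if i not in val_set]
            yield o, v, train, val, test


# 475 slides, as in the dataset described above.
splits = list(nested_cv_splits(475))
outer, inner, train, val, test = splits[0]
print(len(splits), len(train), len(val), len(test))
```

In this arrangement, each outer test fold is never seen during the inner loop, so hyperparameters chosen on the inner validation folds do not leak into the reported test metrics. A stratified variant that preserves the class balance of the five categories in every fold would be a natural refinement.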