AI technology for generating images, such as diffusion models, has advanced rapidly. However, there is no established framework for quantifying the reliability of AI-generated images, which hinders their use in critical decision-making tasks such as medical image diagnosis. In this study, we propose a method to quantify, within a statistical testing framework, the reliability of decision-making tasks that rely on images produced by diffusion models. The core idea of our statistical test is to use a selective inference framework, in which the test is conducted conditional on the fact that the images are produced by a trained diffusion model. As a case study, we consider a diffusion model-based anomaly detection task for medical images. With our approach, the statistical significance of medical image diagnostic outcomes can be quantified in terms of a p-value, enabling decision-making with a controlled error rate. We demonstrate the theoretical soundness and practical effectiveness of our statistical test through numerical experiments on both synthetic and brain image datasets.