Learning Invariant Causal Mechanism from Vision-Language Models
Large-scale pre-trained vision-language models such as CLIP have been widely applied to a variety of downstream scenarios. In real-world applications, the CLIP model is often used in more diverse scenarios than those encountered during its training, a challenge known as the out-of-distribution (OOD) problem. Our experiments reveal that CLIP performs unsatisfactorily in certain domains. Through a causal analysis, we find that CLIP's current prediction process cannot guarantee a low OOD risk. The lowest OOD risk can be achieved when the prediction process is based on invariant causal mechanisms, i.e., predicting solely based on invariant latent factors. However, theoretical analysis indicates that CLIP does not identify these invariant latent factors. Therefore, we propose the Invariant Causal Mechanism for CLIP (CLIP-ICM), a framework that first identifies invariant latent factors using interventional data and then performs invariant predictions across various domains. Our method is simple yet effective and incurs no significant computational overhead. Experimental results demonstrate that CLIP-ICM significantly improves CLIP's performance in OOD scenarios.
updated: Mon Aug 12 2024 10:53:03 GMT+0000 (UTC)
published: Fri May 24 2024 07:22:35 GMT+0000 (UTC)