# CLIP: Contrastive Language-Image Pre-training

Source: [DALL-E - Wikipedia](https://en.wikipedia.org/wiki/DALL-E)

DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training).[20](https://en.wikipedia.org/wiki/DALL-E#cite_note-mittr-20) CLIP is a separate model whose contrastive training enables [zero-shot](https://en.wikipedia.org/wiki/Zero-shot_learning "Zero-shot learning") image classification; it was trained on 400 million pairs of images and text captions [scraped](https://en.wikipedia.org/wiki/Web_scraping "Web scraping") from the Internet.[1](https://en.wikipedia.org/wiki/DALL-E#cite_note-vb-1)[20](https://en.wikipedia.org/wiki/DALL-E#cite_note-mittr-20)[21](https://en.wikipedia.org/wiki/DALL-E#cite_note-21)
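The core idea of contrastive pre-training is that, within a batch of image-caption pairs, each image should score highest against its own caption and low against every other caption. Below is a minimal PyTorch sketch of such a symmetric contrastive objective; the function name `clip_contrastive_loss`, the temperature value, and the toy embeddings are illustrative assumptions, not OpenAI's actual implementation.

```python
# A minimal sketch of a CLIP-style contrastive objective (illustrative,
# not OpenAI's code): each image in a batch is trained to match its own
# caption and to mismatch every other caption in the batch.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over cosine-similarity logits.

    image_emb, text_emb: (N, D) batches of embeddings from the image and
    text encoders. temperature is an assumed, commonly used value.
    """
    # L2-normalize so the dot product below is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix: entry [i, j] scores image i vs. caption j.
    logits = image_emb @ text_emb.t() / temperature

    # The true image-caption pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image->text and text->image cross-entropy losses.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    # Toy batch: 8 random image/text embedding pairs of dimension 512.
    imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts))
```

Zero-shot classification then falls out of the shared embedding space this objective learns: embed each candidate class name as a caption (e.g. "a photo of a dog") and predict the class whose text embedding is most similar to the image embedding, with no task-specific fine-tuning.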