CLIP
Dr Vlog gave a talk on CLIP to Math PhDs and summarized it in a 50-minute video.
0 Paper
One of OpenAI’s early works on multimodal model training.
1 How it works
The fundamental idea is contrastive learning between image and text content, hence the name CLIP (Contrastive Language-Image Pre-training). It can be used for zero-shot inference, which is its main advantage, and its zero-shot result is even higher than its one-shot result, which is the opposite of human behavior. This suggests that the real "learning" capability of humans is still far stronger than that of current models.
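To make the zero-shot idea concrete, here is a small sketch of zero-shot classification, following the usage shown in the openai/CLIP repository README; the image path "example.jpg" and the candidate prompts are placeholders, and it assumes the clip package and Pillow are installed.

```python
# Sketch of zero-shot image classification with CLIP (placeholders: image path, prompts).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
# Each candidate class is written as a natural-language prompt.
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a photo of a car"]).to(device)

with torch.no_grad():
    # Similarity logits between the image and every prompt.
    logits_per_image, logits_per_text = model(image, text)
    # Softmax over prompts gives class probabilities; no task-specific training needed.
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the highest-probability prompt is the zero-shot prediction
```

No labeled examples from the target task are used; the class names alone, phrased as text, define the classifier.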
2 Code
Below is dummy training code for CLIP (see the sketch that follows). CLIP also adapts well to image sets that are dramatically different from its original training distribution.
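This is a minimal sketch of the contrastive training step, loosely following the pseudocode in the paper; the batch size, embedding dimension, temperature value, and the random tensors standing in for encoder outputs are all placeholders.

```python
# Minimal sketch of CLIP's symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric cross-entropy loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product becomes cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature (learnable in the paper).
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy losses.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

if __name__ == "__main__":
    # Random embeddings stand in for image_encoder/text_encoder outputs after projection.
    batch, dim = 8, 512
    img_emb = torch.randn(batch, dim)
    txt_emb = torch.randn(batch, dim)
    print(clip_loss(img_emb, txt_emb).item())
```

In a real training loop these embeddings come from the image and text encoders plus linear projections, and the loss is backpropagated through both encoders jointly.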
3 Applications
Apple’s MM1 model is based on CLIP and also shows impressive few-shot learning capability.