SAM and BLIP
Segment Anything Model and Bootstrapping Language-Image Pre-training
1 SAM
Meta published SAM as a foundation model for a very general, promptable segmentation task.
There are a couple of key technical points in SAM. SAM can take various kinds of prompts as input: points, boxes, or text. It is trained on the SA-1B dataset, which is both high quality for segmentation and very large, roughly 1B masks. It is impossible to have humans label that much data, so a bootstrapping loop was used: an early model assists human annotators, the model is retrained on the corrected masks, and the final masks are generated fully automatically.

Putting everything together, we get the foundation model. EfficientSAM was introduced later, sacrificing some accuracy for speed.
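To make the prompt interface concrete, here is a minimal usage sketch with the official segment-anything package; the checkpoint and image paths are placeholders, and the point/box coordinates are arbitrary.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Placeholder paths: download a checkpoint from the official SAM repo first.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once; prompts are then cheap against the cached embedding.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground point (label 1 = foreground, 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous prompt
)

# Prompt with a bounding box instead (x0, y0, x1, y1).
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)
```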
2 BLIP
What is bootstrapping in ML? Broadly, it means a model improving its own training data: it is first trained on a small labeled set, and then its filtered predictions on unlabeled data are fed back in as new training examples.
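As a toy illustration of that loop (not BLIP's pipeline, just a hypothetical nearest-centroid classifier that grows its training set with its own confident predictions):

```python
import numpy as np

rng = np.random.default_rng(0)
labeled_x = np.array([[0.0, 0.0], [5.0, 5.0]])   # one human-labeled point per class
labeled_y = np.array([0, 1])
unlabeled_x = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

for _ in range(3):  # a few bootstrapping rounds
    centroids = np.stack([labeled_x[labeled_y == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(unlabeled_x[:, None, :] - centroids[None, :, :], axis=-1)
    preds = dists.argmin(axis=1)                  # model labels its own data
    margin = np.abs(dists[:, 0] - dists[:, 1])    # crude confidence score
    keep = margin > 2.0                           # filter: keep only confident pseudo-labels
    labeled_x = np.vstack([labeled_x, unlabeled_x[keep]])
    labeled_y = np.concatenate([labeled_y, preds[keep]])
    unlabeled_x = unlabeled_x[~keep]

print(len(labeled_y), "training examples after bootstrapping")
```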
Let's see how BLIP uses this idea. First, take a look at ALBEF (ALign the image and text representations BEfore Fusing). ALBEF aligns the text and image embeddings with a contrastive loss before fusing them.

BLIP's ITC (image-text contrastive) objective is very similar to ALBEF's, but BLIP also adds ITM (image-text matching) and LM (language modeling) heads so it can generate text for an image.

Training starts with noisy internet data plus a small set of human labels and bootstraps from there. The model can be used to replace the title of a picture scraped from the internet, $T_w$, with a synthetic caption $T_s$. The benefit is obvious: the original title may be irrelevant to the image (for example, about personal feelings), while the corrected caption is much more useful for training.
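To make the ITC objective above concrete, here is a minimal sketch of it as a symmetric InfoNCE loss over a batch of paired embeddings; the real ALBEF/BLIP implementations add momentum encoders, queues, and soft labels, which are omitted here.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) projections from the image and text encoders;
    the i-th image and i-th text are a matching pair, all other pairs are negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
img = torch.randn(8, 256)
txt = torch.randn(8, 256)
print(itc_loss(img, txt).item())
```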
3 BLIP-2
The Querying Transformer (Q-Former) is the main contribution of BLIP-2. It is trained in two stages: the first stage learns vision-language representations against a frozen image encoder, and the second stage bootstraps vision-to-language generation against a frozen LLM. Here are the results of BLIP-2.
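A rough sketch of the Q-Former idea: a small set of learnable query tokens cross-attends to frozen image features, and the query outputs are projected into the frozen LLM's embedding space. The dimensions, layer counts, and class name below are illustrative, not BLIP-2's actual configuration.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Illustrative Q-Former-style module: learnable queries attend to frozen image features."""

    def __init__(self, num_queries=32, dim=256, llm_dim=1024, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        # Decoder layers give the queries self-attention plus cross-attention to image features.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_llm = nn.Linear(dim, llm_dim)  # project query outputs into the LLM embedding space

    def forward(self, image_features):
        # image_features: (batch, num_patches, dim) from a frozen image encoder.
        q = self.queries.expand(image_features.size(0), -1, -1)
        q = self.decoder(tgt=q, memory=image_features)
        return self.to_llm(q)  # (batch, num_queries, llm_dim) soft prompts for the frozen LLM

# Toy usage: 196 patch features of width 256 from a frozen ViT stand-in.
feats = torch.randn(4, 196, 256)
soft_prompts = TinyQFormer()(feats)
print(soft_prompts.shape)  # torch.Size([4, 32, 1024])
```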