SAM and BLIP

Segment Anything Model and Bootstrapping Language-Image Pre-training

1 SAM

Meta published SAM for a very general segmentation task.

SAM has a couple of key technical ideas. First, it can take various kinds of prompts as input: points, boxes, or text.

Second, it is trained on the SA-1B dataset, which is both high quality and very large, containing about 1B segmentation masks. It is impossible to have humans label 1B masks by hand, so a bootstrapping loop was used: the model helps generate masks, and it is retrained on the growing dataset.

Putting everything together, we get the foundation model. EfficientSAM was introduced later, sacrificing some accuracy for speed.
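To make the prompt interface concrete, here is a minimal sketch using Meta's open-source `segment_anything` package. The checkpoint and image paths are placeholder assumptions; the released predictor accepts point, box, and mask prompts.

```python
# Minimal sketch of prompting SAM with a point and a box, using Meta's
# open-source `segment_anything` package. Paths below are placeholders.
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a ViT-H SAM checkpoint (hypothetical local path; download from the SAM repo).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # run the heavy image encoder once per image

# Point prompt: one foreground click at pixel (x=500, y=375).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),  # 1 = foreground, 0 = background
    multimask_output=True,       # return 3 candidate masks for an ambiguous click
)

# Box prompt: (x0, y0, x1, y1) in pixel coordinates.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)
```

Because the image embedding is computed once and the mask decoder is lightweight, new prompts on the same image can be answered almost interactively.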

2 BLIP

What is bootstrapping in ML? Roughly speaking, a model trained on a small labeled set is used to label more data itself, and is then retrained on the enlarged set.
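As a rough illustration, here is a library-agnostic sketch of that loop; `train`, `predict`, and the datasets are hypothetical placeholders, not any particular paper's method.

```python
# Minimal sketch of bootstrapping via pseudo-labeling: train on a small labeled
# set, label unlabeled data with the model, keep confident predictions, retrain.
# `train` and `predict` are hypothetical helpers.

def bootstrap(labeled, unlabeled, rounds=3, threshold=0.9):
    model = train(labeled)                      # start from human labels
    for _ in range(rounds):
        pseudo = []
        for x in unlabeled:
            y, confidence = predict(model, x)   # model labels its own data
            if confidence >= threshold:         # keep only confident labels
                pseudo.append((x, y))
        model = train(labeled + pseudo)         # retrain on the enlarged set
    return model
```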

Let’s see how BLIP uses this idea. First, take a look at ALBEF (ALign the image and text representations BEfore Fusing), which uses contrastive learning to align the text and image embeddings.

BLIP’s ITC (image-text contrastive) loss is very similar to ALBEF’s, but BLIP also adds ITM (image-text matching) and LM (language modeling) heads, so it can generate text for an image.

Training starts from internet data plus a small human-labeled set, and bootstraps from there: the model can replace a web caption $T_w$ scraped from the internet with a synthetic one $T_s$, as sketched below. The benefit is obvious: the original caption may be irrelevant to the image (for example, about personal feelings), while the corrected one is much more useful for ML training.
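Here is a sketch of that caption-bootstrapping step (CapFilt) under stated assumptions: `captioner` stands in for BLIP's LM-finetuned decoder and `filter_itm` for its ITM-finetuned filter; both are hypothetical helpers, not the official API.

```python
# Sketch of BLIP-style CapFilt: generate a synthetic caption T_s for each scraped
# image, then keep only captions (web or synthetic) that the ITM filter accepts.

def capfilt(web_pairs, captioner, filter_itm):
    """web_pairs: iterable of (image, T_w) pairs scraped from the internet."""
    cleaned = []
    for image, t_w in web_pairs:
        t_s = captioner(image)            # synthetic caption T_s from the LM head
        if filter_itm(image, t_w):        # ITM head scores image-text agreement
            cleaned.append((image, t_w))
        if filter_itm(image, t_s):
            cleaned.append((image, t_s))
    return cleaned                        # bootstrapped corpus for retraining
```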

3 BLIP-2

The Querying Transformer (Q-Former) is the main contribution of BLIP-2. It is pretrained in two stages: the first stage learns vision-language representations with a frozen image encoder, and the second stage connects the Q-Former to a frozen LLM for generation. Because both the image encoder and the LLM stay frozen, BLIP-2 reports strong results with far fewer trainable parameters than prior approaches.
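A minimal PyTorch sketch of the Q-Former idea, with illustrative dimensions rather than the paper's exact configuration: a fixed set of learned query tokens cross-attends to frozen image features, and the query outputs are projected into the frozen LLM's embedding space as soft prompts.

```python
# Sketch of the Q-Former idea: learned queries attend to frozen image features,
# then get projected to the LLM's embedding dimension. Sizes are placeholders.
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_llm = nn.Linear(dim, llm_dim)   # project queries into LLM space

    def forward(self, image_feats):
        # image_feats: (B, N_patches, dim) from a frozen image encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = self.blocks(tgt=q, memory=image_feats)  # queries cross-attend to image
        return self.to_llm(q)                       # soft prompts for the frozen LLM

frozen_feats = torch.randn(2, 257, 768)             # e.g., ViT output (placeholder)
soft_prompts = QFormerSketch()(frozen_feats)
print(soft_prompts.shape)                           # torch.Size([2, 32, 2560])
```

In the real model the Q-Former is a BERT-style transformer, and stage one also trains it with ITC, ITM, and image-grounded captioning objectives before the LLM is attached.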
