DeepSeek OCR - Background
A DeepSeek paper that is more about Contexts Optical Compression than about OCR, which could be the next big breakthrough in the VLM field. Great walkthrough in the EZ Encoder video.
0 VLM Background
Over the past decades, NLP and vision research have been merging into VLM work.
Here are a couple of image encoders:
Vision Transformer (ViT) splits a picture into patches; how to handle pictures of different resolutions has been the recurring problem from ViT through all the follow-up works (see the patch-embedding sketch after this list).
Swin Transformer brings CNN-style hierarchical, multi-scale processing to Transformers; Swin is short for “Shifted Windows”.
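
To make the patch idea concrete, here is a minimal ViT-style patch-embedding sketch (my own illustration, not code from any of the papers above); the fixed 224x224 input and 16x16 patch size are exactly the assumption that makes varying resolutions painful.

```python
# Minimal sketch of ViT-style patch embedding (illustration, not any paper's code).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided conv is the standard trick: each kernel application flattens
        # one (patch_size x patch_size) patch into a single embed_dim vector.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)    # (B, num_patches, D)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]) -> a 14x14 grid of patch tokens
```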

1 Combine Tokens
To combine text and image tokens, CLIP is a key work from OpenAI,
and it lays the foundation for multimodal work with any other modalities; the NExT-GPT paper is an example of extending this to many modalities.
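
CLIP's core trick is contrastive alignment of image and text embeddings in a shared space. A minimal sketch of the symmetric contrastive loss (my simplification; the real model learns the temperature and trains with very large batches):

```python
# Sketch of CLIP's contrastive objective (illustration only).
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, temperature=0.07):
    img = F.normalize(image_feats, dim=-1)        # (N, D), unit-norm -> cosine similarity
    txt = F.normalize(text_feats, dim=-1)         # (N, D)
    logits = img @ txt.t() / temperature          # (N, N) pairwise similarities
    labels = torch.arange(len(img))               # matching image/text pairs on the diagonal
    # Symmetric cross-entropy: image->text and text->image directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```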
There are multiple ways to combine text and image/video tokens, such as a linear projector (LLaVA), attention (Qwen-VL), and cross-attention (Q-Former, BLIP-2) (from Haoran’s previous Vary paper).
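
Here is a hedged sketch of two of those connectors; the class names and dimensions are mine, not from the papers:

```python
# (1) LLaVA-style linear projector; (2) Q-Former-style cross-attention where a fixed
# set of learned queries pulls information out of the image features (heavily simplified).
import torch
import torch.nn as nn

class LinearProjector(nn.Module):               # LLaVA-style
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens):              # (B, N_img, vis_dim)
        return self.proj(vis_tokens)            # (B, N_img, llm_dim), fed to the LLM as tokens

class QueryCrossAttention(nn.Module):           # Q-Former-style, simplified
    def __init__(self, vis_dim=1024, dim=768, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vis_tokens):              # (B, N_img, vis_dim)
        kv = self.vis_proj(vis_tokens)
        q = self.queries.expand(vis_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)           # (B, num_queries, dim): fixed token budget
        return out
```

The cross-attention route compresses any number of image patches down to a fixed number of query tokens, which is the main reason it is attractive when images are large.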
Another example is DeepSeek-VL2, which uses dynamic tiling.
This tiling idea comes from the InternVL 1.5 paper.
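
Roughly, dynamic tiling resizes the image to a nearby aspect-ratio grid, cuts it into fixed-size tiles, and keeps a downscaled thumbnail as a global view. A simplified sketch (the 448 tile size and the brute-force grid search are assumptions for illustration; InternVL 1.5's actual ratio selection is more careful):

```python
# Rough sketch of dynamic tiling (my own simplification).
from PIL import Image

TILE = 448

def dynamic_tile(img: Image.Image, max_tiles: int = 6):
    w, h = img.size
    # Pick a (cols, rows) grid whose aspect ratio best matches the image.
    best = (1, 1)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows <= max_tiles and \
               abs(cols / rows - w / h) < abs(best[0] / best[1] - w / h):
                best = (cols, rows)
    cols, rows = best
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    thumbnail = img.resize((TILE, TILE))        # low-res global view
    return tiles + [thumbnail]                  # each tile is encoded by the ViT separately
```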

2 Multiple Image Encoders
Because real images vary widely in size, aspect ratio, and resolution, we may employ multiple image encoders; the DeepSeek-VL paper, for example, uses two: SAM-B and SigLIP-L.
Cambrian-1 from LeCun’s team at NYU even pushes to four encoders: SigLIP + DINOv2 + ConvNeXt + CLIP.
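
A hedged sketch of how features from several encoders can be fused, assuming every encoder yields the same number of per-patch tokens (the actual DeepSeek-VL and Cambrian-1 fusion modules are more involved):

```python
# Simplified multi-encoder fusion: concatenate per-patch features channel-wise,
# then project into the LLM embedding space.
import torch
import torch.nn as nn

class MultiEncoderFusion(nn.Module):
    def __init__(self, encoders: nn.ModuleList, dims: list, llm_dim: int = 4096):
        super().__init__()
        self.encoders = encoders                   # e.g. [SAM-B, SigLIP-L] backbones
        self.proj = nn.Linear(sum(dims), llm_dim)

    def forward(self, image):                      # image: (B, 3, H, W)
        feats = [enc(image) for enc in self.encoders]    # each: (B, N, dim_i), same N assumed
        return self.proj(torch.cat(feats, dim=-1))       # (B, N, llm_dim)

# Tiny demo with stand-in encoders that just emit random per-patch features.
class Dummy(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
    def forward(self, img):
        return torch.randn(img.size(0), 196, self.dim)

fusion = MultiEncoderFusion(nn.ModuleList([Dummy(256), Dummy(1024)]), [256, 1024])
print(fusion(torch.randn(2, 3, 224, 224)).shape)   # torch.Size([2, 196, 4096])
```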
NaViT from Google proposed a way to make one encoder handle any aspect ratio and resolution: patches from different images are packed into a single sequence, termed Patch n’ Pack, which enables variable resolution while preserving the aspect ratio. The downside of this method is that it generates too many tokens.
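
A sketch of the packing idea (illustration only): patch counts grow with resolution, which is why the token budget blows up; an image-id mask keeps attention within each image.

```python
# Patch n' Pack, sketched: patches from several variable-resolution images share one
# sequence, and an image-id tensor builds a mask so patches only attend within their image.
import torch

def pack_images(patch_lists):                      # list of (N_i, D) patch tensors
    tokens = torch.cat(patch_lists, dim=0)         # (sum N_i, D): one packed sequence
    image_ids = torch.cat([torch.full((p.size(0),), i)
                           for i, p in enumerate(patch_lists)])
    # True where query and key belong to the same image.
    attn_mask = image_ids[:, None] == image_ids[None, :]
    return tokens, attn_mask

imgs = [torch.randn(196, 768), torch.randn(320, 768), torch.randn(96, 768)]
tokens, mask = pack_images(imgs)
print(tokens.shape, mask.shape)   # torch.Size([612, 768]) torch.Size([612, 612])
```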
