Continuous Batching in Orca
Everyone talks about continuous batching because of the Anyscale blog post, but the original Orca paper and talk give some extra details.
1 Request-level scheduling
With request-level scheduling, a finished request cannot be returned until every request in the batch is done, so latency increases whenever a short request is stuck behind requests with long outputs.
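A toy sketch of this effect (not from the paper; the per-token cost and output lengths are made-up numbers), comparing latency under the two scheduling granularities:

```python
# Request-level: the batch returns only when its longest request finishes,
# so every request pays the latency of the slowest one.
def request_level_latencies(output_lens, ms_per_token=1.0):
    batch_time = max(output_lens) * ms_per_token
    return [batch_time for _ in output_lens]

# Iteration-level: each request returns as soon as it finishes.
def iteration_level_latencies(output_lens, ms_per_token=1.0):
    return [n * ms_per_token for n in output_lens]

if __name__ == "__main__":
    lens = [8, 8, 512]                        # one long request in the batch
    print(request_level_latencies(lens))      # [512.0, 512.0, 512.0]
    print(iteration_level_latencies(lens))    # [8.0, 8.0, 512.0]
```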
2 Iteration-level scheduling
With max_batch_size = 3, an example walk-through (sketched in code after this list):
- The execution engine processes the current batch of requests (x1, x2)
- After the iteration, the running requests go back into the request pool together with newly arrived requests (x1, x2, x3)
- The next iteration is scheduled, now including the newly added request x3
- A finished request (x2) is sent back as a response immediately, and its freed slot is used to process the newly added request x4
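A minimal sketch of such an iteration-level scheduler loop. This is an assumption-laden simplification (FCFS pool, a fake one-step engine, `remaining` token counts), not Orca's actual scheduler:

```python
from collections import deque

MAX_BATCH_SIZE = 3  # matches max_batch_size = 3 in the example above

class Request:
    def __init__(self, rid, remaining_tokens):
        self.rid = rid
        self.remaining = remaining_tokens  # tokens still to generate (stand-in for real state)

def run_one_iteration(batch):
    # Stand-in for a single forward pass of the execution engine.
    for req in batch:
        req.remaining -= 1

def serve(request_pool: deque):
    responses = []
    while request_pool:
        # Select at most MAX_BATCH_SIZE requests from the pool for this iteration.
        n = min(MAX_BATCH_SIZE, len(request_pool))
        batch = [request_pool.popleft() for _ in range(n)]
        run_one_iteration(batch)
        for req in batch:
            if req.remaining == 0:
                responses.append(req.rid)     # finished: return to the client right away
            else:
                request_pool.appendleft(req)  # keep running requests near the head of the pool
    return responses

# New requests can be appended to request_pool at any time; they are picked
# up at the next iteration boundary, filling slots freed by finished requests.
```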
3 Batching issues with iteration-level scheduling
For requests to be batched into a single tensor operation, they must satisfy the following criteria:
- They are in the same phase (all prefill or all decode)
- They have the same sequence length
With iteration-level scheduling the selected requests often violate both conditions, so they cannot be batched naively (see the example after this list).
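A small illustration of the shape problem; the sizes and the use of NumPy are just for demonstration, not how a real engine stores activations:

```python
import numpy as np

hidden = 4
prefill_req = np.zeros((5, hidden))  # prefill: processes all 5 prompt tokens this iteration
decode_req  = np.zeros((1, hidden))  # decode: processes exactly 1 new token this iteration

try:
    # Naive batching would stack per-request inputs into [batch, seq_len, hidden],
    # which requires identical shapes.
    np.stack([prefill_req, decode_req])
except ValueError as e:
    print("cannot batch naively:", e)
```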
4 Selective Batching
- For non-Attention ops such as Linear and LayerNorm, which work token by token, the requests' tokens can be concatenated into one flat tensor and processed uniformly
- For Attention, split the batch, process each request individually, and merge the output tensors back together (see the sketch after this list)
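A rough sketch of selective batching. The shapes, the random weights, and the placeholder "attention" are assumptions for illustration, not Orca's implementation:

```python
import numpy as np

hidden = 4
# Requests with different token counts (e.g. prefill of 5 and 3 tokens, decode of 1).
reqs = [np.random.randn(n, hidden) for n in (5, 1, 3)]

# 1) Token-wise ops: concatenate all requests into one flat [total_tokens, hidden]
#    tensor; Linear/LayerNorm do not care which request a token belongs to.
flat = np.concatenate(reqs, axis=0)            # shape (5+1+3, hidden)
W = np.random.randn(hidden, hidden)
flat = flat @ W                                # "Linear" over the flat token batch

# 2) Attention: split back per request (each has its own length and KV state),
#    apply attention per request, then merge the outputs into the flat layout.
splits = np.cumsum([r.shape[0] for r in reqs])[:-1]
outputs = []
for x in np.split(flat, splits, axis=0):
    attn = np.ones((x.shape[0], x.shape[0])) / x.shape[0]  # placeholder attention weights
    outputs.append(attn @ x)
flat = np.concatenate(outputs, axis=0)         # back to [total_tokens, hidden]
```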
5 Orca
Orca is the system that implements both features. For continuous batching: the scheduler works at iteration granularity, returning finished requests to the client immediately and admitting new requests from the request pool at every iteration. For selective batching: non-Attention ops run on the token-wise concatenated batch, while Attention is computed per request and the outputs are merged.