Continuous Batching in Orca
Everyone talks about continuous batching because of the Anyscale blog post, but the original Orca paper and talk give some extra details.
1 Request-level scheduling
With request-level scheduling, a finished request cannot be returned until every request in the batch is done, so latency increases whenever a short request is stuck behind requests with long outputs.
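A toy sketch of this effect (not from the paper; the per-token cost and output lengths are made-up numbers), comparing latency under the two scheduling granularities:

```python
# Request-level: the batch returns only when its longest request finishes,
# so every request pays the latency of the slowest one.
def request_level_latencies(output_lens, ms_per_token=1.0):
    batch_time = max(output_lens) * ms_per_token
    return [batch_time for _ in output_lens]

# Iteration-level: each request returns as soon as it finishes.
def iteration_level_latencies(output_lens, ms_per_token=1.0):
    return [n * ms_per_token for n in output_lens]

if __name__ == "__main__":
    lens = [8, 8, 512]                        # one long request in the batch
    print(request_level_latencies(lens))      # [512.0, 512.0, 512.0]
    print(iteration_level_latencies(lens))    # [8.0, 8.0, 512.0]
```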
2 Iteration-level scheduling
With max_batch_size = 3, an example walk-through (sketched in code after this list):
- The execution engine processes the current batch of requests (x1, x2)
- After the iteration, the running requests go back into the request pool together with newly arrived requests (x1, x2, x3)
- The next iteration is scheduled, now including the newly added request x3
- A finished request (x2) is sent back as a response immediately, and its freed slot is used to process the newly added request x4
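A minimal sketch of such an iteration-level scheduler loop. This is an assumption-laden simplification (FCFS pool, a fake one-step engine, `remaining` token counts), not Orca's actual scheduler:

```python
from collections import deque

MAX_BATCH_SIZE = 3  # matches max_batch_size = 3 in the example above

class Request:
    def __init__(self, rid, remaining_tokens):
        self.rid = rid
        self.remaining = remaining_tokens  # tokens still to generate (stand-in for real state)

def run_one_iteration(batch):
    # Stand-in for a single forward pass of the execution engine.
    for req in batch:
        req.remaining -= 1

def serve(request_pool: deque):
    responses = []
    while request_pool:
        # Select at most MAX_BATCH_SIZE requests from the pool for this iteration.
        n = min(MAX_BATCH_SIZE, len(request_pool))
        batch = [request_pool.popleft() for _ in range(n)]
        run_one_iteration(batch)
        for req in batch:
            if req.remaining == 0:
                responses.append(req.rid)     # finished: return to the client right away
            else:
                request_pool.appendleft(req)  # keep running requests near the head of the pool
    return responses

# New requests can be appended to request_pool at any time; they are picked
# up at the next iteration boundary, filling slots freed by finished requests.
```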
3 Batching issues with iteration-level scheduling
For requests to be batched into a single tensor operation, they must satisfy the following criteria:
- They are in the same phase (all prefill or all decode)
- They have the same sequence length
With iteration-level scheduling the selected requests often violate both conditions, so they cannot be batched naively (see the example after this list).
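A small illustration of the shape problem; the sizes and the use of NumPy are just for demonstration, not how a real engine stores activations:

```python
import numpy as np

hidden = 4
prefill_req = np.zeros((5, hidden))  # prefill: processes all 5 prompt tokens this iteration
decode_req  = np.zeros((1, hidden))  # decode: processes exactly 1 new token this iteration

try:
    # Naive batching would stack per-request inputs into [batch, seq_len, hidden],
    # which requires identical shapes.
    np.stack([prefill_req, decode_req])
except ValueError as e:
    print("cannot batch naively:", e)
```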
4 Selective Batching
- For non-Attention ops such as Linear and LayerNorm, which work token by token, the requests' tokens can be concatenated into one flat tensor and processed uniformly
- For Attention, split the batch, process each request individually, and merge the output tensors back together (see the sketch after this list)
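A rough sketch of selective batching. The shapes, the random weights, and the placeholder "attention" are assumptions for illustration, not Orca's implementation:

```python
import numpy as np

hidden = 4
# Requests with different token counts (e.g. prefill of 5 and 3 tokens, decode of 1).
reqs = [np.random.randn(n, hidden) for n in (5, 1, 3)]

# 1) Token-wise ops: concatenate all requests into one flat [total_tokens, hidden]
#    tensor; Linear/LayerNorm do not care which request a token belongs to.
flat = np.concatenate(reqs, axis=0)            # shape (5+1+3, hidden)
W = np.random.randn(hidden, hidden)
flat = flat @ W                                # "Linear" over the flat token batch

# 2) Attention: split back per request (each has its own length and KV state),
#    apply attention per request, then merge the outputs into the flat layout.
splits = np.cumsum([r.shape[0] for r in reqs])[:-1]
outputs = []
for x in np.split(flat, splits, axis=0):
    attn = np.ones((x.shape[0], x.shape[0])) / x.shape[0]  # placeholder attention weights
    outputs.append(attn @ x)
flat = np.concatenate(outputs, axis=0)         # back to [total_tokens, hidden]
```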
5 Orca
Orca is the system that implements both features. For continuous batching: the scheduler works at iteration granularity, returning finished requests to the client immediately and admitting new requests from the request pool at every iteration. For selective batching: non-Attention ops run on the token-wise concatenated batch, while Attention is computed per request and the outputs are merged.