Continuous Batching in Orca

Everyone talks about continuous batching via the Anyscale blog, but the original Orca paper and talk give some extra details.

1 Request-level scheduling

Latency increases when a finished request cannot be returned because other requests in the same batch are still generating long outputs.
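
A minimal sketch of this effect, not Orca's code: each request's remaining decode steps are assumed known up front, and the request IDs and step counts are made up for illustration.

```python
# Request-level (static) batching: the whole batch runs until the longest
# request finishes, so a short request's response is delayed.
# `needed_steps` is a made-up stand-in for "decode iterations until EOS".

def request_level_serve(batch):
    """batch: list of (request_id, needed_steps)."""
    steps = 0
    remaining = {rid: n for rid, n in batch}
    finish_step = {}
    while remaining:                       # run until every request is done
        steps += 1
        for rid in list(remaining):
            remaining[rid] -= 1            # one decode iteration per request
            if remaining[rid] == 0:
                finish_step[rid] = steps   # token generation done here...
                del remaining[rid]
    # ...but every response is only returned after the whole batch finishes
    returned_at = {rid: steps for rid in finish_step}
    return returned_at, finish_step

returned_at, finished_at = request_level_serve([("x1", 2), ("x2", 8)])
print(finished_at)   # {'x1': 2, 'x2': 8}
print(returned_at)   # {'x1': 8, 'x2': 8}  -> x1 waits 6 extra iterations
```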

2 Iteration-level scheduling

For the max_batch_size = 3 case (a sketch of this loop follows the list):

  1. Process the current requests (x1, x2) in the execution engine
  2. Move all requests back to the request pool, together with newly arrived requests (x1, x2, x3)
  3. Process again, now including the newly added request x3
  4. Send finished requests (x2) back as responses and process newly added requests (x4)
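
A minimal sketch of the iteration-level loop above, not Orca's actual scheduler: the per-request step counts and MAX_BATCH_SIZE constant are assumptions standing in for real decode lengths and the engine's batch limit.

```python
# After every iteration the scheduler returns finished requests immediately
# and refills the running batch from the request pool, up to max_batch_size.
from collections import deque

MAX_BATCH_SIZE = 3

def iteration_level_serve(arrivals):
    """arrivals: dict iteration -> list of (request_id, steps_left)."""
    pool, running, responses = deque(), {}, []
    it = 0
    while pool or running or any(t >= it for t in arrivals):
        pool.extend(arrivals.get(it, []))        # new requests join the pool
        while pool and len(running) < MAX_BATCH_SIZE:
            rid, steps = pool.popleft()          # refill the batch (steps 2-3)
            running[rid] = steps
        for rid in list(running):                # one engine iteration (step 1)
            running[rid] -= 1
            if running[rid] == 0:                # finished: respond now (step 4)
                responses.append((rid, it))
                del running[rid]
        it += 1
    return responses

# x2 finishes early and is returned immediately; x4 takes its slot.
print(iteration_level_serve({0: [("x1", 3), ("x2", 1)],
                             1: [("x3", 2)],
                             2: [("x4", 2)]}))
```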

3 Batching issues with iteration-level scheduling

For naive batching to work, requests must be in the same phase and have the same length; under iteration-level scheduling these conditions routinely break (see the example after this list):

  1. Requests can be in different phases, prefill vs. decode
  2. Requests can have different sequence lengths
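
A small illustration of why the shapes stop lining up (my own example, not from the paper): a prefill request carries all of its prompt tokens while a decode request carries a single token, so the per-request tensors cannot be stacked into one batch.

```python
import numpy as np

hidden = 4
prefill_x1 = np.random.randn(5, hidden)   # x1: prefill, 5 prompt tokens
prefill_x2 = np.random.randn(3, hidden)   # x2: prefill, 3 prompt tokens
decode_x3  = np.random.randn(1, hidden)   # x3: decode, 1 new token

try:
    np.stack([prefill_x1, prefill_x2, decode_x3])   # requires equal shapes
except ValueError as e:
    print("cannot batch:", e)
```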

4 Selective Batching

  1. For LayerNorm and Linear layers, concatenate the tokens of all requests and process them as one flattened tensor
  2. For Attention, split the batch, process each request individually, and merge the output tensors (see the sketch after this list)
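
A rough sketch of selective batching under assumed shapes, not Orca's implementation: token-level ops see one concatenated tensor, attention runs per request on its own split, and the outputs are merged back in order. The toy_attention helper is a placeholder, not the real kernel.

```python
import numpy as np

hidden = 4
W = np.random.randn(hidden, hidden)                      # shared Linear weight
reqs = [np.random.randn(n, hidden) for n in (5, 3, 1)]   # x1, x2, x3 token states
lens = [r.shape[0] for r in reqs]

# 1. Concatenate along the token dimension for the Linear layer
flat = np.concatenate(reqs, axis=0)                      # [5+3+1, hidden]
flat = flat @ W                                          # one batched matmul

# 2. Split back per request for attention, then merge the outputs
def toy_attention(x):                                    # placeholder per-request attention
    scores = x @ x.T / np.sqrt(x.shape[1])
    probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return probs @ x

splits = np.split(flat, np.cumsum(lens)[:-1], axis=0)
merged = np.concatenate([toy_attention(s) for s in splits], axis=0)
print(merged.shape)                                      # (9, 4)
```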

5 Orca

Orca is the system that implements both of these techniques: continuous (iteration-level) batching and selective batching.
