K8S behind DGXCloud and NVCF

1 minute read

Recently all work seems K8S related and practices around k8s helped me onboard DGXCloud and NVCF Helm deployment really fast.

0 Web Server

It’s totally irrelavent but some information about web sever

WSGI: The old generation, for Web Sever Gateway Interface. It uses framework like Flask and Djando. and servers with names I never heard of.
ASGI: The new generation and A stands for Async. FastAPI is such framework and uvicorn is the webserver application.

1 DGXCloud

Run:ai CLI and k8s setup following steps here
Run runai login and authenicate through OSS
Run runai kubeconfig set to get token set
Then it’s like a normal cluster operation with a specific namespace.It’s not shown up in DGXC UI due to not in runai schedular.
Will add more details

2 NVCF

NVCF can be considered as serverless k8s, and bring your own cluster approach would make the deployment much more transparent.

Register cluster with NVCF.
I haven’t tested out this step but worked with a registerd cluster
Health Check implementation in code.
All is needed is a health check.

app = FastAPI()
class HealthCheck(BaseModel):
    status: str = "OK"

@app.get("/health", tags=["healthcheck"],summary="Perform a Health Check", response_description="Return HTTP Status Code 200 (OK)", status_code=status.HTTP_200_OK,
response_model=HealthCheck)
def get_health() -> HealthCheck:
    return HealthCheck(status="OK") 

Helm chart preparation
A service will be exposed in NVCF, and will have /health and /generate two endpoint implemented in the Pod behind it.

helm template chart/
helm generate chart/
# Will generate chart_name-version.tgz based on Chart.yaml
ngc registry chart create ngc_org_id/chart_name --short-desc "chart des"
ngc registry chart push ngc_org_id/chart_name:version --dry-run

NVCF Deployment
Set the service for exposing ports, and endpoints for health check and inference during function creations.

# Deploy to the on-prem cluster
ngc cf fn deploy create function_id:version_id --targeted-deployment-specification "A100:ON-PREM.GPU.A100_8x:1:1:1:machinename-a100x8"

Twitter Facebook LinkedIn

You May Also Enjoy

CUDA

May 21 2025

1 Concepts thread thread block, consists of warps, executed on SM(Streaming Multiprocessor) warp, is a 32 thread block. A warp is executed physically ...

Slurm and Enroot

May 19 2025

Finally touching on Slurm system. First heard about during CGG time, and we had some brief discussing of using it for cluster jobs. But our own implemention ...

NVLink, InfiniBand and SpectrumX

May 13 2025

Summary from zhihu post, which some picture from here.

Disagg PD in vLLM and LMCache

May 02 2025

Tested out Disagg PD in vLLM and sth about LMCache, an open-source Knowledge Delivery Network (KDN), and the Redis for LLMs.