K8S job for NIM

2 minute read

To create a NIM by k8s job, I worked out it step by step translating of container operation into k8s scripts.

1 K8S Jobs

A k8s job is resource that no different from other resources. Adding pod template to it, you will get a job yaml file which can be submitted by kapp -f job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: nim
spec:
  template:

To monitor the job and see logs

  • kget job(nim) or kget pod(nim-xxxx)
  • kdes job/nim
  • klog pod/nim-xxxx

1 Lables

Metadata can be used to store labels, can be used by selector later

spec:
  template:
    metadata:
      labels:
        app: kh-nim-demo

Then the pod can be listed by kget pod -l app=kh-nim-demo

2 Secrects

  1. Docker login secrets imagePullSecrets. The config file is converted to base64 by
    cat ~/.docker/config.json | base64 -w0
    ```shell

    Define secret

    apiVersion: v1 kind: Secret metadata: name: kh-secret-nvcr #namespace: awesomeapps data: #cat ~/.docker/config.json | base64 -w0 .dockerconfigjson: BASE64_STRING type: kubernetes.io/dockerconfigjson

Use it as below

spec:

imagePullSecrets:

- name: kh-secret-nvcr

2. API tokens or Passwords
```shell
apiVersion: v1
kind: Secret
metadata:
  name: kh-secret-ngc
  #namespace: default
type: Opaque
stringData:
  API_KEY: "_API_TOKEN_KEYS_"
# Use it as below
      #  env:
      #  - name: NGC_API_KEY
      #     valueFrom:
      #       secretKeyRef:
      #         name: kh-secret-ngc
      #         key: API_KEY

3 PersistentVolumns

Two steps to claim a persistent disk for k8s

  1. Persistent Volumn Creation
    ```shell apiVersion: v1 kind: PersistentVolume metadata: name: kh-pv-cache spec: capacity: storage: 50Gi accessModes:
    • ReadWriteOnce persistentVolumeReclaimPolicy: Retain storageClassName: local-storage local: path: /raid/models/nim # Replace with the actual local disk path on your node nodeAffinity: required: nodeSelectorTerms:
      • matchExpressions:
        • key: kubernetes.io/hostname operator: In values:
          • saea-dgx-a100-80gb # Replace with the actual node hostname ``` It will be Status: Available after creation and could be auto claimed by PVC in the same storageClass
  2. Persistent Volumn Claim creation
    ```shell apiVersion: v1 kind: PersistentVolumeClaim metadata: name: kh-pvc-cache spec: accessModes:
    • ReadWriteOnce storageClassName: local-storage resources: requests: storage: 50Gi ``` You can see all storage classes by
      kget storageclass

Now you can create a volumn and mount it into the container

## Create volumn by referring to PVC
      volumes:
      - name: nim-local-cache
        persistentVolumeClaim:
          claimName: kh-pvc-cache
## Mount into the container
        volumeMounts:
          - name: nim-local-cache
            mountPath: /opt/nim/.cache
        

4 Port

Container port --port 8001:8000 will be declared as

ports:
        - containerPort: 8000  # Container listens on port 8000, as NIM_SERVER_PORT
          hostPort: 8001       # Node will expose port 8001 mapped to container's 8000 port

A service nodePort can be created to further expose the port. But it’s not used in this example.

apiVersion: v1
kind: Service
metadata:
  name: kh-svc-port
spec:
  type: NodePort
  selector:
    app: kh-nim-demo
  ports:
  - protocol: TCP
    port: 8002 # node service port
    targetPort: 8001 # pod port
    nodePort: 30080 # external port 

5 GPU Specifications

  1. runtimeClassName: nvidia
  2. Num of GPUs
         resources:
           limits:
             nvidia.com/gpu: "2"
    
  3. GPU Visible Devices
         env:
         - name: NVIDIA_VISIBLE_DEVICES
           value: "0,1"
    

6 The Whole picture

apiVersion: batch/v1
kind: Job
metadata:
  name: nim
spec:
  template:
    metadata:
      labels:
        app: kh-nim-demo
    spec:
      imagePullSecrets:
      - name: kh-secret-nvcr
      volumes:
      - name: nim-local-cache
        persistentVolumeClaim:
          claimName: kh-pvc-cache
      runtimeClassName: nvidia
      containers:
      - name: nim-container
        image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
        volumeMounts:
          - name: nim-local-cache
            mountPath: /opt/nim/.cache
        resources:
          limits:
            nvidia.com/gpu: "2"
        ports:
        - containerPort: 8000  # Container listens on port 8000, as NIM_SERVER_PORT
          hostPort: 8001       # Node will expose port 8001 mapped to container's 8000 port
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0,1"
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: kh-secret-ngc
              key: API_KEY
        - name: NIM_SERVER_PORT
          value: "8000"
        - name: NIM_JSONL_LOGGING
          value: "1"
        - name: NIM_LOG_LEVEL
          value: INFO
        - name: NIM_SERVED_MODEL_NAME
          value: "model_name1, model_name2"
      restartPolicy: Never

Tags:

Categories:

Updated: