K8S Job for NIM
To run a NIM with a k8s Job, I worked it out step by step by translating the container run options into k8s manifests.
1 K8S Jobs
A k8s Job is a resource like any other; add a pod template to it and you get a job YAML file that can be submitted with `kapp -f job.yaml` (i.e. `kubectl apply -f job.yaml`; the `kget`, `kdes`, and `klog` commands below are likewise shorthand for `kubectl get`, `describe`, and `logs`).
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nim
spec:
  template:
    # pod template goes here (the full example is in the last section)
```
To monitor the job and see its logs:
```shell
kget job nim        # or: kget pod nim-xxxx
kdes job/nim
klog pod/nim-xxxx
```
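To block until the job finishes, or to follow the logs through the Job object itself, plain `kubectl` works too (a minimal sketch; the 30-minute timeout is an arbitrary choice):
```shell
# Follow logs via the Job (kubectl picks one of its pods)
kubectl logs -f job/nim
# Block until the job reports completion
kubectl wait --for=condition=complete job/nim --timeout=30m
```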
2 Labels
Labels can be stored in the pod template's metadata and used by selectors later:
```yaml
spec:
  template:
    metadata:
      labels:
        app: kh-nim-demo
```
The pod can then be listed with `kget pod -l app=kh-nim-demo`.
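The same selector works with most kubectl commands, which is handy once the pod name carries a random suffix (a small sketch using the label defined above):
```shell
# Show the labelled pod with node and IP details
kubectl get pod -l app=kh-nim-demo -o wide
# Follow its logs without knowing the generated pod name
kubectl logs -l app=kh-nim-demo -f --tail=100
```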
3 Secrets
1. Docker registry login secret, referenced via `imagePullSecrets`. The Docker config file is base64-encoded with `cat ~/.docker/config.json | base64 -w0`.
Define the secret:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: kh-secret-nvcr
  #namespace: awesomeapps
data:
  # cat ~/.docker/config.json | base64 -w0
  .dockerconfigjson: BASE64_STRING
type: kubernetes.io/dockerconfigjson
```
Use it in the pod spec:
```yaml
spec:
  imagePullSecrets:
  - name: kh-secret-nvcr
```
2. API tokens or passwords, stored as an Opaque secret:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: kh-secret-ngc
  #namespace: default
type: Opaque
stringData:
  API_KEY: "_API_TOKEN_KEYS_"
```
Use it as an environment variable in the container:
```yaml
env:
- name: NGC_API_KEY
  valueFrom:
    secretKeyRef:
      name: kh-secret-ngc
      key: API_KEY
```
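Both secrets can also be created directly from the command line instead of via YAML (a sketch; for nvcr.io the username is the literal string `$oauthtoken`, and the NGC API key is assumed to live in the `$NGC_API_KEY` environment variable):
```shell
# Docker registry login secret (equivalent to the .dockerconfigjson secret above)
kubectl create secret docker-registry kh-secret-nvcr \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Opaque secret holding the API key
kubectl create secret generic kh-secret-ngc \
  --from-literal=API_KEY="$NGC_API_KEY"
```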
4 PersistentVolumes
Claiming a persistent disk in k8s takes two steps.
- Persistent Volume creation
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kh-pv-cache
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /raid/models/nim # Replace with the actual local disk path on your node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - saea-dgx-a100-80gb # Replace with the actual node hostname
```
After creation the PV shows `Status: Available` and can be bound automatically by a PVC with the same storageClass.
- Persistent Volume Claim creation
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kh-pvc-cache
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 50Gi
```
You can list all storage classes with `kget storageclass`.
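If the `local-storage` class does not exist on the cluster yet, a minimal definition for manually-created local volumes looks like this (a sketch; `no-provisioner` means PVs are created by hand, and `WaitForFirstConsumer` delays binding until a pod is scheduled):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```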
Now you can define a volume that refers to the PVC and mount it into the container:
```yaml
# Create the volume by referring to the PVC
volumes:
- name: nim-local-cache
  persistentVolumeClaim:
    claimName: kh-pvc-cache
# Mount it into the container
volumeMounts:
- name: nim-local-cache
  mountPath: /opt/nim/.cache
```
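Once the pod is running, the mount can be verified from inside the container (a sketch; nim-xxxx stands for the generated pod name):
```shell
# The model cache downloaded by NIM should land under the mounted path
kubectl exec -it pod/nim-xxxx -- ls -lah /opt/nim/.cache
```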
5 Ports
The container port mapping 8001:8000 from the docker run command is declared as:
```yaml
ports:
- containerPort: 8000 # Container listens on port 8000, as NIM_SERVER_PORT
  hostPort: 8001      # Node exposes port 8001 mapped to the container's port 8000
```
A NodePort Service can be created to expose the port outside the cluster, though it is not used in this example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: kh-svc-port
spec:
  type: NodePort
  selector:
    app: kh-nim-demo
  ports:
  - protocol: TCP
    port: 8002       # node service port
    targetPort: 8001 # pod port
    nodePort: 30080  # external port
```
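Even without the Service, the endpoint is reachable through the hostPort on the node that runs the pod. A quick smoke test (a sketch, assuming the standard OpenAI-compatible endpoints that NIM exposes):
```shell
# From the node itself, via the hostPort mapping
curl http://localhost:8001/v1/health/ready
curl http://localhost:8001/v1/models
```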
6 GPU Specifications
Set the NVIDIA runtime class:
```yaml
runtimeClassName: nvidia
```
- Number of GPUs:
```yaml
resources:
  limits:
    nvidia.com/gpu: "2"
```
- GPU visible devices:
```yaml
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: "0,1"
```
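To confirm the `nvidia` RuntimeClass exists and that the container actually sees the requested GPUs (a sketch; `nvidia-smi` is made available inside the container by the NVIDIA runtime):
```shell
kubectl get runtimeclass nvidia
# From inside the running container: should list the two visible GPUs
kubectl exec -it pod/nim-xxxx -- nvidia-smi -L
```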
7 The Whole Picture
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nim
spec:
  template:
    metadata:
      labels:
        app: kh-nim-demo
    spec:
      imagePullSecrets:
      - name: kh-secret-nvcr
      volumes:
      - name: nim-local-cache
        persistentVolumeClaim:
          claimName: kh-pvc-cache
      runtimeClassName: nvidia
      containers:
      - name: nim-container
        image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
        volumeMounts:
        - name: nim-local-cache
          mountPath: /opt/nim/.cache
        resources:
          limits:
            nvidia.com/gpu: "2"
        ports:
        - containerPort: 8000 # Container listens on port 8000, as NIM_SERVER_PORT
          hostPort: 8001      # Node exposes port 8001 mapped to the container's port 8000
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "0,1"
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: kh-secret-ngc
              key: API_KEY
        - name: NIM_SERVER_PORT
          value: "8000"
        - name: NIM_JSONL_LOGGING
          value: "1"
        - name: NIM_LOG_LEVEL
          value: INFO
        - name: NIM_SERVED_MODEL_NAME
          value: "model_name1, model_name2"
      restartPolicy: Never
```
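Putting it all together, the rough order of operations is (a sketch; the manifest file names are placeholders for wherever you saved the snippets above):
```shell
# One-time setup: secrets, PV and PVC
kubectl apply -f secret-nvcr.yaml -f secret-ngc.yaml
kubectl apply -f pv-cache.yaml -f pvc-cache.yaml

# Submit the job, then watch it come up and follow the logs
kubectl apply -f job.yaml
kubectl get pod -l app=kh-nim-demo -w
kubectl logs -f job/nim
```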