Deployment on the NRP Nautilus Kubernetes (k8s) cluster with SuperSONIC#
For server-side, large-scale deployment we use the SuperSONIC framework. To begin, clone the repository:

```shell
git clone git@github.com:fastmachinelearning/SuperSONIC.git
```
Deploying the Server#
Deploying the server on Nautilus is as easy as sourcing the setup script:

```shell
source deploy-nautilus-atlas.sh
```
The settings are defined in the `values/values-nautilus-atlas.yaml` file. To select which GPUs to run on, uncomment the lines corresponding to the desired models from the following options:
- NVIDIA-A10
- NVIDIA-A40
- NVIDIA-A100-SXM4-80GB
- NVIDIA-L40
- NVIDIA-A100-80GB-PCIe
- NVIDIA-A100-80GB-PCIe-MIG-1g.10gb
- NVIDIA-L4
- NVIDIA-A100-PCIE-40GB
- NVIDIA-GH200-480GB
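As an illustration, GPU selection on NRP is typically expressed as a Kubernetes node-affinity rule over the `nvidia.com/gpu.product` node label. The following is a sketch only, assuming the values file follows the standard affinity layout; check the commented-out block in `values/values-nautilus-atlas.yaml` for the exact keys:

```yaml
# Hypothetical sketch: restrict scheduling to A100-SXM4 and L40 nodes by
# matching the nvidia.com/gpu.product node label used on NRP.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - NVIDIA-A100-SXM4-80GB
                - NVIDIA-L40
```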
Note

To run on A100s at NRP, you must reserve a time slot. This can be accomplished at this link.
To load multiple models per GPU, edit the `triton.args` string to:

```yaml
args:
  - |
    /opt/tritonserver/bin/tritonserver \
    --model-repository=/traccc-aaS/traccc-aaS/backend/nmodels_<NUMBER_OF_MODELS> \
    --log-verbose=1 \
    --exit-on-error=true
```

and be sure to replace `<NUMBER_OF_MODELS>` with the number of Triton model instances you'd like to load. By default, one model instance is loaded.
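For instance, with four model instances the path passed to `--model-repository` would be formed as follows (a hypothetical shell sketch; `NUM_MODELS=4` is an arbitrary illustrative choice):

```shell
# Hypothetical example: build the model-repository path for 4 model instances
NUM_MODELS=4
MODEL_REPO="/traccc-aaS/traccc-aaS/backend/nmodels_${NUM_MODELS}"
echo "$MODEL_REPO"
```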
Finally, to run across multiple GPUs, set `replicas: n` in the values file, where `n` is the number of GPUs to request. This launches one Triton server per GPU, each loading the number of models you selected.
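In the values file this looks something like the following sketch (the exact key under which `replicas` is nested may differ; consult `values/values-nautilus-atlas.yaml`):

```yaml
# Hypothetical sketch: request 4 GPUs, i.e. 4 Triton server replicas
triton:
  replicas: 4
```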
For more information on configuring the server, consult the SuperSONIC Configuration Guide.
Running the client#
In order for the client to interface with the server, the location of the server needs to be specified. First, ensure the server is running:

```shell
kubectl get pods -n atlas-sonic
```
which produces output like:

```text
NAME                           READY   STATUS    RESTARTS   AGE
envoy-atlas-7f6d99df88-667jd   1/1     Running   0          86m
triton-atlas-594f595dbf-n4sk7  1/1     Running   0          86m
```
Alternatively, use the k9s tool to manage your pods. You can then check that everything is healthy with:

```shell
curl -kv https://atlas.nrp-nautilus.io/v2/health/ready
```
which should produce, somewhere in its output, the lines:

```text
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
```
Then, the client can be run with, for instance:

```shell
python TracccTritonClient.py -u atlas.nrp-nautilus.io --ssl
```
To see what's going on from the server side, run:

```shell
kubectl logs -n atlas-sonic triton-atlas-594f595dbf-n4sk7
```

where `triton-atlas-594f595dbf-n4sk7` is the name of the Triton pod found with the `get pods` command above.
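Because the pod name carries a random hash, it can be convenient to extract it from the listing rather than copy it by hand. A sketch using the sample output above (with a live cluster you would pipe `kubectl get pods -n atlas-sonic` into `awk` instead of the saved listing):

```shell
# Extract the Triton pod name from a saved `kubectl get pods` listing.
# Live-cluster equivalent: kubectl get pods -n atlas-sonic | awk '/^triton/ {print $1}'
PODS='NAME                           READY   STATUS    RESTARTS   AGE
envoy-atlas-7f6d99df88-667jd   1/1     Running   0          86m
triton-atlas-594f595dbf-n4sk7  1/1     Running   0          86m'
TRITON_POD=$(printf '%s\n' "$PODS" | awk '/^triton/ {print $1}')
echo "$TRITON_POD"
```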
To run with `perf_analyzer` and make plots, consult the performance repo.
!!! Important !!!#
Make sure to uninstall the deployment once the server is no longer needed:

```shell
helm uninstall atlas-sonic -n atlas-sonic
```
Make sure to read the Policies before using Nautilus.