How to deploy AI Models from the Model Library Innovation Release
Prerequisite: Access to the Hybrid Manager UI with AI Factory enabled. See AI Factory in Hybrid Manager.
This guide explains how to deploy AI models from the AI Factory Model Library into Model Serving (powered by KServe) in your Hybrid Manager (HM) environment.
Once deployed, these models power key AI Factory features:
- Knowledge Bases (via AIDB pipelines)
- Gen AI Builder Assistants and pipelines
- Other AI Factory and application integrations
Who should use this guide?
- AI platform admins deploying validated model images
- Data engineers configuring AI models for Knowledge Bases
- AI application developers configuring models for Assistants
What this enables
Once deployed:
- Your AI models are available in Model Serving.
- You can link them to Knowledge Bases or Gen AI Builder pipelines.
- You can monitor and manage deployed models via the HM Model Serving UI or Kubernetes.
Estimated time to complete
10–20 minutes per model, depending on model size and cluster resources.
Prerequisites
Before you begin:
- An active HM environment with GPU worker nodes configured.
- For a full setup guide, see: Setup GPU Resources for Model Serving
- Prepare credentials for the model providers (NVIDIA NIM and Hugging Face).

If you use NVIDIA NIM models from the public internet, you need two credentials.

The `nvidia-nim-secrets` secret is used to download NIM profiles:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nvidia-nim-secrets
  namespace: default
  annotations:
    replicator.v1.mittwald.de/replicate-to: m-.*
type: Opaque
data:
  NGC_API_KEY: <base64 encoded NGC API Key>
```
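Secret `data` values must be base64 encoded. A minimal sketch for producing the `NGC_API_KEY` value (the key below is a hypothetical placeholder, not a real key):

```shell
# Encode without a trailing newline; `echo` would add one and corrupt the value.
NGC_API_KEY="nvapi-example-key"   # hypothetical placeholder key
ENCODED=$(printf '%s' "$NGC_API_KEY" | base64)
echo "$ENCODED"
```

Paste the printed value into the `NGC_API_KEY` field under `data:`.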
The `ngc-cred` secret is used to pull NIM images from the NVIDIA registry (nvcr.io):
```shell
$ kubectl -n default create secret docker-registry ngc-cred \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=${NGC_API_KEY}
$ kubectl -n default annotate secret ngc-cred \
    replicator.v1.mittwald.de/replicate-to='m-.*'
```

If you already store profiles in object storage and images in your private registry, you don't need these two secrets. See: How-To Use NVIDIA NIM Model Cache in Air‑Gapped Clusters in Hybrid Manager.
If you use a private Hugging Face model, create the following secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: default
  annotations:
    replicator.v1.mittwald.de/replicate-to: m-.*
type: Opaque
data:
  HF_TOKEN: <base64 encoded HF API Key>
```
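The same manifest can be generated from a plain token with a small shell sketch (the token value is a placeholder); pipe the output to `kubectl apply -f -` to create the secret:

```shell
# Hypothetical token value; substitute your real Hugging Face token.
HF_TOKEN="hf_example_token"
MANIFEST=$(cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: hf-secret
  namespace: default
  annotations:
    replicator.v1.mittwald.de/replicate-to: m-.*
type: Opaque
data:
  HF_TOKEN: $(printf '%s' "$HF_TOKEN" | base64)
EOF
)
echo "$MANIFEST"
# Apply it with: echo "$MANIFEST" | kubectl apply -f -
```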
Steps to deploy an AI model
1. Creating a model in Asset Library
- Go to Asset Library > Models.
- Select Add New Model.
- Configure parameters:
- Model Name. Must consist of lower-case alphanumeric characters, '-', or '.', and must start and end with an alphanumeric character.
- Description
- Tags
- Functions. Must include at least one function that starts with "aidb-".
- AI Model Provider
- If you select the NIM provider:
- Image URL, for example, nvcr.io/nim/openai/gpt-oss-20b:latest.
- If you select HuggingFace, fill in one of the following fields:
- Hugging Face Model Name, for example, openai/gpt-oss-20b. The model must be on Hugging Face.
- Object Storage Path, for example, /models/openai/gpt-oss-20b. The model must already be copied to the path.
- Default CPU/Memory/GPU
- API Protocol Version. Select the model's API protocol; if unsure, leave it empty.
- Max Token Length. The expected context window size of the model; leave it empty to use the model's default value.
- README
- Select Add AI Model.
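The Model Name rule above can be checked locally. This sketch mirrors the constraint with a regular expression (the helper and sample names are illustrative, not part of the product):

```shell
# Lower-case alphanumerics, '-' or '.', starting and ending alphanumeric.
valid='^[a-z0-9]([a-z0-9.-]*[a-z0-9])?$'
check() {
  if printf '%s' "$1" | grep -Eq "$valid"; then
    echo "$1: valid"
  else
    echo "$1: invalid"
  fi
}
check gpt-oss-20b      # valid
check Arctic-Embed     # invalid: upper-case letters
check -bad-start       # invalid: must start with an alphanumeric character
```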
2. Browsing and selecting model in Asset Library
- Go to Asset Library > Models.
- Browse models.
- Select the model you want to deploy.
3. Configuring and deploying the model
- Select "Create Local Inference Service".
- Configure deployment parameters:
- Local Inference Service Name
- Tags
- Model Serving Name. The name you want to use in API calls; leave it empty to use the default value.
- Model Profiles Path on Object Storage. The NIM profile path in object storage; ignore this field when the model does not use the NIM provider. See How-To Use NVIDIA NIM Model Cache in Air‑Gapped Clusters in Hybrid Manager.
- Inference Service Instances
- Resource requests/limits (GPU, CPU, and memory)
- Max Token Length. Overrides the value defined in the model; leave it empty to use the value assigned in the model.
- Select "Create Local Inference Service".
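Under the hood, each deployment is a KServe InferenceService. The UI fields above map roughly onto a spec like this sketch (all names and values are hypothetical, and the exact manifest Hybrid Manager generates may differ):

```yaml
# Illustrative fragment only, not a manifest you need to apply yourself.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gpt-oss-20b          # hypothetical Local Inference Service Name
spec:
  predictor:
    minReplicas: 1           # Inference Service Instances
    model:
      resources:             # resource requests/limits from the UI
        requests:
          cpu: "4"
          memory: 32Gi
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"
```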
4. Verifying the deployed model
You can verify your deployed models using:
Model Serving UI in HM
- Go to Estate > Inference Services.
- Confirm the inference service appears with status Active/Healthy.
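The same check can be scripted. On a live cluster you would read the status with `kubectl -n <namespace> get inferenceservice <name> -o json`; the sketch below runs against a canned status document so the extraction logic is visible (the service name in the comment is hypothetical):

```shell
# On a live cluster, fetch the status with, for example:
#   kubectl -n default get inferenceservice gpt-oss-20b -o json
# Here a canned status document stands in so the snippet is runnable anywhere.
status_json='{"status":{"conditions":[{"type":"Ready","status":"True"}]}}'
ready=$(printf '%s' "$status_json" \
  | grep -o '"type":"Ready","status":"[^"]*"' \
  | cut -d'"' -f8)
echo "Ready=$ready"   # prints Ready=True when the service is healthy
```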
5. Connecting the model to AI Factory workloads
Once the model is Ready, you can select it in:
- Knowledge Base pipelines (for embedding or reranking)
- Gen AI Builder pipelines
- Assistant configurations
The UI will show models available for each use case based on their type (Embedding, Completion, Reranking, etc.).
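Many served models, NIM LLMs in particular, expose an OpenAI-compatible API; assuming that holds for your model, a request body can be assembled like this (the Model Serving Name and endpoint are placeholders):

```shell
MODEL="gpt-oss-20b"   # hypothetical Model Serving Name from step 3
BODY=$(cat <<EOF
{"model": "$MODEL", "messages": [{"role": "user", "content": "Hello"}]}
EOF
)
echo "$BODY"
# Send it with, for example:
#   curl -s -X POST http://<inference-endpoint>/v1/chat/completions \
#     -H 'Content-Type: application/json' -d "$BODY"
```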
Supported model types
| Model type | Example model |
|---|---|
| Text Completion | llama-3.3-nemotron-super-49b |
| Text Embedding | arctic-embed-l |
| Image Embedding | nvclip |
| OCR | paddleocr |
| Text Reranker | llama-3.2-nv-rerankqa-1b-v2 |
Tips & best practices
- GPU placement: Ensure your cluster's GPU capacity matches the model's requirements. Large models such as llama-3.3-nemotron-super-49b require multiple GPUs on a single node.
- Quota management: Limit number of large models deployed simultaneously to avoid overloading GPU nodes.
- Version testing: Test new model versions in isolated deployments before promoting to production pipelines or Assistants.
Troubleshooting
Model stuck in Pending
- Check GPU node taints/labels.
- Verify InferenceService tolerations and nodeSelectors match.
Model not appearing in Model Library
- Confirm image is correctly tagged and synced via Image and Model Library.
- Verify repository rules if using private registry.
Kubernetes errors on deploy
- Check `kubectl describe inferenceservice <model>` for detailed error logs.
Summary
- You can deploy AI models from the AI Factory Model Library.
- Deployed models run via KServe Model Serving.
- Deployed models power Knowledge Bases and Gen AI Builder Assistants.
- The deployment flow ensures consistent governance and visibility.