Diagnostic Overview
This guide provides systematic troubleshooting procedures for Agent Factory components within Hybrid Manager environments. Issues typically fall into three categories: infrastructure problems, model serving failures, and application-level errors.
Infrastructure Issues
GPU Resource Problems
Symptoms
- InferenceService pods remain pending
- "Insufficient nvidia.com/gpu" events
- Model initialization timeouts
Diagnostic Steps
# Check GPU node availability
kubectl get nodes -l nvidia.com/gpu=true
# Verify GPU resource allocation (the nvidia.com/gpu row sits near the end of the section)
kubectl describe node <gpu-node-name> | grep -A 10 "Allocated resources"
# Check GPU driver status
kubectl logs -n gpu-operator nvidia-driver-daemonset-<pod>
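To compare nodes at a glance, the allocatable GPU count can be read straight from each node's status. The following is a minimal sketch, assuming GPUs are exposed as the `nvidia.com/gpu` extended resource; the function name `gpu_report` is hypothetical.

```shell
# List each node with its allocatable GPU count and flag nodes that
# advertise none. Assumes GPUs appear as the nvidia.com/gpu resource.
gpu_report() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
  awk '{
    # kubectl prints <none> when the resource is absent on a node
    gpus = ($2 ~ /^[0-9]+$/) ? $2 : 0
    total += gpus
    printf "%s %d%s\n", $1, gpus, (gpus == 0 ? " (no allocatable GPU)" : "")
  }
  END { printf "total %d\n", total }'
}
```

Running `gpu_report` prints one line per node followed by a cluster total. A labeled GPU node that reports zero usually points at the driver or device plugin rather than at scheduling.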
Common Resolutions
Missing GPU Labels
Nodes with GPUs must be properly labeled:
kubectl label nodes <node-name> nvidia.com/gpu=true
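GPU Taint Issues
Verify the taint configuration for dedicated GPU scheduling. A taint must include an effect, in the form key=value:effect, and GPU workloads need a matching toleration:
kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule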
Driver Compatibility
Ensure the NVIDIA driver version meets the CUDA requirements of the NIM containers. Check the driver logs in the gpu-operator namespace for initialization errors.
Storage Access Failures
Symptoms
- Model download failures
- "ImagePullBackOff" status
- Profile cache errors in air-gapped environments
Diagnostic Procedures
# Check secret configuration
kubectl get secret nvidia-nim-secrets -n default -o yaml
# Verify image pull secret
kubectl get secret ngc-cred -n <namespace> -o yaml
# Validate the test pod spec (--dry-run=client does not contact the registry; remove the flag to attempt a real pull)
kubectl run test-pull --image=nvcr.io/nim/nvidia/nvclip:latest --dry-run=client
Resolution Strategies
Registry Authentication
Recreate NGC credentials if authentication fails:
kubectl delete secret ngc-cred -n default
kubectl create secret docker-registry ngc-cred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<NGC_API_KEY> \
  -n default
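After recreating the secret, it can be worth confirming that it actually carries an entry for the registry without printing the credentials. This is a sketch only, assuming the secret is of type `kubernetes.io/dockerconfigjson`; the function name `pull_secret_has_registry` is hypothetical.

```shell
# Succeeds when the pull secret holds an entry for the given registry.
# Arguments: <secret-name> <namespace> <registry-host>
# The decoded credentials are never written to the terminal.
pull_secret_has_registry() {
  kubectl get secret "$1" -n "$2" \
    -o 'jsonpath={.data.\.dockerconfigjson}' |
  base64 --decode |
  grep -qF "\"$3\""
}
```

For example, `pull_secret_has_registry ngc-cred default nvcr.io && echo "registry entry present"`. The check confirms only that an entry exists; it does not validate the API key itself.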
Air-Gapped Environments
Verify that the profile cache is available in object storage and that the path configured in the model deployment specifications is correct.
Model Serving Failures
InferenceService Not Ready
Symptoms
- InferenceService shows "NotReady" status
- Predictor pods crash or restart
- Health check failures
Investigation Commands
# Check InferenceService status
kubectl get inferenceservice <name> -n <namespace>
# Examine detailed conditions
kubectl describe inferenceservice <name> -n <namespace>
# Review pod logs
kubectl logs <predictor-pod> -n <namespace> -c kserve-container
Common Causes and Fixes
Insufficient Memory
NIM models require substantial memory. Check pod resource requests:
kubectl describe pod <predictor-pod> -n <namespace> | grep -A 3 "Requests"
Increase the memory allocation in the InferenceService specification if necessary.
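One way to confirm that memory is the cause before raising limits is to look at why the containers last terminated. The sketch below reads the standard `lastState.terminated.reason` field from the pod status; the function name `pod_was_oom_killed` is hypothetical.

```shell
# Print the reason each container in the pod last terminated and return
# success if any of them was OOMKilled.
# Arguments: <pod-name> <namespace>
pod_was_oom_killed() {
  reasons="$(kubectl get pod "$1" -n "$2" \
    -o 'jsonpath={.status.containerStatuses[*].lastState.terminated.reason}')"
  echo "last termination reasons: ${reasons:-none}"
  case " $reasons " in
    *" OOMKilled "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A call such as `pod_was_oom_killed <predictor-pod> <namespace> && echo "raise the memory limit"` separates out-of-memory restarts from crashes that have other causes.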
Model Loading Timeout
Large models may exceed default initialization timeouts. Adjust readiness probe settings in the InferenceService configuration.
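Before changing probe values, it may help to measure how long the model really takes to become ready. The following sketch uses `kubectl wait` with a generous timeout and reports the elapsed time; the function name `wait_for_isvc` and the 600s default are assumptions, not recommended values.

```shell
# Wait for an InferenceService to report Ready and print how long it took.
# Arguments: <name> <namespace> [timeout, default 600s]
wait_for_isvc() {
  start="$(date +%s)"
  if kubectl wait --for=condition=Ready "inferenceservice/$1" \
       -n "$2" --timeout="${3:-600s}"; then
    echo "ready after $(( $(date +%s) - start ))s"
  else
    echo "not ready within ${3:-600s}"
    # Show the most recent events to hint at what is blocking startup
    kubectl get events -n "$2" --sort-by=.lastTimestamp | tail -n 10
    return 1
  fi
}
```

The elapsed time gives a rough lower bound for the probe's initial delay or failure threshold.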
Profile Mismatch
Ensure cached profiles match the GPU architecture. List compatible profiles and verify cache contents.
High Inference Latency
Symptoms
- Response times exceed SLA requirements
- Token generation rates below expectations
- GPU underutilization
Performance Analysis
# Monitor GPU utilization
kubectl exec <predictor-pod> -n <namespace> -- nvidia-smi
# Check request metrics
kubectl port-forward -n <namespace> svc/<service-name> 9090:9090
# Access metrics at localhost:9090/metrics
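Once the metrics endpoint is reachable, a mean latency can be derived from any Prometheus histogram or summary by dividing its `_sum` series by its `_count` series. This is a minimal sketch; the function name `metric_mean` is hypothetical, the metric name in the usage note is a placeholder, and the code assumes sample lines carry no trailing timestamp.

```shell
# Read Prometheus text-format metrics on stdin and print the mean value of
# a histogram or summary: total of <metric>_sum over total of <metric>_count.
# Argument: <metric base name>
metric_mean() {
  awk -v m="$1" '
    /^#/ { next }                                   # skip HELP and TYPE lines
    { name = $1; sub(/[{].*/, "", name) }           # drop the label set
    name == (m "_sum")   { sum += $NF }
    name == (m "_count") { count += $NF }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else { print "no samples"; exit 1 }
    }'
}
```

With the port-forward running, something like `curl -s localhost:9090/metrics | metric_mean <latency-metric-name>` prints the mean across all label sets. Percentiles need the bucket series and are better computed in Prometheus itself.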
Optimization Approaches
Batch Size Tuning
Adjust batch processing parameters in the ServingRuntime configuration for improved throughput.
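Replica Scaling
Add InferenceService replicas to distribute load. InferenceService typically does not expose a scale subresource, so set the predictor's minimum replica count instead of using kubectl scale (raise maxReplicas as well if it is set lower):
kubectl patch inferenceservice <name> -n <namespace> --type merge -p '{"spec":{"predictor":{"minReplicas":3}}}'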
Resource Allocation
Verify GPU memory allocation matches model requirements. Insufficient GPU memory forces CPU fallback, severely impacting performance.
Langflow and Gen AI Application Issues
For issues with Langflow flows and EDB components — including EDB Knowledge Base retrieval failures, EDB Model Server errors, component configuration after flow import, and flow-level quality issues — see Troubleshooting Langflow in Hybrid Manager.
Log Analysis
Log Locations
Agent Factory components generate logs at multiple levels:
System Logs
- Kubernetes events: kubectl get events -n <namespace>
- Node logs: /var/log/messages or journalctl on GPU nodes
Application Logs
- InferenceService: Predictor pod container logs
- Langflow flows: Application pod logs
- Model Library: HM control plane logs
Metrics and Monitoring
- Prometheus metrics: Available through HM monitoring stack
- Custom dashboards: Accessible via Grafana in HM console
Log Aggregation
Configure centralized logging for comprehensive analysis:
# Stream logs from multiple components
kubectl logs -f -l app=inference-service -n <namespace> --all-containers
# Export logs for analysis
kubectl logs <pod> -n <namespace> --since=1h > diagnostic.log
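Where no central log store is in place yet, the export step can be repeated for every pod matching a label so that each pod gets its own file. A sketch under that assumption follows; the function name `collect_logs` is hypothetical and the one-hour window simply mirrors the command above.

```shell
# Save the last hour of logs from every pod matching a label selector,
# one file per pod, into the given directory.
# Arguments: <label-selector> <namespace> <output-dir>
collect_logs() {
  mkdir -p "$3" || return 1
  # -o name yields entries of the form pod/<pod-name>
  for pod in $(kubectl get pods -n "$2" -l "$1" -o name); do
    kubectl logs "$pod" -n "$2" --all-containers --since=1h \
      > "$3/${pod#pod/}.log"
  done
}
```

For example, `collect_logs app=inference-service <namespace> ./logs` leaves one `.log` file per pod that can be searched or attached to a support case.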
Alert Configuration
Critical Alerts
Configure monitoring alerts for critical conditions:
GPU Availability
Alert when GPU nodes become unavailable or GPU allocation fails.
Model Health
Monitor InferenceService readiness and restart frequency.
Performance Degradation
Track inference latency percentiles and token generation rates.
Resource Exhaustion
Alert on memory pressure, GPU memory saturation, or storage capacity.
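Until alert rules are in place, the restart-frequency condition can be approximated with a manual check. The sketch below lists serving pods whose highest container restart count exceeds a threshold. It relies on the `serving.kserve.io/inferenceservice` pod label; the function name `pods_over_restart_limit` and any threshold passed to it are assumptions.

```shell
# Print serving pods whose highest container restart count exceeds a limit.
# Arguments: <namespace> <restart-limit>
pods_over_restart_limit() {
  kubectl get pods -n "$1" -l serving.kserve.io/inferenceservice \
    -o 'jsonpath={range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].restartCount}{"\n"}{end}' |
  awk -v limit="$2" '{
    worst = 0
    # fields 2..NF hold one restart count per container
    for (i = 2; i <= NF; i++) if ($i + 0 > worst) worst = $i + 0
    if (worst > limit + 0) print $1, worst
  }'
}
```

A call such as `pods_over_restart_limit <namespace> 5` prints nothing when every pod is within the limit, which makes it easy to wrap in a scheduled job.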
Alert Integration
Integrate alerts with organizational notification systems through HM alert manager configuration.
Escalation Procedures
Support Resources
When internal troubleshooting proves insufficient:
- Documentation Review
  - Consult Agent Factory Hub
  - Review model-specific documentation
- Community Resources
  - NVIDIA NIM documentation for model-specific issues
  - KServe community for serving infrastructure problems
- EDB Support
  - Collect diagnostic bundles using HM support tools
  - Include relevant logs, configurations, and error messages
  - Reference specific component versions and deployment specifications
Diagnostic Information Collection
Prepare comprehensive diagnostic information:
# Generate support bundle
kubectl cluster-info dump --output-directory=./cluster-dump
# Collect Agent Factory specifics
kubectl get inferenceservice -A -o yaml > inferenceservices.yaml
kubectl get pods -A -l serving.kserve.io/inferenceservice -o wide > serving-pods.txt
kubectl describe nodes -l nvidia.com/gpu=true > gpu-nodes.txt
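To hand the Agent Factory files over as a single attachment, they can be gathered into a timestamped directory and archived. This is a sketch, not an HM support tool: the function name `make_diag_bundle` and the bundle naming are arbitrary, and it adds `kubectl version` output to cover the component-version request.

```shell
# Collect the Agent Factory diagnostics into a timestamped directory,
# archive it, and print the archive path.
make_diag_bundle() {
  bundle="agent-factory-diag-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$bundle" || return 1
  kubectl version -o yaml > "$bundle/versions.yaml" 2>&1
  kubectl get inferenceservice -A -o yaml > "$bundle/inferenceservices.yaml"
  kubectl get pods -A -l serving.kserve.io/inferenceservice -o wide \
    > "$bundle/serving-pods.txt"
  kubectl describe nodes -l nvidia.com/gpu=true > "$bundle/gpu-nodes.txt"
  tar -czf "$bundle.tar.gz" "$bundle" && echo "$bundle.tar.gz"
}
```

Review the archive for sensitive values before sharing it, since InferenceService specifications can contain environment variables and storage paths.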
Preventive Measures
Regular Health Checks
Implement proactive monitoring:
- Weekly GPU driver and operator status verification
- Daily InferenceService health assessment
- Continuous performance baseline tracking
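The daily InferenceService assessment can be reduced to listing every service whose Ready condition is not True. A minimal sketch follows, reading the standard `Ready` status condition; the function name `not_ready_isvcs` is hypothetical.

```shell
# List InferenceServices across all namespaces whose Ready condition is
# anything other than True. A missing condition is reported as Unknown.
not_ready_isvcs() {
  kubectl get inferenceservice -A \
    -o 'jsonpath={range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' |
  awk 'NF && $2 != "True" { print $1, ($2 == "" ? "Unknown" : $2) }'
}
```

Empty output means every service is ready, so the function fits a scheduled job that notifies only when it prints something.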
Capacity Planning
Monitor resource trends:
- GPU utilization patterns
- Memory consumption growth
- Storage usage projections
- Request volume trends
Update Management
Maintain component currency:
- Track NVIDIA NIM model updates
- Monitor security advisories
- Plan maintenance windows for updates
- Test updates in non-production environments