AI Infrastructure and Kubernetes Platform

📍 India

Technology ThinkWise Consulting LLP

Job Description

We are seeking a highly skilled AI Infrastructure and Kubernetes Platform Architect with deep expertise in managing GPU-accelerated workloads on NVIDIA DGX systems. The ideal candidate will have hands-on experience with Kubernetes at the administrator, application developer, and security levels (CKA, CKAD, CKS), and will be responsible for designing, deploying, securing, and maintaining large-scale AI infrastructure powered by DGX BasePODs and SuperPODs. This role involves optimizing AI workloads, managing high-performance networking (InfiniBand), and ensuring operational excellence across NVIDIA AI systems and BlueField DPU environments.

Key Responsibilities:

Kubernetes and AI Platform Orchestration
- Architect and maintain containerized AI/ML platforms using Kubernetes on DGX systems.
- Integrate NVIDIA Base Command Manager with Kubernetes for workload scheduling and GPU resource optimization.
- Implement and manage Helm charts, custom controllers, and GPU operators for scalable ML infrastructure.

DGX Infrastructure Administration
- Administer and optimize NVIDIA DGX BasePODs and SuperPODs.
- Ensure optimal GPU, CPU, and storage performance across AI clusters.
- Leverage DGX System Administration best practices for lifecycle management and updates.

High-Performance Networking & DPU
- Deploy, monitor, and manage InfiniBand networks using Unified Fabric Manager (UFM).
- Integrate BlueField DPUs for offloaded security, networking, and storage tasks.
- Optimize end-to-end data pipelines from storage to GPUs.

Security and Compliance
- Apply best practices from the CKS certification to harden Kubernetes clusters and AI workloads.
- Implement secure service mesh and microsegmentation with BlueField DPU integration.
- Conduct regular audits, vulnerability scanning, and security policy enforcement.

Automation & Monitoring
- Automate deployment pipelines and infrastructure provisioning with IaC tools (Terraform, Ansible).
- Monitor performance metrics using GPU telemetry, Prometheus/Grafana, and NVIDIA DCGM.
- Troubleshoot and resolve complex system issues across hardware and software layers.

Required Skills and Qualifications:
- CKA, CKAD, and CKS certifications, demonstrating full-stack Kubernetes expertise.
- Proven experience with NVIDIA DGX systems and AI workload orchestration.
- Hands-on expertise in InfiniBand networking, UFM, and BlueField DPU administration.
- Strong scripting and automation skills in Python, Bash, and YAML.
- Familiarity with Base Command Manager, NVIDIA GPU Operator, and Kubeflow is a plus.
- Ability to work across teams to support ML researchers, DevOps engineers, and infrastructure teams.

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.

Job Details

Posted Date: March 5, 2026
Job Type: Technology
Location: India
Company: ThinkWise Consulting LLP
