Job Description

We are CirrusLabs. Our vision is to become the world's most sought-after niche digital transformation company that helps customers realize value through innovation. Our mission is to co-create success with our customers, partners and community. Our goal is to enable employees to dream, grow and make things happen. We are committed to excellence. We are a dependable partner organization that delivers on commitments. We strive to maintain integrity with our employees and customers. Every action we take is driven by value. Our well-knit teams and employees are the core of who we are. You are the core of a values-driven organization.

You have an entrepreneurial spirit. You enjoy working as part of well-knit teams. You value the team over the individual. You welcome diversity at work and within the greater community. You aren't afraid to take risks. You appreciate a growth path, developed with your leadership team, that maps out how you can grow inside and outside of the organization. You thrive on company-sponsored continuing education programs that strengthen your skills and help you become a thought leader ahead of the industry curve.

You are excited about creating change because your skills can help the greater good of every customer, industry and community. We are hiring a talented AI/HPC Network Engineer to join our team. If you're excited to be part of a winning team, CirrusLabs (http://www.cirruslabs.io) is a great place to grow your career.

Experience - 5+ years

Location - Bengaluru/Hyderabad

Shift Time - 2 to 11 PM IST

Overview

We are seeking a hands-on AI/HPC Network Engineer to architect, build, and scale our next-generation AI Factory. In this role, you will own the critical "nervous system" of our AI platform: the high-performance network fabric.

This is a rare opportunity to work at the absolute bleeding edge of AI hardware. We are one of the first adopters deploying NVIDIA's GB300 architecture at scale. Moving away from traditional InfiniBand, our environment utilizes a cutting-edge, all-Ethernet architecture powered by NVIDIA Spectrum-X to deliver lossless, low-latency connectivity for massive GPU and CPU clusters.

You will serve as the subject matter expert (SME) for fabric architecture, deep-dive troubleshooting, and performance tuning, ensuring our researchers and data scientists have a highly available, redundant foundation for model training and inference.

This role is not pure "Cloud" DevOps. While we use cloud-native principles, this is a bare-metal infrastructure role. If your experience is limited to clicking buttons in AWS/Azure consoles without understanding physical topology, cabling, or switch OS internals, this is not a match.

Key Responsibilities

High-Performance Fabric Architecture

- Architect and deploy NVIDIA Spectrum-X Ethernet fabrics for massive GPU clusters, designing non-blocking Leaf-Spine topologies tailored for AI Backend (East-West) traffic.
- Implement and tune robust Layer 3 underlay routing (BGP) to support high-performance RoCEv2 (RDMA over Converged Ethernet) traffic.
- Design and configure EVPN/VXLAN overlays to provide workload isolation, multi-tenancy, and seamless integration with Kubernetes CNIs (Calico, Multus, SR-IOV).
- Fine-tune "lossless" Ethernet behavior, including Priority Flow Control (PFC), ECN (Explicit Congestion Notification), and buffer/queue management to eliminate tail latency and microbursts during collective operations (AllReduce/AllGather).

Hardware & Physical Layer Engineering

- Lead the integration of NVIDIA ConnectX-8 SuperNICs and BlueField-3 DPUs, optimizing firmware settings and offload capabilities for maximum throughput.
- Manage the physical connectivity lifecycle by validating and troubleshooting transceiver configurations, DAC/Client cabling, and optical budgets to ensure physical layer errors do not degrade training performance.
- Maintain a deep understanding of the boundary between the Scale-Out network (Ethernet) and the Scale-Up network (NVIDIA NVLink network fabric).
- Troubleshoot performance bottlenecks where network latency impacts UVM (Unified Virtual Memory) consistency or GPU-to-GPU memory access.

Performance Tuning & Reliability

- Define and execute structured benchmarking (using NCCL-tests, iperf, perftest) to validate bandwidth, latency, and congestion behavior.
- Utilize advanced Linux networking tools (within NVIDIA Cumulus Linux on switches and RHEL/Ubuntu on hosts) to optimize the data path, including management of tc, kernel bypass, and driver parameters.
- Build "glass-pane" visibility using NVIDIA NetQ and telemetry streaming, creating alerts for physical layer degradation (CRC errors, link flaps) and logical issues (ECMP hashing polarization, buffer exhaustion).

Cross-Functional Collaboration

- Ensure seamless high-throughput integration with AI-native storage subsystems (WEKA, VAST, NVMe-oF) over the Ethernet fabric.
- Serve as the primary technical lead for escalations with NVIDIA and OEMs regarding switch OS bugs, transceiver compatibility, and ASIC firmware behavior.

Required Qualifications

- 5+ years of infrastructure engineering experience with a primary focus on High-Performance Networking or Datacenter Fabric design.
- Proven experience designing and operating NVIDIA Spectrum-X or Mellanox-based fabrics, with deep familiarity with Cumulus Linux (NVUE/CLI).
- Expert-level knowledge of RoCEv2, lossless Ethernet design, congestion control, and traffic class prioritization (QoS/DSCP).
- Strong command of BGP, EVPN, and VXLAN in a production environment.
- Experience managing physical datacenter connectivity, including optical transceiver selection, fiber types, and debugging Layer 1 issues.
- Comfortable working in Linux environments for both switch OS management and host-side network interface tuning.

Preferred / Nice-to-Have

- Hands-on exposure to NVIDIA BlueField DPUs and the DOCA software framework.
- Understanding of the NVIDIA HGX/NVL72 architecture, NVLink/NVSwitch concepts, and how GPU memory (HBM) interacts with the network.
- Experience with Kubernetes networking (CNI plugins, Multus) or Slurm workload scheduling.
- Proficiency in Ansible or Python for network configuration management and validation.

What Success Looks Like

- A stable Spectrum-X environment where packet loss is eliminated, and effective throughput achieves >95% of line rate.
- ConnectX-8 NICs and DPUs are tuned to seamlessly handle bursty AI traffic patterns without CPU overhead.
- Physical layer issues (bad cables/optics) and logical misconfigurations are detected and isolated via telemetry before they impact model training runs.
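As a rough illustration of the ">95% of line rate" success criterion, the check below compares a measured throughput figure against a port's nominal line rate. This is a minimal sketch only: the function names, the 400 Gb/s port speed, and the measured value are hypothetical and not taken from our environment or from any NVIDIA tool.

```python
def line_rate_efficiency(measured_gbps: float, line_rate_gbps: float) -> float:
    """Return measured throughput as a fraction of the nominal line rate."""
    if line_rate_gbps <= 0:
        raise ValueError("line_rate_gbps must be positive")
    return measured_gbps / line_rate_gbps


def meets_target(measured_gbps: float, line_rate_gbps: float,
                 target: float = 0.95) -> bool:
    """True when effective throughput exceeds the target fraction of line rate."""
    return line_rate_efficiency(measured_gbps, line_rate_gbps) > target


if __name__ == "__main__":
    # Hypothetical numbers for illustration: a 400 Gb/s port measured at 384 Gb/s.
    ratio = line_rate_efficiency(384.0, 400.0)
    print(f"efficiency: {ratio:.1%}, meets target: {meets_target(384.0, 400.0)}")
```

In practice the measured figure would come from a benchmark such as NCCL-tests or perftest, and the threshold would be agreed per fabric tier.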

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.


Job Details

Posted Date: February 25, 2026

Job Type: Construction

Location: Hyderabad, India

Company: CirrusLabs
