Job Description
About T-Mobile:
T-Mobile US, Inc. (NASDAQ: TMUS), headquartered in Bellevue, Washington, is America’s supercharged Un-carrier, connecting millions through its strong nationwide network and flagship brands, T-Mobile and Metro by T-Mobile. Customers benefit from an unmatched combination of value, quality, and exceptional service experience.
About TMUS Global Solutions:
TMUS Global Solutions is a world-class technology powerhouse accelerating the company’s global digital transformation. With a culture built on growth, inclusivity, and global collaboration, the teams here drive innovation at scale, powered by bold thinking.
TMUS India Private Limited operates as TMUS Global Solutions.
Principal Engineer, Site Reliability - Supply Chain
Location:
T-Mobile India
Job Overview:
At T-Mobile, we don’t just build technology; we empower people. We believe in investing in YOU: your growth, your impact, and your future. We’re unstoppable when individuals like you come together to solve bold challenges, inspire innovation, and build platforms that serve millions.
As a Principal Site Reliability Engineer, you’ll join a world-class engineering team focused on building and scaling intelligent infrastructure for LLM-based applications, AI services, and enterprise-scale backend systems. You’ll contribute to the design and implementation of observability, automation, and incident response strategies that ensure our platforms are high-performing, reliable, and cost-effective. You’ll play a key role in driving operational excellence, supporting platform scalability, and collaborating across engineering and architecture teams. This role provides growth opportunities to influence large-scale architecture and AI/ML reliability.
Key Responsibilities:
Design, develop and maintain observability, monitoring, and alerting systems for AI platforms and mission-critical backend services.
Design telemetry pipelines, logging infrastructure, and metrics dashboards using tools such as Splunk, Prometheus, Grafana, and OpenTelemetry.
Define and maintain SLOs, SLIs, and real-time health indicators across platform services and APIs.
Participate in on-call rotations and lead the resolution of high-impact incidents, including root cause analysis and postmortem reporting.
Collaborate with platform engineering teams to enforce governance, compliance, and security standards in production environments.
Enhance deployment pipelines, CI/CD workflows, and infrastructure automation (e.g., GitLab).
Optimize and scale infrastructure components such as Kafka, HAProxy, RMQ, databases, and distributed APIs.
Support capacity planning, cost analysis, and system tuning to improve platform performance.
Advocate for automation-first operations, reducing manual toil through scripting and reliability tooling.
Create and maintain documentation, runbooks, and knowledge-sharing resources across SRE and engineering teams.
Mentor junior engineers and foster a culture of technical rigor and continuous improvement.
Qualifications:
Bachelor’s degree in Computer Science, Engineering, or a related field (Master’s preferred).
10+ years of experience in SRE, DevOps, or operations engineering in cloud-based environments, with 15+ years overall in the technology sector.
Hands-on experience with monitoring, alerting, and incident response in distributed systems.
Strong coding and scripting skills in Python, Java, or shell scripting languages such as Bash or PowerShell.
Solid understanding of database principles and experience with distributed data platforms such as Oracle, Cassandra, Solr, and Kafka.
Proficiency in CI/CD pipelines and GitLab workflows.
Strong working knowledge of SQL and NoSQL databases, including Oracle and Cassandra.
Expertise in Linux, networking concepts (TLS/SSL, DNS, load balancers), and troubleshooting large-scale environments.
Familiarity with AI/ML systems, APIs, and modern LLM tooling is a strong plus.
Expertise in observability tools such as Splunk, Grafana, and Prometheus.
Experience with Kubernetes, container orchestration, and hybrid/multi-cloud deployments (Azure preferred; AWS/GCP/OCI acceptable).
Deep understanding of security concepts and protocols, including authentication, authorization, encryption, SSL/TLS, SSH/SFTP, PKI, X.509 certificates, and PGP.
Excellent knowledge of ITIL/ServiceNow terminology for incident and problem management.
Proven ability to work in fast-paced, incident-driven environments with high uptime requirements.
Preferred Qualifications:
Experience supporting AI workloads, model inference systems, or LLM-enabled platforms.
Exposure to AIOps or related ML platform observability and reliability practices.
Familiarity with LangChain, OpenAI, Spring AI, and MCP Server is a strong plus.
Experience in highly regulated telecom environments with compliance and audit controls.
Understanding of AI Gateway patterns and secure API orchestration.
Background in building secure, zero-downtime platforms with enterprise-scale SLAs.
Knowledge, Skills, and Abilities:
Strong grasp of SRE best practices, including SLOs, SLIs, postmortems, and chaos engineering.
Ability to diagnose system bottlenecks across infrastructure, application, and network layers.
Expertise in driving automation across observability, configuration, and deployment domains.
Excellent communication and collaboration skills in cross-functional technical teams.
Curiosity-driven mindset with a passion for learning emerging AI technologies and improving system reliability.
Strong commitment to automating processes for proactive monitoring, anomaly detection, and alerting.
Why Join T-Mobile India?
At T-Mobile India, you won’t just contribute to world-class technology; you’ll help build it. You’ll work with global leaders, solve complex system challenges, and build platforms that redefine how technology powers customer experience.
We’re more than just a telecom company: we’re a technology powerhouse leading the way in AI, data, and digital innovation. And we do it all with heart, grit, and a passion for empowering people.
Join us and shape the future of intelligent platforms that serve millions — at the scale and speed of T-Mobile.
Disclaimer: TMUS India Private Limited, operating as TMUS Global Solutions, has engaged ANSR, Inc. ("ANSR") as its exclusive recruiting partner. This means that any communications regarding TMUS Global Solutions opportunities or employment offers will be issued only through ANSR and the 1Recruit platform. If you receive a communication or offer from another individual or entity, please notify TMUS Global Solutions immediately.
TMUS Global Solutions will never seek any payment or other compensation during the hiring process or request sensitive personal data (such as bank details or government-issued identification numbers) prior to a candidate’s acceptance of a formal offer.