Job Description

About the Role

The Site Reliability Engineering Specialist plays a critical role in safeguarding BT’s ability to deliver exceptional service performance, reliability, and availability across its digital platforms. In today’s fast-paced, cloud-driven AI environment, customers expect seamless experiences, and this position ensures those expectations are met by driving scalable, fault-tolerant, and cost-effective solutions. By enabling cross-team collaboration and implementing automation, monitoring, and resilience strategies, the specialist not only minimizes downtime and operational risk but also accelerates innovation and system evolution. This role is pivotal in maintaining BT’s reputation for reliability while empowering the business to adapt quickly to emerging technologies and deliver consistent value to customers worldwide.

Skills required for the job:-

- A degree in IT, Maths or Science
- A deep understanding of full stack monitoring solutions such as Dynatrace, to track the current end-to-end performance and trends of owned CDO applications
- Strong proficiency in one or more programming languages (e.g. Java, Python)
- Experience with cloud platforms (AWS, Azure, or GCP)
- Solid understanding of software architecture, design patterns, and microservices
- Familiarity with CI/CD tools and DevOps practices
- High-quality presentation and reporting skills for collating output from Managed Service Partners
- Ability to adapt to the latest industry trends
- CI/CD/CT pipeline management
- Micro-service functionality

AI-driven Observability & AIOps

- AIOps fundamentals (cross-domain telemetry ingestion, event correlation, topology/context building, and remediation augmentation)
- Agentic/autonomous observability skills (using intelligent agents to detect anomalies, correlate signals, and trigger guarded remediations to cut MTTR)
- AI-assisted alerting & noise reduction (designing contextual, business-impact-aware alerts; prioritization via ML)

What I’ll be doing – your accountabilities:-

1. Executes the implementation of new software development life cycle automation tools, frameworks, and code pipelines (continuous integration/continuous delivery pipelines) whilst applying best practices with a focus on the re-use of application code; demonstrates consistent software delivery practices and produces continuous integration/continuous delivery platform solutions using Amazon Web Services cloud, infrastructure as code (IaC), GitOps, and container technologies
2. Coordinates a diverse team and creates the initial test schedule to deliver all aspects of testing to time, budget and quality targets, producing outlines of solutions and defining the depth of testing required
3. Executes the implementation of automation technologies to ensure repeatability, eliminating toil and reducing mean time to detection, resolution, and repair of services
4. Proactively identifies and manages risk through regular assessment and diligent execution of controls and mitigations, raising any concerns early
5. Leads scale testing to measure, tune and optimise system performance
6. Executes metric/monitoring analysis that creates stability, security, and performance improvements
7. Designs, analyses, develops and troubleshoots highly distributed large-scale production systems spanning on-prem and cloud-based hosting
8. Executes approaches that scale systems sustainably through mechanisms like automation, and evolves systems by pushing for changes that improve reliability and velocity
9. Writes and delivers infrastructure as code software to improve the availability, scalability, latency, and efficiency of services
10. Implements robust monitoring and alerting systems and performs root cause analysis and post-mortems with an eye towards future prevention
11. Inspects queue and support processing to ensure early warning of support issues
12. Executes retrospective and preventive actions after each high severity production incident
13. Analyses complex systems from a reliability and resilience perspective and identifies sources of instability in distributed systems
14. Champions, continuously develops and shares knowledge with the team on emerging trends and changes in site reliability engineering best practices and industry standards
15. Mentors other site reliability engineers, helping to improve the team’s abilities by acting as a technical resource
16. Uses the network of site reliability engineers, removing BT’s organisational boundaries to deliver improvements that are in synergy with initiatives being driven by other SREs

Experience you would be expected to have:-

Incident Response with AI

- LLM-assisted incident workflows (AI summaries, timeline drafting, suggested fixes, and post-mortems integrated with Slack/Teams)
- Runbook automation with AI (building AI-assisted, context-aware runbooks and approval gates for high-risk actions)
- Generative AI for coordination & RCA (using LLMs to accelerate investigation and communications; understanding current accuracy limits and human-in-the-loop needs)

ML Ops for Reliability

- SRE principles applied to ML systems (SLOs/SLIs/error budgets for ML services; capacity planning and model freshness)
- Production ML observability (data/concept/label drift detection, automated retraining triggers, explainability traces)
- Telemetry & visualization for model health (instrumentation with Prometheus/Grafana for drift and degradation)

AI-enhanced Automation & CI/CD

- AI-augmented IaC and pipelines (LLM-generated Terraform/Helm/Ansible, policy enforcement, drift detection in infra)
- AIOps in delivery (change impact hints, automated triage, and GitOps-based auto-remediation)
- AI pair programming ergonomics (using Copilot responsibly; measuring impact on quality/velocity and guardrails)

AI + Chaos Engineering (Resilience)

- Designing AI-guided chaos experiments (intelligent fault selection, anomaly detection during experiments, learning from outcomes)
- Reinforcement-learning-driven fault injection (automated scenario generation to expose latent weaknesses and improve recovery times)
- Operationalizing lessons from chaos + ML (predictive failure analysis and proactive controls)

Platform & Tool Literacy (AI-ready)

- Hands-on with AIOps/observability platforms (event correlation and unified incident views at scale)
- Familiarity with AI-enabled incident tooling (e.g., incident.io/Rootly/PagerDuty/Datadog for AI triage and summaries)
- Human-in-the-loop guardrails (approval policies, rollback safety, and compliance in autonomous actions)
- Outcome measurement for AI adoption (MTTR, alert noise, developer experience/velocity with AI tools)

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.

Apply Now

Job Details

Posted Date: March 20, 2026

Job Type: Construction

Location: India

Company: BT Group
