Job Description

We are seeking a hands-on Site Reliability Engineer (SRE) / AI Platform DevOps Engineer to own infrastructure provisioning, CI/CD automation, telemetry pipelines, and production deployment for AI-powered services, agents, and orchestration systems. This is an SRE-heavy, infrastructure-first role, focused on ensuring that AI systems operating in production are:

Reliable

Observable

Scalable

Secure

Cost-efficient

Safe to deploy and operate

You will play a critical role in building and maintaining the platform foundation that enables AI services to run safely and efficiently at scale.

Key Responsibilities

1. Infrastructure Provisioning & Automation

Design and manage cloud infrastructure using Infrastructure as Code (Terraform or similar)

Provision and maintain Kubernetes clusters and supporting services

Automate environment setup across development, staging, and production

Manage networking, IAM, secrets, storage, and compute scaling

Ensure high availability, resilience, and disaster recovery readiness

2. CI/CD & Deployment Engineering

Build and maintain CI/CD pipelines for AI services, agent frameworks, orchestrators, and model artifacts

Implement automated testing and reliability validation gates

Enable blue/green and canary deployments

Build safe rollback mechanisms for services and models

Integrate reliability and health checks into deployment workflows

3. Model & Agent Deployment Governance

Package, version, and deploy models into containerized environments

Manage model artifact storage and promotion across environments

Monitor model performance and detect degradation

Support retraining cycle integration and model refresh workflows

Ensure safe rollout and rollback of model versions

Implement monitoring for inference latency, throughput, and cost

4. Data Pipelines for Telemetry & Observability

Design and maintain data pipelines to ingest, clean, and process high-volume telemetry (logs, metrics, traces, events)

Enable structured telemetry for AI and orchestration workflows

Ensure reliability for real-time and batch processing

Optimize pipeline scalability and performance

5. AIOps Platform Integration

Evaluate, deploy, and integrate AIOps platforms

Improve anomaly detection, correlation, and alert intelligence

Reduce alert noise and improve signal quality

Integrate AIOps outputs into operational workflows and incident management

6. Intelligent Incident Automation

Automate incident detection and remediation workflows

Build self-healing scripts and intelligent runbooks

Reduce MTTD and MTTR through automation

Integrate AI-driven root cause analysis insights into operational tooling

Improve prevention of recurring incidents

7. Production Reliability & SRE Excellence

Define and manage SLIs, SLOs, and error budgets

Implement monitoring, dashboards, and alerting systems

Participate in on-call rotation

Lead incident triage and root cause analysis

Improve resilience, scaling, and failure handling

Implement circuit breakers, rate limits, and failover mechanisms

8. Security & Governance

Implement least-privilege access controls

Manage secrets and credential rotation

Enforce environment isolation

Ensure auditability and compliance for AI systems

Qualifications

Required Experience

5+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles

Strong hands-on experience with cloud platforms (AWS, Azure, or GCP)

Proven expertise with Kubernetes and containerized workloads

Experience with Infrastructure as Code (Terraform, CloudFormation, etc.)

Strong CI/CD implementation experience (GitHub Actions, GitLab CI, Jenkins, etc.)

Experience building observability stacks (Prometheus, Grafana, OpenTelemetry, ELK, Datadog, etc.)

Experience defining and managing SLIs/SLOs and error budgets

Hands-on experience with incident response and production support

Strong scripting skills (Python, Bash, or similar)

AI/ML Platform Experience (Strongly Preferred)

Experience deploying and managing AI/ML services in production

Familiarity with model packaging, versioning, and artifact management

Understanding of model lifecycle management and retraining workflows

Experience monitoring inference performance, latency, and cost

Exposure to AIOps tools and intelligent alerting systems

Additional Skills

Strong understanding of distributed systems reliability patterns

Knowledge of security best practices in cloud-native environments

Experience implementing high-availability and disaster recovery strategies

Excellent problem-solving and root cause analysis skills

Strong communication skills and ability to collaborate across engineering and AI teams

Additional Information

Discover some of the global benefits that empower our people to become the best version of themselves:

Finance: Competitive salary package, share plan, company performance bonuses, value-based recognition awards, referral bonus

Career Development: Career coaching, global career opportunities, non-linear career paths, internal development programmes for management and technical leadership

Learning Opportunities: Complex projects, rotations, internal tech communities, training, certifications, coaching, online learning platform subscriptions, pass-it-on sessions, workshops, conferences

Work-Life Balance: Hybrid work and flexible working hours, employee assistance programme

Health: Global internal wellbeing programme, access to wellbeing apps

Community: Global internal tech communities, hobby clubs and interest groups, inclusion and diversity programmes, events and celebrations

At Endava, we’re committed to creating an open, inclusive, and respectful environment where everyone feels safe, valued, and empowered to be their best. We welcome applications from people of all backgrounds, experiences, and perspectives, because we know that inclusive teams help us deliver smarter, more innovative solutions for our customers. Hiring decisions are based on merit, skills, qualifications, and potential. If you need adjustments or support during the recruitment process, please let us know.

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.


Job Details

Publication Date: February 27, 2026

Job Type: Technology

Location: Brazil

Company: Endava


DevOps Engineer - AIOps
