Job Description

Role Overview

We are looking for an experienced Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our production systems. The ideal candidate will have strong troubleshooting skills, hands-on experience with messaging queues, in-memory queues, Kubernetes, and deployment automation, along with expertise in Infrastructure as Code and microservices architecture.

Key Responsibilities

Application Troubleshooting: Diagnose and resolve complex application issues in production environments.

Queue Management: Work with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis) to maintain system performance.

Deployment & Automation: Manage deployments using CI/CD pipelines and automation tools.

Kubernetes Administration: Maintain and optimize Kubernetes clusters for high availability and scalability.

Production Support: Provide support for critical production systems, ensuring uptime and reliability.

Monitoring & Alerting: Implement and maintain monitoring solutions (Prometheus, Grafana, ELK stack).

Incident Management: Lead root cause analysis and post-mortem reviews for production incidents.

Must-Have Skills

Strong experience in troubleshooting application issues in distributed systems.

Hands-on experience with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis).

Proficiency in Kubernetes and container orchestration.

Experience with CI/CD pipelines and deployment automation.

Solid understanding of Linux systems, networking, and cloud platforms (AWS, Azure, or GCP).

Infrastructure as Code experience (Terraform, Ansible).

Knowledge of microservices architecture.

Strong scripting and automation skills (Python, Bash, or similar).

Database expertise: Working experience with MySQL / Oracle / MongoDB.

Nice-to-Have

Experience with WhatsApp Business Messaging APIs and related integration skills.

Experience with security best practices in production environments.

Familiarity with observability tools and performance tuning.

Key Performance Indicators (KPIs)

System Uptime: Maintain production uptime of 99.9% or higher.

Incident Response Time: Respond to critical incidents within 15 minutes and resolve within SLA.

Deployment Success Rate: Achieve 98%+ successful deployments.

Mean Time to Recovery (MTTR): Reduce MTTR for production issues to under 60 minutes.

Automation Coverage: Automate 80%+ of repetitive operational tasks.

Monitoring & Alerting: Ensure 100% coverage of critical services with proactive alerting.

Infrastructure as Code Adoption: Maintain 100% IaC compliance for infrastructure changes.

Why join us?

Impactful Work: Solve meaningful real-life business problems by building cutting-edge products.

Tremendous Growth Opportunities: Work in a fast-growing CPaaS and product-driven culture with scope for continuous professional development.

Innovative Environment: Be part of a world-class team that loves solving tough problems and values innovation.

Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.

www.tanla.com

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.


Job Details

Posted Date: December 18, 2025

Job Type: Construction

Location: India

Company: Karix

Site Reliability Engineer (CPaaS)
