Job Description
Role Overview
We are looking for an experienced Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our production systems. The ideal candidate will have strong troubleshooting skills, hands-on experience with messaging queues, in-memory queues, Kubernetes, and deployment automation, along with expertise in Infrastructure as Code and microservices architecture.
Key Responsibilities
Application Troubleshooting:
Diagnose and resolve complex application issues in production environments.
Queue Management:
Work with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis) to maintain system performance.
Deployment & Automation:
Manage deployments using CI/CD pipelines and automation tools.
Kubernetes Administration:
Maintain and optimize Kubernetes clusters for high availability and scalability.
Production Support:
Provide support for critical production systems, ensuring uptime and reliability.
Monitoring & Alerting:
Implement and maintain monitoring solutions (Prometheus, Grafana, ELK stack).
Incident Management:
Lead root cause analysis and post-mortem reviews for production incidents.
Must-Have Skills
Strong experience in troubleshooting application issues in distributed systems.
Hands-on experience with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis).
Proficiency in Kubernetes and container orchestration.
Experience with CI/CD pipelines and deployment automation.
Solid understanding of Linux systems, networking, and cloud platforms (AWS, Azure, or GCP).
Infrastructure as Code experience (Terraform, Ansible).
Knowledge of microservices architecture.
Strong scripting and automation skills (Python, Bash, or similar).
Database expertise:
Working experience with MySQL / Oracle / MongoDB.
Nice-to-Have
Experience with WhatsApp Business Messaging APIs and related integration skills.
Experience with security best practices in production environments.
Familiarity with observability tools and performance tuning.
Key Performance Indicators (KPIs)
System Uptime:
Maintain production uptime of 99.9% or higher.
Incident Response Time:
Respond to critical incidents within 15 minutes and resolve them within SLA.
Deployment Success Rate:
Achieve 98%+ successful deployments.
Mean Time to Recovery (MTTR):
Reduce MTTR for production issues to under 60 minutes.
Automation Coverage:
Automate 80%+ of repetitive operational tasks.
Monitoring & Alerting:
Ensure 100% coverage of critical services with proactive alerting.
Infrastructure as Code Adoption:
Maintain 100% IaC compliance for infrastructure changes.
Why join us?
Impactful Work:
Solve meaningful real-life business problems by building cutting-edge products.
Tremendous Growth Opportunities:
Work in a fast-growing CPaaS and product-driven culture with scope for continuous professional development.
Innovative Environment:
Be part of a world-class team that loves solving tough problems and values innovation.
Tanla is an equal opportunity employer. We champion diversity and are committed to creating an inclusive environment for all employees.
www.tanla.com
Ready to Apply?
Don't miss this opportunity! Apply now and join our team.
Job Details
Posted Date:
December 18, 2025
Job Type:
Construction
Location:
India
Company:
Karix