Job Description
Principal Site Reliability Engineer
We are seeking an experienced Principal Site Reliability Engineer to join a dynamic Platform Tribe. This role focuses on a high-end, microservice-based platform designed to process billions of financial transactions per day. You will be part of a team pursuing near-zero latency and a smooth connection for global users regardless of bandwidth.
What you will be doing:
Manage day-to-day alerts, system checks, and issue escalation.
Provide 24x7 on-call support for critical SaaS events.
Proactively create monitors within the EKS/K8s ecosystem.
Deploy to clusters using Terraform and Helm/Flux.
Enhance infrastructure health by implementing checks and scripts for known issues.
Maintain and develop deployment code and integrate new Cloud Infrastructure technologies.
Conduct RCA (Root Cause Analysis) and take corrective actions to prevent recurrence.
Collaborate with teams to ensure minimal impact during deployments and updates.
To succeed in this role, you will need:
Proficiency in Kubernetes (deployment, scaling, and troubleshooting).
Experience with GitOps continuous delivery tools like FluxCD or ArgoCD.
Strong experience with issue processing, including RCAs and Postmortems.
Familiarity with AWS, Terraform, Docker, and CI/CD.
Experience with monitoring and logging tools such as Datadog, Prometheus, Grafana, the ELK Stack (Elasticsearch, Logstash, Kibana), or CloudWatch.
Strong understanding of networking concepts and protocols.
Proficiency in at least one scripting or programming language: Python, Go, or Node.js.
Familiarity with incident management tools like PagerDuty, Opsgenie, or VictorOps.
Ready to Apply?
Don't miss this opportunity! Apply now and join our team.
Job Details
Posted Date: February 23, 2026
Job Type: Construction
Location: Indonesia
Company: Explore Group