Job Description

Principal Site Reliability Engineer

We are seeking an experienced Principal Site Reliability Engineer to join a dynamic Platform Tribe. This role focuses on a high-end, microservice-based platform designed to process billions of financial transactions per day. You will be part of a team pursuing near-zero latency and a smooth connection for users worldwide, regardless of bandwidth.

What you will be doing:
- Manage day-to-day alerts, system checks, and issue escalation.
- Provide 24x7 on-call support for critical SaaS events.
- Proactively create monitors within the EKS/K8s ecosystem.
- Deploy to clusters using Terraform and Helm/Flux.
- Enhance infrastructure health by implementing checks and scripts for known issues.
- Maintain and develop deployment code and integrate new Cloud Infrastructure technologies.
- Conduct RCA (Root Cause Analysis) and take corrective actions to prevent recurrence.
- Collaborate with teams to ensure minimal impact during deployments and updates.

To succeed in this role, you will need:
- Proficiency in Kubernetes (deployment, scaling, and troubleshooting).
- Experience with GitOps continuous delivery tools like FluxCD or ArgoCD.
- Strong experience with incident handling, including RCAs and postmortems.
- Familiarity with AWS, Terraform, Docker, and CI/CD.
- Experience with monitoring and logging tools: DataDog, Prometheus, Grafana, and ELK Stack (Elasticsearch, Logstash, Kibana) or CloudWatch.
- Strong understanding of networking concepts and protocols.
- Proficiency in at least one scripting or programming language: Python, Go, or Node.js.
- Familiarity with incident management tools like PagerDuty, Opsgenie, or VictorOps.

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.

Apply Now

Job Details

Posted Date: February 23, 2026

Job Type: Construction

Location: Indonesia

Company: Explore Group
