Job Description
Senior Site Reliability Engineer – Vellox Aviation Group
About the Role
Vellox develops and operates mission-critical software for aviation organizations where system reliability directly affects flight operations, regulatory compliance, and operational safety.
The Senior Site Reliability Engineer is responsible for the availability, performance, observability, and recoverability of production systems supporting flight operations, maintenance, and compliance workflows.
The role owns production reliability as systems scale, migrate, and evolve in regulated aviation environments.
What You Will Own
Reliability, Ownership, and Service Health
· Own availability, latency, throughput, and durability for production systems
· Define and maintain service level indicators and service level objectives
· Manage error budgets to guide engineering and operational decisions
· Ensure reliability targets are met consistently
Production Architecture and Resilience
· Design and operate highly available multi-availability zone and multi-region architectures
· Ensure controlled and observable failure behavior
· Define redundancy, graceful degradation, and automated recovery strategies
· Validate failover and recovery through testing
Incident Response and Operational Maturity
· Lead response to production incidents
· Own root cause analysis focused on systemic contributors
· Drive remediation actions to completion
· Reduce incident frequency, severity, and blast radius over time
Observability and Operational Insight
· Design centralized logging, metrics, alerting, and dashboards
· Define observability standards tied to customer impact
· Ensure alerts are actionable and low noise
· Use operational data for capacity planning and scaling decisions
Automation and Toil Reduction
· Identify and eliminate manual or repetitive operational tasks
· Build automation to reduce operational risk
· Standardize operational workflows
· Treat simplicity as a reliability requirement
Data and Database Reliability
· Own production database reliability
· Design replication, backup, restore, and failover strategies
· Validate recovery procedures regularly
· Lead migrations to managed cloud databases such as AWS RDS or Aurora
Technical Qualifications
Cloud and Infrastructure
· Hands-on experience operating production systems on AWS or Azure
· Strong understanding of networking, IAM, load balancing, and managed services
· Ability to balance cost, reliability, and operational complexity
Distributed Systems
· Experience operating distributed systems in production
· Strong understanding of partial failure and recovery patterns
· Ability to diagnose cross-stack production issues
Observability and Operations
· Experience with centralized logging, metrics, and alerting
· Ability to design alerts based on service impact
· Experience driving improvement from operational data
Programming and Automation
· Strong scripting skills using Python, Node.js, or shell
· Ability to write production-grade operational tooling
· Comfort modifying application code to improve reliability
Databases
· Experience migrating databases from EC2 instances to AWS RDS, specifically Microsoft SQL Server (MSSQL)
· Working knowledge of Windows, including access via RDP and operational dashboards
· Experience operating relational databases in production
· Experience with replication, backup, restore, and failover
· Experience migrating legacy databases to managed services (preferred)
Preferred Experience
· Experience in regulated or safety-critical industries such as aviation
· Deep experience operating write-heavy databases and migrating from Azure to AWS
· Familiarity with compliance, auditability, and traceability requirements
· Experience supporting systems with direct operational impact
Ready to Apply?
Don't miss this opportunity! Apply now and join our team.
Job Details
Posted Date: March 15, 2026
Job Type: Construction
Location: India
Company: Vellox Group