Job Description
ViaPlus is seeking a Lead Cloud SRE to own the reliability, availability, and performance of large-scale, mission-critical platforms running on Microsoft Azure. This role is responsible for maintaining production stability across complex, distributed systems by leading incident response, observability, and reliability engineering initiatives.
The Lead Cloud SRE will work hands-on with Azure infrastructure, Kubernetes-based and VM-hosted microservices, networking, and data platforms to diagnose and resolve high-severity production issues. The role involves deep root-cause analysis using telemetry from Azure Monitor, Application Insights, and Log Analytics, as well as driving long-term remediation through automation, architectural improvements, and systemic fixes.
About ViaPlus:
ViaPlus is a global mobility company in the Intelligent Transportation Systems (ITS) market, specializing in revenue and services management solutions for the transportation industry. Our customer operations, data analytics, and full-featured single-account back-office technology facilitate the high-volume transactions required for seamless multimodal mobility. As a VINCI Concessions subsidiary, we are committed to technical innovation and to promoting a positive mobility experience for all. We are pioneers in the transportation transaction and mobility industry, with a decade of proven global experience in providing solutions focused on the tolling and transit industries.
ViaPlus is headquartered near Dallas, Texas, and maintains offices across the United States and in France, India, and Ireland. We are part of the global network of VINCI Concessions, an international player in transport infrastructure with projects in 23 countries. Our vision has evolved to provide a fully automated, end-to-end transportation solution that significantly improves revenue collection and efficiency while effectively lowering costs for our agency clients.
We serve enterprises that require high-volume, real-time transaction processing with the highest levels of accuracy, especially where revenue reconciliation and customer account management are key deliverables to the customer experience. Our flagship back-office system (BOS) enables Mobility-as-a-Service (MaaS) with a “one account” feature that supports multimodal transportation solutions. In a rapidly changing environment, ViaPlus maintains a strong focus on technology and continuous R&D to improve agency efficiencies, reduce operating expenses, and maximize revenue – all while providing exceptional customer service.
About Indian Operations:
Plan, design, and develop new features for our products | Customize our products on request from our premium clients | Provide end-to-end IT infrastructure set-up and maintenance for global clients | Provide 24/7 support and services to our ASP clients
Job Profile: Lead Cloud SRE
Experience: 12 - 18 yrs
Job Responsibilities:
1. Azure Infrastructure & Virtual Machine Reliability
Diagnose and resolve complex Azure VM issues including boot failures, performance degradation, disk I/O latency, and memory leaks.
Troubleshoot VM Scale Sets, OS-level issues across Linux and Windows, and patching or upgrade failures.
Analyze and remediate network connectivity issues involving NSGs, UDRs, DNS resolution, and routing configurations.
2. Application & Microservices Reliability
Support and troubleshoot microservices-based architectures hosted on AKS and virtual machines.
Identify and resolve inter-service latency, timeouts, retry storms, and cascading failure scenarios.
Diagnose application-level issues such as thread pool exhaustion, memory leaks, misconfigurations, and resource contention.
Eliminate certificate, authentication, and upstream/downstream dependency failures impacting service availability.
3. Azure Service Fabric Operations
Maintain and restore Service Fabric cluster health and stability.
Troubleshoot node failures, replica movement delays, quorum loss, and partition health issues.
Investigate upgrade and rollback failures, ensuring minimal service disruption.
Analyze and optimize both stateful and stateless service behaviours.
4. Traffic Management, Load Balancing & Edge Services
Azure Application Gateway
Troubleshoot HTTP 502/503/504 errors and backend pool health issues.
Debug probe failures, SSL/TLS termination, listener configurations, and routing rules.
Optimize WAF rules for security, performance, and reduced false positives.
Azure Front Door
Diagnose routing, caching, latency issues, and WAF-related traffic blocks.
Investigate backend connectivity, health probes, and geo-routing behaviour.
NGINX / Reverse Proxies
Debug connection resets, upstream timeouts, and worker exhaustion.
Tune timeouts, buffers, keep-alive settings, and load-balancing strategies for high availability.
5. Database & Data Layer Reliability
Troubleshoot Azure SQL, Managed Instances, PostgreSQL, MySQL, and Cosmos DB.
Analyze slow queries, deadlocks, connection pool exhaustion, and resource contention.
Manage failovers, replication lag, throttling issues (DTU/RU limits), and high availability scenarios.
Collaborate on query optimization, execution plans, and indexing strategies.
6. Observability, Monitoring & Incident Management
Perform deep-dive analysis using Azure Monitor, Application Insights, and Log Analytics (KQL).
Correlate infrastructure, application, and dependency telemetry to identify root causes.
Lead P0/P1 incident bridges, driving coordinated resolution under pressure.
Produce blameless root cause analyses (RCAs) with actionable corrective and preventive measures.
7. Site Reliability Engineering (SRE) Practices
Define, measure, and continuously improve SLIs, SLOs, and error budgets.
Drive automation and self-healing solutions to improve service reliability.
Improve deployment safety through blue-green, canary, and progressive delivery strategies.
Continuously reduce MTTR and eliminate recurring incidents through systemic fixes.
Skill Set:
· 12+ years of experience in production support / SRE roles
· Strong hands-on experience with Microsoft Azure
· Expert-level troubleshooting of:
  · Azure VMs & networking
  · Azure Application Gateway, Azure Front Door
  · NGINX / reverse proxies
  · Microservices & distributed systems
· Strong Linux internals & networking fundamentals
· Experience handling high-severity production incidents
· Security exposure (Azure WAF, Defender for Cloud)
· Ability to mentor L1/L2 teams
Qualifications:
Any graduate with a B.E./B.Tech, MCA, or equivalent degree and 12+ years of relevant work experience.
Cloud certifications preferred:
· Microsoft Certified: Azure Administrator Associate (AZ-104)
· Microsoft Certified: DevOps Engineer Expert (AZ-400)
· Microsoft Certified: Solutions Architect Expert (AZ-305)
Disclaimer:
ViaPlus is an equal opportunity employer and is committed to building a diverse and inclusive workplace. All qualified applicants will receive consideration for employment without regard to race, ethnicity, religion, gender, sexual orientation, national origin, age, disability, marital status, or any other characteristic protected by applicable law.
We invite you to explore this opportunity to be a part of the ViaPlus family.