Job Description

At Nucleus, reliability is a product feature of every platform we build. We’re hiring a Software Engineer, Infrastructure Reliability, to improve uptime, observability, and incident response across Nucleus’s core infrastructure and platforms. This role is focused on the engineering discipline required to keep critical systems dependable: reducing failure modes, strengthening detection and response, and building the tooling and practices that make infrastructure more understandable and resilient over time. You will work across cloud systems, internal platforms, and production services to raise the reliability baseline for the company as a whole.

What you’ll do

Improve the uptime, resilience, and operational health of Nucleus’s core infrastructure and platform services.

Build observability systems, alerts, dashboards, and tooling that make production environments easier to understand and operate.

Strengthen incident response workflows, root cause analysis practices, and long-term remediation efforts.

Partner with infrastructure and product teams to identify reliability risks and design durable improvements.

Help define service level objectives, reliability standards, and operational best practices across the company.

Automate operational workflows related to health checks, failover, recovery, and routine infrastructure maintenance.

Analyze production incidents and recurring failure patterns to improve architecture, tooling, and team response.

Contribute to capacity planning, resilience testing, and other efforts that prepare systems for growth and change.

What we’re looking for

Strong experience operating production infrastructure or backend systems with meaningful reliability requirements.

Familiarity with observability stacks, on-call practices, incident management, and operational excellence in distributed systems.

Experience with monitoring, logging, tracing, and performance analysis tools in cloud or service-based environments.

Strong software engineering skills in one or more languages such as Go, Python, Java, or Rust.

Comfort debugging across multiple layers of infrastructure, from services and deployments to underlying platform behavior.

Sound judgment around reliability tradeoffs, especially where speed, complexity, and risk must be balanced.

A disciplined, systems-oriented approach to understanding and preventing operational failures.

Interest in building the reliability foundations behind large-scale AI systems.

Why Nucleus

Nucleus is building intelligent systems that people and organizations will depend on in meaningful ways. That trust starts with infrastructure that is observable, resilient, and engineered to recover gracefully under pressure. In this role, you’ll help build the operational backbone of Nucleus—improving not just how systems perform on good days, but how they behave on the hardest ones. Your work will shape the quality and dependability of the platforms the rest of the company builds on. If you care about reliability as both an engineering craft and a force multiplier for ambitious teams, we’d love to hear from you.

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.

Apply Now

Job Details

Posted Date: March 22, 2026

Job Type: Technology

Location: India

Company: Nucleus AI
