Job Description

Job Title: Site Reliability Engineer II

About your role:

We are seeking a highly skilled Technical Lead for AI Development to drive the architecture, design, and execution of advanced AI systems using LLM frameworks, multi-agent architectures, RAG pipelines, and Model Context Protocol (MCP) integrations. The ideal candidate has strong hands-on experience building production-grade AI features, orchestrating agent ecosystems, evaluating model performance, and iterating through continual refinements. You will lead a team of engineers, collaborate with product and research teams, and play a key role in shaping our AI strategy and platform capabilities.

We are looking for a Staff Site Reliability Engineer to help us grow our domain expertise and provide support in a new global region to enable 24x7 development velocity as a global company. From AWS cloud provisioning as code to improving the developer experience in your working timezone, to acting as a guide to best practices around building and delivering software globally, we need an SRE with the passion, motivation, and great ideas to make everything better.

What you’ll do

- Automate the provisioning of all of Juniper Square’s infrastructure in code. Everything we do is in code!
- Partner with our Platform Engineering team on building developer tooling and improving the developer experience through joint initiatives and enhancements.
- Partner with our Data Engineering team on improving our data posture and driving operational excellence.
- Evolve our deployment pipelines to automate infrastructure deployments with the latest and greatest (and reliable) technologies.
- Improve metrics on our main services, and act as a subject matter expert for our global dev teams.
- Enable observability and SLO/SLI reporting, and respond to business-impacting incidents that pertain to infrastructure.
- Adopt and drive solutions that align with the AWS Well-Architected Framework and Juniper Square’s business objectives.
- Identify performance bottlenecks and provide recommendations for improvement.
- Proactively identify and solve problems that we didn’t even know we had.
- Help build, deploy, and scale a load testing environment that is analogous to production.
- Enforce security and operational safety controls.
- Participate in technical roadmap planning and estimation.
- Participate in and contribute to production readiness and architecture review board (ARB) meetings and forums.
- Train and mentor future engineers in the same region.
- Contribute to architectural improvements to meet future scaling and observability requirements.

Qualifications

- A profound love for solving hard problems and overcoming challenging obstacles.
- Putting your customers first, whether they be internal or external, and making them more productive, happy, and successful.
- Experience with AWS. Other public cloud providers are a bonus.
- Experience with PostgreSQL is a must. Additional experience with document databases is a nice-to-have.
- Experience with cloud security best practices (CSPM, CDR, CWPP, SIEM, etc.) to keep our customers and cloud posture secure.
- Experience with containers (builds, registries, vulnerability scanning, runtime with docker-compose, runtime with Tilt, runtime in schedulers/orchestration systems).
- Multi-year hands-on experience and fluency with Kubernetes and Helm charts is an absolute requirement. We live and breathe the k8s ecosystem.
- Experience with a CI/CD pipeline. We use a combination of GitHub Actions, ArgoCD, Helm, and GitOps in our deployment process, but experience with any comparable tooling is fine.
- Experience with an infrastructure-as-code system: Ansible, Terraform, CloudFormation, CDK, etc.
- We use Python and TypeScript, so knowledge of or exposure to either is a strong plus.
- Experience breaking up monolithic architectures into microservices.
- Experience with service meshes and service discovery solutions.
- Experience with an observability solution: New Relic, Prometheus, DataDog, etc.
- Experience with logging systems: CloudWatch, ELK, Splunk, etc.
- Bachelor’s degree in Computer Science or a similar field, or equivalent experience.

Key Responsibilities:

AI Architecture & Development

- Design and implement multi-agent systems, including agent orchestration, delegation, and tool interaction patterns.
- Build scalable RAG (Retrieval-Augmented Generation) architectures using vector databases, embedding pipelines, and data chunking strategies.
- Integrate and extend MCP (Model Context Protocol) tools for robust model-tool communication and workflow automation.
- Lead development of AI-based features, prototypes, and production solutions using LLM APIs or self-hosted models.
- Architect and optimize prompt engineering, prompt chains, agent loops, and refinement pipelines.

Model Evaluation & Continuous Improvement

- Implement and maintain agent evaluation frameworks (agent evals, scenario tests, regression testing).
- Design automated evaluation harnesses for LLM quality, reliability, hallucination control, and performance metrics.
- Drive iterative improvements through A/B testing, reward models, and feedback loops.
- Monitor system performance, latency, cost, and reliability, and implement optimization strategies.

Technical Leadership

- Lead and mentor engineers working on AI, data, and backend components.
- Collaborate with product managers, researchers, and cross-functional teams to align tech strategy with business outcomes.
- Conduct code reviews, enforce best practices, and maintain architectural standards.
- Own technical roadmaps, sprint planning, and engineering execution.

Systems & Infrastructure

- Work with cloud platforms (AWS/GCP/Azure) to deploy scalable AI services.
- Integrate vector databases (Pinecone, Weaviate, Elasticsearch, etc.).
- Build APIs and microservices to expose AI capabilities to internal and external stakeholders.
- Maintain secure, compliant, and efficient data pipelines for ingestion and retrieval.

Qualifications

- Bachelor’s/Master’s degree in Computer Science, Engineering, AI, or a related field.
- 8+ years of software engineering experience with strong backend architecture skills.
- 3+ years of deep experience with LLMs, GPT models, agents, or advanced ML systems.
- Strong hands-on experience with:
  - MCP tools and LLM tool integration
  - Agent frameworks (e.g., OpenAI Agents, LangChain, LlamaIndex, custom agents)
  - RAG pipelines, embedding models, vector stores
  - Agent evaluation, reliability testing, and model refinements
- Proficiency in Python, TypeScript/Node.js, or similar languages.
- Experience deploying LLM apps and APIs in production environments.
- Deep understanding of AI limitations, hallucination control, and safety measures.

Preferred / Nice to Have

- Experience with:
  - Fine-tuning LLMs
  - OpenAI API, Claude, or Azure OpenAI
  - Distributed embeddings and high-throughput retrieval systems
  - MLOps frameworks
- Knowledge of DevOps, CI/CD, and containerization (Docker/Kubernetes).
- Prior leadership experience managing small to mid-size engineering teams.

Ready to Apply?

Don't miss this opportunity! Apply now and join our team.

Apply Now

Job Details

Posted Date: February 28, 2026

Job Type: Construction

Location: India

Company: DevRabbit IT Solutions
