Job Description
Company Description
TnT Techies Guide is a premier training and consulting firm dedicated to empowering individuals and businesses in the ever-evolving technology landscape. We specialize in delivering expert-level training, comprehensive guides, and consulting services tailored to meet the specific needs of tech professionals and organizations. Our programs focus on providing cutting-edge knowledge and practical skills across diverse technology domains. With a strong commitment to fostering innovation, we aim to help our clients excel in the fast-paced world of digital transformation.
Role Description
This is a full-time remote role for a Cloud SRE (Site Reliability Engineer). The Cloud SRE will design, implement, and maintain scalable and reliable cloud infrastructure. Daily tasks include monitoring system performance, troubleshooting issues, automating operational processes, optimizing infrastructure costs, and collaborating with teams to ensure high availability of cloud services. The role involves implementing best practices related to cloud environments, enhancing system security, and driving continuous improvements in infrastructure stability, scalability, and performance.
Responsibilities
Design and maintain highly available, fault-tolerant, and multi-region cloud architectures to meet 99.9%+ uptime targets and business SLAs.
Define and manage SLIs, SLOs, and error budgets while leading incident response, root cause analysis (RCA), and postmortem processes.
Deploy, manage, and optimize production-grade Kubernetes clusters (EKS, AKS, GKE) including autoscaling, ingress management, and workload isolation.
Implement GitOps and CI/CD pipelines using tools such as ArgoCD, GitHub Actions, GitLab CI, Jenkins, or Azure DevOps to enable reliable and repeatable deployments.
Provision and manage cloud infrastructure using Infrastructure as Code tools such as Terraform or CloudFormation, ensuring modular and secure design patterns.
Build and maintain comprehensive observability solutions using Prometheus, Grafana, ELK/OpenSearch, Datadog, or OpenTelemetry to improve visibility and reduce MTTR.
Create actionable alerting strategies aligned with SLOs to reduce alert fatigue and improve operational efficiency.
Enforce security best practices including least-privilege IAM, network segmentation, secrets management, and runtime security controls.
Ensure compliance with industry standards such as SOC 2, NIST, PCI DSS, and CIS Benchmarks within cloud and Kubernetes environments.
Conduct performance tuning, capacity planning, chaos engineering, and cost optimization initiatives to improve system scalability and cloud efficiency.
Required Qualifications
5+ years of experience in Cloud Engineering, DevOps, or Site Reliability Engineering roles.
Strong hands-on experience with AWS, Azure, or GCP in production environments.
Production-level Kubernetes experience including scaling, troubleshooting, and cluster hardening.
Expertise in Infrastructure as Code (Terraform preferred).
Experience building and maintaining CI/CD pipelines and automation workflows.
Strong Linux systems knowledge and networking fundamentals (DNS, TCP/IP, load balancing).
Proficiency in scripting languages such as Python, Go, or Bash.