Job Posting
About Us
At MetaCTO, we specialize in helping startups and growing companies turn visionary ideas into successful digital products through expert app development and fractional CTO services. As a Site Reliability Engineer (SRE), you will play a critical role in ensuring the reliability, scalability, and security of the backend infrastructure that powers innovative applications for our clients. This role will involve managing cloud environments, optimizing databases, automating deployments, and improving system observability.
Job Description
As a Site Reliability Engineer (SRE) at MetaCTO, you will be responsible for designing, implementing, and maintaining highly available, scalable, and secure infrastructure solutions. You will collaborate with software engineers to improve system performance, automate operations, and ensure the smooth functioning of critical backend services. You'll work extensively with cloud platforms like AWS, leveraging technologies such as Terraform, Docker, Kubernetes, and CI/CD pipelines to enhance system reliability.
Responsibilities
Architect, build, and maintain cloud infrastructure on AWS (Lambda, EC2, RDS, S3, EKS, SQS, CloudWatch).
Manage and optimize databases (MySQL, PostgreSQL) for performance, reliability, and security.
Implement monitoring, alerting, and logging solutions to ensure system health and performance, with specific experience using Zabbix and Elastic Logging.
Design and maintain CI/CD pipelines for automated deployment and scaling of applications.
Work with containerization and orchestration tools such as Docker and Kubernetes.
Develop and enforce security best practices for cloud environments and infrastructure.
Automate operational processes using Infrastructure-as-Code (Terraform, CloudFormation) and scripting languages like Python or Bash.
Troubleshoot and resolve infrastructure-related incidents and optimize system performance.
Collaborate with backend engineers to ensure high availability, fault tolerance, and scalable system design, with a strong focus on Django-based applications.
Qualifications
5-10 years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Engineering roles.
Strong expertise in AWS cloud services (EC2, RDS, S3, Lambda, CloudFront, EKS, SQS, IAM).
Hands-on experience with containerization (Docker) and orchestration (Kubernetes, ECS, or EKS).
Deep knowledge of relational databases (MySQL, PostgreSQL), including performance tuning, query optimization, monitoring, and migration management.
Proficiency in Infrastructure-as-Code tools such as Terraform, CloudFormation, or Pulumi.
Strong experience with CI/CD pipelines and automation tools (GitHub Actions, Jenkins, CircleCI, or GitLab CI/CD).
Proficiency in monitoring tools, specifically Zabbix, and logging solutions like Elastic Logging.
Scripting experience with Python, Bash, or Go for automating operational tasks.
Experience working with Django-based applications in a cloud environment.
Experience implementing security best practices for cloud-based applications.
Knowledge of distributed systems and microservices architecture.
Preferred Skills
AWS certifications (Solutions Architect, DevOps Engineer) are a plus.
Experience with serverless computing and event-driven architectures.
Familiarity with message queue services (SQS, RabbitMQ, Kafka).
Understanding of zero-downtime deployments and disaster recovery strategies.
Position Details
Type: Full-Time
Location: 100% Remote
Hours: US Pacific Time
How to Apply
If you are passionate about scalability, automation, and reliability, and thrive in a collaborative, fast-paced environment, we'd love to hear from you. Please submit your resume and an optional brief cover letter outlining your relevant experience.
MetaCTO is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.