SRE (Site Reliability Engineering) in Brazil at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for an SRE (Site Reliability Engineering) in Brazil.
This role sits within an MLOps environment and focuses on the reliability, scalability, and performance of the infrastructure that supports machine learning models and production data pipelines. You will join a collaborative, engineering-driven team working on modern cloud-native systems in a fast-paced setting. The position contributes directly to the stability of critical platforms running on AWS and Kubernetes, with a strong emphasis on automation, observability, and operational excellence.
You will work closely with data, development, and infrastructure teams to improve system resilience and delivery efficiency.
The environment promotes continuous learning, ownership, and proactive problem-solving in complex distributed systems.
This is an opportunity to have a direct impact on large-scale production systems while growing your expertise in SRE, DevOps, and MLOps practices.
Responsibilities:
- Implement and maintain infrastructure as code using Terraform, following established engineering standards and best practices.
- Operate and support Kubernetes clusters using Helm and GitOps methodologies to ensure reliable application delivery.
- Manage day-to-day operations of AWS environments, contributing to platform availability, scalability, and stability.
- Assist in diagnosing and troubleshooting cloud networking issues (VPC, security groups, DNS, load balancers), escalating complex cases when needed.
- Maintain and optimize CI/CD pipelines using GitLab in collaboration with development and data teams.
- Monitor systems using observability tools such as Prometheus, Grafana, and Datadog, supporting incident detection and response.
- Participate in incident management and post-mortem processes, contributing to root cause analysis and preventive improvements.
- Support FinOps initiatives by identifying opportunities for cloud cost optimization.
Requirements:
- Solid hands-on experience with Terraform for infrastructure as code.
- Strong knowledge of AWS cloud services and architecture.
- Intermediate experience with Kubernetes, including Helm and GitOps workflows.
- Experience working with GitLab CI/CD pipelines and version control workflows.
- Ability to troubleshoot networking in cloud environments (VPC, DNS, security groups, load balancers).
- Good understanding of Linux systems administration.
- Familiarity with observability tools such as Prometheus, Grafana, and Datadog.
- Strong analytical thinking and problem-solving skills in distributed systems environments.
- Clear communication skills and ability to collaborate in cross-functional teams.
- Proactive mindset with ownership and willingness to learn and grow in SRE/MLOps contexts.
- Nice to have: exposure to FinOps practices and interest in MLOps or Data Engineering environments.
Benefits:
- Remote work model within Brazil
- Flexible working arrangements
- Competitive compensation package (based on experience)
- Health and dental insurance
- Continuous learning and development opportunities
- Exposure to large-scale cloud and machine learning infrastructure
- Collaborative engineering culture focused on innovation and knowledge sharing
- Career growth opportunities in SRE, DevOps, and MLOps domains
- Inclusion in a diverse and supportive tech community