Senior DevOps / Platform Reliability Engineer in the United States at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior DevOps / Platform Reliability Engineer in the United States.
This role sits at the intersection of platform engineering, SRE, and AI-driven operations, supporting a next-generation intelligent automation platform used by enterprise-scale customers. You will be responsible for building and evolving the infrastructure backbone that powers production AI and multi-agent systems at scale. The environment is highly technical and fast-moving, requiring strong ownership of CI/CD, cloud infrastructure, observability, and security. You will work closely with engineering teams to ensure safe, reliable, and scalable deployments across complex distributed systems. A key aspect of the role involves integrating modern AI tools into DevOps workflows to reduce operational toil and improve system intelligence. This is a high-impact position where your work directly shapes platform reliability, developer velocity, and production safety.
Responsibilities
- Own and evolve CI/CD pipelines using modern tools such as GitHub Actions, ensuring safe, scalable, and reversible deployments for microservices and AI workloads
- Design and manage Infrastructure as Code solutions using Terraform and CloudFormation to automate provisioning and environment consistency
- Operate and scale Kubernetes-based infrastructure (EKS + Argo CD), including autoscaling, ingress, security controls, and multi-tenant isolation
- Manage cloud networking and edge infrastructure including Cloudflare, AWS networking services, API gateways, load balancers, and DNS configurations
- Oversee data and event infrastructure such as Aurora MySQL, Redis, S3, and Kafka (MSK), ensuring reliability, backups, and disaster recovery readiness
- Build and maintain serverless and event-driven systems using AWS Lambda where appropriate
- Develop observability platforms using Prometheus, Grafana, and OpenTelemetry, including telemetry for AI/LLM systems and agentic workflows
- Strengthen security and compliance posture (SOC 2, HIPAA) through IAM design, secrets management, scanning, and policy-as-code enforcement
- Drive FinOps initiatives including cost optimization, workload attribution, and LLM usage cost control
- Partner with engineering teams to define deployment standards, operational SLOs, and platform best practices
- Improve system reliability through monitoring, incident response, automation, and continuous infrastructure improvements
- Document infrastructure, processes, and operational standards to enable scalability and knowledge sharing
Requirements
- 5+ years of experience in DevOps, SRE, or Platform Engineering supporting production systems on AWS
- Strong hands-on experience with CI/CD systems such as GitHub Actions, GitLab CI, Jenkins, or CircleCI
- Deep experience operating Kubernetes environments (EKS preferred), including scaling, upgrades, and production operations
- Strong AWS networking knowledge including VPC design, routing, security groups, load balancing, and DNS management
- Proficiency with Terraform and Infrastructure as Code practices, ideally using OIDC-based authentication
- Experience with production databases and storage systems including Aurora/RDS MySQL, Redis, and S3
- Strong observability expertise using Prometheus, Grafana, and OpenTelemetry
- Experience with Argo CD for GitOps-based deployments
- Strong understanding of Cloudflare and AWS edge/networking services
- Experience with Kafka/MSK and event-driven architectures
- Strong scripting skills in Python and Bash, and comfort working in Linux environments
- Solid understanding of security practices including IAM, KMS, secrets management, and supply chain security
- Experience with compliance and vulnerability scanning tools
- Ability to work independently while collaborating effectively in high-ownership engineering teams
Benefits
- Competitive compensation package
- 100% employer-covered employee health premiums
- 75%–80% coverage for dependent health, dental, and vision plans
- 401(k) retirement plan
- Paid parental leave
- Unlimited PTO policy
- Fully remote work flexibility across the United States
- Up to $200/month co-working space reimbursement
- Home office setup stipend of up to $500
- Monthly $100 stipend for internet, phone, and related expenses
- Opportunity to work on cutting-edge AI-native infrastructure and agentic systems
- High-autonomy engineering culture focused on ownership and innovation