Senior DevOps / Platform Reliability Engineer in United States at Jobgether

Job Function: Engineering
Jobgether
United States

Job Description

Senior DevOps / Platform Reliability Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior DevOps / Platform Reliability Engineer in the United States.

This role sits at the intersection of platform engineering, SRE, and AI-driven operations, supporting a next-generation intelligent automation platform used by enterprise-scale customers. You will be responsible for building and evolving the infrastructure backbone that powers production AI and multi-agent systems at scale. The environment is highly technical and fast-moving, requiring strong ownership of CI/CD, cloud infrastructure, observability, and security. You will work closely with engineering teams to ensure safe, reliable, and scalable deployments across complex distributed systems. A key aspect of the role involves integrating modern AI tools into DevOps workflows to reduce operational toil and improve system intelligence. This is a high-impact position where your work directly shapes platform reliability, developer velocity, and production safety.

Accountabilities:
  • Own and evolve CI/CD pipelines using modern tools such as GitHub Actions, ensuring safe, scalable, and reversible deployments for microservices and AI workloads
  • Design and manage Infrastructure as Code solutions using Terraform and CloudFormation to automate provisioning and environment consistency
  • Operate and scale Kubernetes-based infrastructure (EKS + Argo CD), including autoscaling, ingress, security controls, and multi-tenant isolation
  • Manage cloud networking and edge infrastructure including Cloudflare, AWS networking services, API gateways, load balancers, and DNS configurations
  • Oversee data and event infrastructure such as Aurora MySQL, Redis, S3, and Kafka (MSK), ensuring reliability, backups, and disaster recovery readiness
  • Build and maintain serverless and event-driven systems using AWS Lambda where appropriate
  • Develop observability platforms using Prometheus, Grafana, and OpenTelemetry, including telemetry for AI/LLM systems and agentic workflows
  • Strengthen security and compliance posture (SOC 2, HIPAA) through IAM design, secrets management, scanning, and policy-as-code enforcement
  • Drive FinOps initiatives including cost optimization, workload attribution, and LLM usage cost control
  • Partner with engineering teams to define deployment standards, operational SLOs, and platform best practices
  • Improve system reliability through monitoring, incident response, automation, and continuous infrastructure improvements
  • Document infrastructure, processes, and operational standards to enable scalability and knowledge sharing

Requirements:
  • 5+ years of experience in DevOps, SRE, or Platform Engineering supporting production systems on AWS
  • Strong hands-on experience with CI/CD systems such as GitHub Actions, GitLab CI, Jenkins, or CircleCI
  • Deep experience operating Kubernetes environments (EKS preferred), including scaling, upgrades, and production operations
  • Strong AWS networking knowledge including VPC design, routing, security groups, load balancing, and DNS management
  • Proficiency with Terraform and Infrastructure as Code practices, ideally using OIDC-based authentication
  • Experience with production databases and storage systems including Aurora/RDS MySQL, Redis, and S3
  • Strong observability expertise using Prometheus, Grafana, and OpenTelemetry
  • Experience with Argo CD for GitOps-based deployments
  • Strong understanding of Cloudflare and AWS edge/networking services
  • Experience with Kafka/MSK and event-driven architectures
  • Strong scripting skills in Python and Bash, and fluency in Linux environments
  • Solid understanding of security practices including IAM, KMS, secrets management, and supply chain security
  • Experience with compliance and vulnerability scanning tools
  • Ability to work independently while collaborating effectively in high-ownership engineering teams

Benefits:
  • Competitive compensation package
  • 100% employer-covered employee health premiums
  • 75%–80% coverage for dependent health, dental, and vision plans
  • 401(k) retirement plan
  • Paid parental leave
  • Unlimited PTO policy
  • Fully remote work flexibility across the United States
  • Up to $200/month co-working space reimbursement
  • Home office setup stipend of up to $500
  • Monthly $100 stipend for internet, phone, and related expenses
  • Opportunity to work on cutting-edge AI-native infrastructure and agentic systems
  • High-autonomy engineering culture focused on ownership and innovation

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.