Senior Infrastructure Engineer in United States at Jobgether
Job Description
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Infrastructure Engineer based in the United States.
This is a high-impact infrastructure engineering role focused on building and operating the systems that power mission-critical financial and AI-driven products used at massive scale. You will own the reliability, scalability, and performance of cloud infrastructure that must seamlessly handle extreme seasonal traffic spikes, particularly during peak tax periods. The role combines deep platform engineering with strong operational ownership, ensuring that systems remain resilient under heavy load while enabling rapid product iteration. You will design and optimize cloud-native infrastructure, CI/CD pipelines, and observability systems that directly support user-facing experiences. Working closely with backend and AI teams, you will also help establish the foundational infrastructure for next-generation machine learning features. This is a hands-on role in a fast-paced, distributed environment where infrastructure decisions have immediate impact on millions of users.
Responsibilities
- Operate, scale, and optimize cloud infrastructure supporting high-traffic consumer applications, ensuring reliability during extreme seasonal demand spikes.
- Manage and evolve Kubernetes-based environments, including autoscaling strategies, workload orchestration, and cluster performance tuning.
- Design and maintain database infrastructure at scale, including PostgreSQL/AlloyDB systems, read replicas, connection pooling, and performance optimization.
- Build and improve CI/CD pipelines using GitOps principles to enable fast, safe, and zero-downtime deployments.
- Implement and maintain infrastructure-as-code practices using Terraform, ensuring consistency, reliability, and reproducibility across environments.
- Develop and maintain observability systems, including monitoring, alerting, distributed tracing, and SLO/SLI frameworks.
- Secure infrastructure using modern cloud security practices such as IAM controls, secret management, WAF configurations, and network security tooling.
- Partner with engineering and AI teams to provision and support infrastructure for experimentation and production ML workloads.
Requirements
- 5+ years of experience in infrastructure, platform engineering, or site reliability engineering roles.
- Strong hands-on experience with Google Cloud Platform services, including GKE, Cloud SQL/AlloyDB, Pub/Sub, GCS, IAM, and Secret Manager.
- Deep expertise in Kubernetes, including cluster management, autoscaling, workload identity, and production operations.
- Strong proficiency in Terraform, including module design, state management, and infrastructure-as-code best practices.
- Experience operating high-traffic, production-grade systems with strict uptime and performance requirements.
- Strong database operations experience, particularly with PostgreSQL and scaling strategies such as read replicas and connection pooling (e.g., PgBouncer).
- Experience designing CI/CD pipelines with zero-downtime deployments and fast rollback strategies.
- Strong understanding of observability systems, including metrics, logging, tracing, and alerting strategies.
- Ability to diagnose complex system issues under pressure and drive long-term reliability improvements.
- Strong collaboration skills with engineering, product, and data teams in fast-paced environments.
- High ownership mindset with the ability to independently drive infrastructure initiatives from design to production.
Benefits
- Competitive compensation including base salary and equity (USD $165,750–$195,000 range)
- Comprehensive medical, dental, and vision insurance coverage
- 401(k) retirement plan and financial benefits
- Flexible paid time off and supportive remote-first work culture
- Equity participation in a high-growth, venture-backed organization
- Strong engineering culture focused on ownership, scalability, and reliability
- Opportunity to build infrastructure supporting millions of users and AI-driven products
- Exposure to cutting-edge cloud, data, and ML infrastructure systems