Senior Site Reliability Engineer in United States at Jobgether

Jobgether
United States

Job Description

Senior Site Reliability Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in the United States.

This role sits at the heart of large-scale production reliability and cloud operations within a highly regulated financial services environment. You will be responsible for ensuring the stability, observability, and performance of mission-critical systems that power modern banking and payments platforms. The position blends deep hands-on engineering with operational excellence, focusing on reducing noise, improving signal quality, and strengthening incident response practices. You will work closely with cross-functional engineering and operations teams to design resilient alerting frameworks, refine production monitoring strategies, and continuously improve system reliability. This is a high-impact role for an engineer who thrives in complex AWS environments and enjoys turning operational chaos into structured, scalable processes.

Accountabilities:
  • Own and improve production reliability across large-scale distributed systems, ensuring high availability and performance in critical financial infrastructure environments.
  • Design, refine, and maintain observability and monitoring systems using tools such as Splunk, Datadog, and ServiceNow, focusing on actionable insights rather than alert noise.
  • Reduce alert fatigue by analyzing existing monitoring signals, eliminating false positives, and improving severity classification frameworks and escalation paths.
  • Develop and maintain incident response playbooks, ensuring clear operational procedures for troubleshooting, mitigation, and post-incident review.
  • Lead efforts to troubleshoot complex production issues in AWS-based environments, ensuring rapid identification and resolution of system failures.
  • Collaborate with engineering, infrastructure, and product teams to improve system reliability, scalability, and operational efficiency.
  • Continuously enhance operational maturity by introducing automation, observability improvements, and best practices for production support.

Requirements:

This role requires a strong background in site reliability engineering, production support, and cloud infrastructure, with a focus on high-scale, regulated environments. The ideal candidate brings extensive hands-on experience with AWS, observability tools, and production incident management, along with a proven ability to reduce operational noise and improve system signal quality. Strong analytical and communication skills are essential, as this role requires collaboration across technical and non-technical stakeholders.

  • Extensive experience in Site Reliability Engineering, production support, or infrastructure engineering roles.
  • Strong expertise in AWS services and cloud-native architectures.
  • Proven experience with observability tools such as Splunk, Datadog, or similar monitoring platforms.
  • Demonstrated ability to reduce alert noise, improve signal-to-noise ratios, and design effective alerting strategies.
  • Experience building incident response playbooks, severity frameworks, and operational runbooks.
  • Strong troubleshooting skills in complex distributed systems and production environments.
  • Experience working in regulated industries such as financial services, banking, or payments is highly preferred.
  • Excellent communication skills with the ability to coordinate across engineering and operations teams.

Benefits:
  • Competitive compensation package
  • Flexible work arrangements with remote options
  • Professional development and continuous learning opportunities
  • Exposure to large-scale financial systems and modern cloud infrastructure
  • Collaborative, engineering-driven culture focused on innovation
  • Supportive environment encouraging ownership and autonomy
  • Tools and resources to support operational excellence and career growth

How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

Job Location

United States