Can I apply directly for this job on this page?

Yes, you can begin your application on this page using a quick form. You'll then be redirected to the employer's career site to complete the full application process.

What is the role of a Senior Site Reliability Engineer at Jobgether?

The Senior Site Reliability Engineer position at Jobgether is a full-time or part-time role focused on production reliability and cloud operations.

Where is this Senior Site Reliability Engineer job located?

United States, Other / Non-US

What type of employment is offered for this Senior Site Reliability Engineer role?

Full-time or part-time position

What industry does this Senior Site Reliability Engineer position belong to?

This role spans multiple industries.

What is the expected salary for this Senior Site Reliability Engineer job?

Compensation will be discussed during the hiring process.

Senior Site Reliability Engineer

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in the United States.

This role sits at the heart of large-scale production reliability and cloud operations within a highly regulated financial services environment. You will be responsible for ensuring the stability, observability, and performance of mission-critical systems that power modern banking and payments platforms. The position blends deep hands-on engineering with operational excellence, focusing on reducing noise, improving signal quality, and strengthening incident response practices. You will work closely with cross-functional engineering and operations teams to design resilient alerting frameworks, refine production monitoring strategies, and continuously improve system reliability. This is a high-impact role for an engineer who thrives in complex AWS environments and enjoys turning operational chaos into structured, scalable processes.

Accountabilities:

Own and improve production reliability across large-scale distributed systems, ensuring high availability and performance in critical financial infrastructure environments.
Design, refine, and maintain observability and monitoring systems using tools such as Splunk, Datadog, and ServiceNow, focusing on actionable insights rather than alert noise.
Reduce alert fatigue by analyzing existing monitoring signals, eliminating false positives, and improving severity classification frameworks and escalation paths.
Develop and maintain incident response playbooks, ensuring clear operational procedures for troubleshooting, mitigation, and post-incident review.
Lead efforts to troubleshoot complex production issues in AWS-based environments, ensuring rapid identification and resolution of system failures.
Collaborate with engineering, infrastructure, and product teams to improve system reliability, scalability, and operational efficiency.
Continuously enhance operational maturity by introducing automation, observability improvements, and best practices for production support.

Requirements:

This role requires a strong background in site reliability engineering, production support, and cloud infrastructure, with a focus on high-scale, regulated environments. The ideal candidate brings extensive hands-on experience with AWS, observability tools, and production incident management, along with a proven ability to reduce operational noise and improve system signal quality. Strong analytical and communication skills are essential, as this role requires collaboration across technical and non-technical stakeholders.

Extensive experience in Site Reliability Engineering, production support, or infrastructure engineering roles.
Strong expertise in AWS services and cloud-native architectures.
Proven experience with observability tools such as Splunk, Datadog, or similar monitoring platforms.
Demonstrated ability to reduce alert noise, improve signal-to-noise ratios, and design effective alerting strategies.
Experience building incident response playbooks, severity frameworks, and operational runbooks.
Strong troubleshooting skills in complex distributed systems and production environments.
Experience working in regulated industries such as financial services, banking, or payments is highly preferred.
Excellent communication skills with the ability to coordinate across engineering and operations teams.

Benefits:

Competitive compensation package
Flexible work arrangements with remote options
Professional development and continuous learning opportunities
Exposure to large-scale financial systems and modern cloud infrastructure
Collaborative, engineering-driven culture focused on innovation
Supportive environment encouraging ownership and autonomy
Tools and resources to support operational excellence and career growth

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
