Software Platform Support Engineer - GPU Cloud at Jobgether – United States
About This Position
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Software Platform Support Engineer - GPU Cloud in United States.
This role sits at the intersection of cloud infrastructure, high-performance computing, and customer support, focusing on enabling seamless experiences across advanced GPU-based platforms. You will work closely with internal engineering, SRE, and product teams to support large-scale distributed systems powering AI and accelerated computing workloads. Acting as a technical bridge between users and platform engineering, you will troubleshoot complex issues, improve system reliability, and enhance operational processes in a fast-moving environment. The role requires deep curiosity about how systems work end-to-end, from compute and storage to networking layers. You will also contribute to building internal tools, documentation, and workflows that improve support efficiency and platform observability. This is a highly collaborative, high-impact position within a cutting-edge AI and GPU cloud ecosystem.
Key Responsibilities
- Provide Tier 1 support for complex GPU cloud platforms in collaboration with internal engineering, SRE, and infrastructure teams.
- Diagnose, investigate, and resolve customer and system issues while performing root cause analysis and escalating when needed.
- Partner with Site Reliability Engineering teams to file bugs, track incidents, and ensure timely resolution of platform issues.
- Develop and improve operational workflows, including runbooks, escalation paths, and support documentation.
- Build internal tooling and automation to enhance support efficiency, visibility, and issue resolution speed.
- Analyze user workloads and system behavior to better understand usage patterns and optimize platform performance.
- Collaborate with engineering teams to provide feedback, identify improvements, and contribute to platform enhancements.
- Participate in on-call rotations to support production systems and ensure platform reliability.
Requirements
- Bachelor’s or Master’s degree in Computer Science or a related field, or equivalent practical experience.
- 2+ years of experience supporting distributed systems, cloud platforms, or end-user software environments.
- Strong experience with Linux-based systems and troubleshooting complex infrastructure issues.
- Hands-on knowledge of cloud platforms such as AWS, Azure, GCP, or OCI.
- Understanding of infrastructure components including networking, storage, and DevOps tooling/scripting.
- Familiarity with data storage systems such as databases, file, block, and object storage.
- Strong troubleshooting skills with the ability to analyze issues across multiple system layers.
- Excellent communication skills and customer-focused mindset for supporting internal users.
- Ability to work across teams and adapt to different layers of the technology stack.
- Strong organizational skills and a continuous improvement mindset.
- Bonus: Experience with MLOps, GPU workloads, distributed training systems, or HPC environments (e.g., Slurm).
Benefits
- Competitive base salary ranging from $76,000 to $172,500 depending on level, experience, and location.
- Equity opportunities as part of the total compensation package.
- Comprehensive health, dental, and vision insurance coverage.
- Retirement savings plans and additional financial wellness benefits.
- Flexible work environment with opportunities for collaboration across global teams.
- Exposure to cutting-edge AI, GPU cloud, and high-performance computing technologies.
- Professional growth opportunities in a world-leading AI and accelerated computing organization.
- Inclusive and innovative work culture focused on impact, learning, and technical excellence.