Can I apply directly for this job on this page?

Yes, you can begin your application on this page using a quick form. You'll then be redirected to the employer's career site to complete the full application process.

What is the role of a Model Serving Engineer at Jobgether?

The Model Serving Engineer position at Jobgether is a full-time or part-time opportunity in the Design field.

Where is this Model Serving Engineer job located?

United States (fully remote)

What type of employment is offered for this Model Serving Engineer role?

Full-time or part-time position

What is the expected salary for this Model Serving Engineer job?

Compensation will be discussed during the hiring process.

Model Serving Engineer

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Model Serving Engineer in the United States.

This role focuses on building and operating high-performance, production-grade inference systems that power large-scale machine learning applications. You will design and optimize the infrastructure that serves models such as LLMs, vision systems, and recommendation engines, ensuring low latency, high throughput, and efficient GPU utilization. The position involves deep systems engineering work across request routing, autoscaling, caching, and observability to support reliable and scalable AI services. You will collaborate closely with ML engineers and product teams to ensure seamless deployment of new models and capabilities. The environment is highly technical and performance-driven, requiring strong expertise in distributed systems and real-time service optimization. This is a critical role for engineers passionate about making advanced AI models reliable, efficient, and production-ready at scale.

Accountabilities:

Design, build, and operate scalable model serving infrastructure for LLMs, vision models, and recommendation systems.
Optimize inference performance using techniques such as continuous batching, caching, request multiplexing, and GPU memory optimization.
Implement routing, rate limiting, and multi-tenant service policies to ensure reliability and fair resource usage across endpoints.
Develop autoscaling, capacity planning, and load balancing systems to maintain performance under varying workloads.
Build end-to-end observability systems, including metrics, logging, tracing, and performance monitoring for AI services.
Collaborate with ML and product teams to support model deployment, rollout strategies, and production integration.
Implement security, abuse detection, and API governance controls across model serving infrastructure.
Support incident response, debugging, and continuous reliability improvements for production AI systems.

Requirements:

Bachelor’s or Master’s degree in Computer Science or a related technical field.
6+ years of experience in distributed systems, infrastructure engineering, or ML platform engineering.
Strong proficiency in Python and a systems programming language such as Go, Rust, or C++.
Hands-on experience with large-scale model inference frameworks (e.g., vLLM, TensorRT-LLM, or similar).
Strong understanding of GPU architecture, memory management, and performance optimization techniques.
Experience with Kubernetes, cloud infrastructure, and autoscaling systems.
Expertise in observability tooling for metrics, logging, and distributed tracing.
Strong background in performance engineering, low-latency systems, and capacity planning.
Excellent communication, incident response, and cross-functional collaboration skills.
Experience with AI serving optimization techniques such as quantization, caching, or distributed inference is a plus.

Benefits:

Competitive W2 compensation aligned with experience and technical expertise.
Fully remote, long-term position within the United States.
Comprehensive benefits package including medical, dental, and vision coverage.
401(k) retirement savings plan and financial wellness support.
Paid time off, holidays, and structured work-life balance.
Opportunity to work on cutting-edge AI inference systems and large-scale production platforms.
Strong opportunities for technical growth in distributed systems, GPU computing, and AI infrastructure engineering.

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
