Staff MLOps Learning Engineer at Ellkay LLC

Job Function: Human Resources
Ellkay LLC
India

Job Description

ELLKAY started out providing connectivity solutions to laboratories and, within a few years, grew to also provide data management solutions to ambulatory organizations. ELLKAY is now a trusted data management partner in five healthcare segments. ELLKAY’s solutions continue to serve laboratories and ambulatory practices and have expanded to empower hospitals and health systems, healthcare IT vendors, health plans, and other healthcare organizations with cutting-edge technologies and solutions that drive their growth and interoperability strategies.

Today, ELLKAY remains true to our core values, building strong partner relationships and offering unparalleled service and support while providing innovative, scalable solutions to the challenges our customers face in today’s data-rich world.

ELLKAY's experience, customer-focused approach, and reputation for innovation, speed, and accuracy differentiate ELLKAY as a premier partner for your interoperability needs and data management strategy.

Job Description:

ELLKAY is looking for an MLOps/GenAIOps Engineer to own the end-to-end machine learning operations lifecycle. This role focuses on automating model training, evaluation, deployment, and monitoring workflows while partnering closely with our Full Stack AI/ML Engineers to ensure seamless production deployment of foundation models and classical ML systems. You'll build the infrastructure that enables rapid, safe iteration on AI/ML solutions that impact patient care.

Essential Duties & Responsibilities:

AWS SageMaker Operations & Orchestration

• Design and implement SageMaker Pipelines with complex DAG authoring for end-to-end ML workflows including data preprocessing, training, evaluation, and deployment

• Configure and optimize SageMaker Processing Jobs, Training Jobs (PyTorch/HuggingFace containers), and Managed Spot Instances for cost-effective model training

• Manage SageMaker Model Registry with versioning, lineage tracking, and approval workflows for model governance

• Implement sophisticated endpoint deployment strategies including blue/green traffic routing, canary deployments, and A/B testing configurations

• Configure and maintain SageMaker Model Monitor for continuous data quality monitoring, model quality assessment, bias detection, and feature drift alerting

GenAI Model Lifecycle Management

• Build operational frameworks for Amazon Bedrock model deployments including provisioned throughput management, guardrail configuration, and usage monitoring

• Implement automated evaluation pipelines for foundation model outputs with quality gates and human-in-the-loop review workflows

• Design prompt versioning systems integrated with model registry for reproducible GenAI deployments

• Monitor GenAI model performance including latency, cost per inference, hallucination rates, and guardrail trigger metrics

Model Lifecycle Automation

• Develop automated retraining trigger systems based on data drift, performance degradation, scheduled intervals, or manual triggers

• Implement evaluation gates as code with configurable thresholds for metrics (F1, precision, recall, AUC, etc.) before model promotion

• Build automated model promotion workflows from dev → staging → prod with approval gates and rollback capabilities

• Design rollback automation with traffic shifting strategies to quickly revert to previous model versions upon performance degradation
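
As an illustration of what "evaluation gates as code" could mean in practice, here is a minimal Python sketch. It is not ELLKAY's implementation: the function name `evaluate_gate`, the `GateResult` type, and the threshold values are all hypothetical, chosen only to show a configurable check that blocks promotion when any metric falls below its minimum.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateResult:
    """Outcome of a promotion gate (hypothetical type for this sketch)."""
    passed: bool
    failures: List[str]


def evaluate_gate(metrics: Dict[str, float], thresholds: Dict[str, float]) -> GateResult:
    """Compare each required metric against its configured minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that was never reported fails the gate outright.
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return GateResult(passed=not failures, failures=failures)


# Illustrative thresholds only; real values would come from versioned config.
THRESHOLDS = {"f1": 0.85, "precision": 0.80, "recall": 0.80, "auc": 0.90}
```

A pipeline step might call `evaluate_gate(candidate_metrics, THRESHOLDS)` and proceed to the staging deployment only when `passed` is true, logging `failures` otherwise.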

Training Data Engineering

• Build AWS Glue ETL pipelines for training corpus assembly, transformation, and feature engineering at scale

• Design dataset versioning systems using S3 + manifest files ensuring reproducibility and lineage tracking across model training runs

• Establish reproducibility patterns including seed management, environment pinning, and deterministic data splitting
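
One common way to get deterministic data splitting is to hash a stable record identifier instead of shuffling with a random generator. The sketch below assumes each record has a string ID; the function name `assign_split`, the `seed` label, and the 80/10/10 proportions are hypothetical choices for illustration, not a description of ELLKAY's pipelines.

```python
import hashlib


def assign_split(record_id: str, seed: str = "v1",
                 val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a record ID to train/validation/test in a repeatable way."""
    # Hashing seed + ID gives the same bucket on every run and every machine,
    # regardless of record order or dataset size.
    digest = hashlib.sha256(f"{seed}:{record_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Because the assignment depends only on the seed and the ID, adding new records never moves existing ones between splits, which helps keep evaluation sets comparable across training runs.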

Container Engineering & Registry Management

• Build optimized Docker containers for training and inference workloads with multi-stage builds, layer caching, and security scanning

• Manage Amazon ECR repositories with lifecycle policies, image scanning, and vulnerability remediation workflows

• Implement bring-your-own-container (BYOC) patterns for SageMaker supporting custom frameworks and dependencies

• Optimize container startup times and resource utilization for cost-effective inference

ML Observability & Monitoring

• Design comprehensive CloudWatch dashboards for model health including accuracy, latency, throughput, error rates, and drift metrics

• Implement custom CloudWatch metrics for business-specific KPIs (e.g., clinical accuracy per condition, false positive rates for critical alerts)

• Build alerting systems for accuracy regression, data drift, concept drift, and model staleness with appropriate escalation paths

• Create observability frameworks that integrate with AWS X-Ray for end-to-end request tracing from API call through model inference

CI/CD for ML Artifacts

• Design and implement GitHub Actions or AWS CodePipeline workflows for automated ML artifact testing, validation, and deployment

• Build multi-stage promotion pipelines (dev → staging → prod) with automated testing gates and manual approval checkpoints

• Implement artifact versioning and lineage tracking for models, datasets, feature transformations, and deployment configurations

• Create integration testing frameworks for ML APIs including performance benchmarking and regression testing

Infrastructure & Compliance

• Collaborate with Full Stack AI/ML Engineers to define infrastructure requirements and translate them into production-grade AWS CDK constructs

• Implement HIPAA-compliant ML pipelines with appropriate encryption, access controls, audit logging, and data residency requirements

• Design multi-tenant ML infrastructure with tenant isolation, resource quotas, and cost allocation

• Ensure reproducibility and auditability of all ML experiments and deployments for regulatory compliance

Qualifications:

Technical Skills

• AWS SageMaker Expertise: Deep hands-on experience with SageMaker Pipelines, Processing/Training Jobs, Model Registry, Endpoints, Model Monitor, and Managed Spot training

• AWS Bedrock & GenAI Operations: Production experience managing Amazon Bedrock deployments including Claude family models, Titan Embeddings v2, and Bedrock Guardrails (PHI/PII detection); expertise in provisioned throughput vs on-demand capacity management, model evaluation automation, prompt versioning and registry integration, inference monitoring (latency, cost, token usage), and guardrail performance tracking; ability to build automated quality gates for foundation model outputs

• AWS ML Services: Proficiency with AWS Glue ETL, Bedrock, S3, ECR, CloudWatch, and Step Functions

• Container Technologies: Strong Docker skills including multi-stage builds, optimization techniques, and container orchestration

• Python Programming: Proficient in Python for pipeline orchestration, data processing, and automation scripts (boto3, sagemaker SDK, pandas/polars)

• CI/CD Tools: Experience with GitHub Actions, AWS CodePipeline, or similar CI/CD platforms for ML workflows

• Infrastructure as Code: Familiarity with AWS CDK, CloudFormation, or Terraform for ML infrastructure provisioning

• ML Fundamentals: Solid understanding of ML training concepts, evaluation metrics, hyperparameter tuning, and model validation techniques

• Observability: Experience building custom metrics, dashboards, and alerting systems for production ML systems

Experience

• 8+ years of software/data engineering experience with at least 3 years focused on MLOps or production ML systems

• Proven track record of building automated ML pipelines and deployment workflows at scale

• Experience with model monitoring, drift detection, and automated retraining systems

• Healthcare or regulated industry experience strongly preferred

• Demonstrated ability to operationalize both classical ML and foundation model systems

Soft Skills

• Strong collaboration skills to work closely with Full Stack AI/ML Engineers, data scientists, and platform teams

• Systems thinking with the ability to design reliable, fault-tolerant ML infrastructure

• Excellent troubleshooting and debugging skills for complex distributed systems

• Commitment to automation, reproducibility, and engineering excellence

• Strong documentation skills for runbooks, architecture decisions, and operational procedures

Preferred Qualifications

• Experience with Amazon Bedrock operational patterns and GenAI model monitoring

• Familiarity with PyTorch, HuggingFace Transformers, and deep learning training workflows

• Knowledge of FHIR, HL7 v2, and healthcare data standards

• Experience with cost optimization strategies for ML workloads (Spot instances, autoscaling, model optimization)

• Contributions to open-source MLOps tools or frameworks

• AWS certifications (Machine Learning Specialty, Solutions Architect, or DevOps Engineer)

Additional Information:

ELLKAY is committed to fostering a collaborative and high-performance work environment that supports innovation, teamwork, and professional growth. Most roles are designed to operate from our office locations to encourage effective collaboration and engagement across teams.
Any alternative work arrangements may be considered at the company’s discretion based on role requirements and business needs.
For more information about our company, please visit www.ELLKAY.com.
ELLKAY is a Smoke-Free Workplace.

ELLKAY, LLC provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.

Job Location

India
