Senior Site Reliability Engineer (AI-Native) in St Julian's at Paradise Media LLC
Job Description
Paradise Media is a fast-growing performance marketing company behind some of the most successful affiliate and iGaming brands in the world. We run a global network of high-authority sites across casino, sports, and entertainment, built on data, experimentation, and top-tier SEO.
We're a private company with strong capital reserves and no outside investors, making us a stable, independent, and fast-moving place to grow your career. You'll work directly with the CEO and leadership team, have a real voice in strategy, and see your ideas go live fast.
We're scaling quickly to become one of the largest privately owned companies in iGaming, with a team where smart, driven people can have a massive impact and build something enduring.
About the role
We are seeking a Senior AI-Native Site Reliability Engineer to lead the reliability, performance, security, automation, and operational maturity of a growing portfolio of high-performing web platforms and digital products.
This role is ideal for a pragmatic senior reliability engineer who can operate and improve production systems, automate repetitive work, use AI safely to accelerate operations, understand performance and security deeply, and communicate clearly during incidents.
You will combine senior-level SRE, DevOps, infrastructure, security, and platform engineering expertise with a modern AI-first approach to operations. You will be expected not only to maintain systems, but to improve how they are designed, monitored, deployed, secured, and operated.
The successful candidate will be comfortable owning critical production environments across varied technology stacks, leading incident response, improving platform resilience, mentoring others, reducing operational toil through automation, and using AI tools responsibly to accelerate analysis, documentation, monitoring, debugging, remediation, and continuous improvement.
Roles & Responsibilities:
Reliability & Operational Ownership
- Own uptime, performance, scalability, and resilience of production web platforms and supporting infrastructure.
- Define and improve SLIs, SLOs, error budgets, HA, fault tolerance, DR, and graceful degradation.
- Lead capacity planning, identify single points of failure, and act as senior technical owner during high-severity incidents.
Performance Engineering & Scalability
- Lead optimization across application, infrastructure, database, caching, CDN, and edge layers (Redis, Varnish, Cloudflare or similar).
- Establish benchmarks, regression checks, and dashboards; reduce technical bloat across code, dependencies, assets, and infrastructure.
- Align performance work with SEO, product, and commercial impact.
AI-Native Operations & Automation
- Lead safe, practical AI-assisted workflows for log analysis, incident investigation, runbook creation, monitoring, security triage, and postmortems.
- Automate repetitive ops via scripts, IaC, and AI-assisted tooling; build anomaly detection, alert triage, and operational reporting workflows.
- Create reusable prompts, playbooks, and templates; define guardrails for data sensitivity, access control, human approval, and auditability.
Monitoring, Observability & Incident Management
- Own monitoring/alerting across apps, infra, databases, caches, queues, CDNs, cloud services, and critical user journeys.
- Design actionable dashboards and alerts that reduce noise and improve MTTD/MTTR.
- Lead incident response, RCA, postmortems, and preventive actions; mentor on troubleshooting and calm communication under pressure.
Security, Resilience & Platform Hardening
- Own production security posture: WAF, SSL, vulnerability management, malware/bot mitigation, threat detection, and remediation.
- Harden servers, databases, cloud, containers, CI/CD, secrets, and production access; manage secure dependency and patching processes.
- Maintain backup, recovery, and DR practices; contribute to security incident response, containment, and prevention.
Infrastructure, Cloud & Platform Engineering
- Design and operate hosting/runtime environments across varied stacks (web/app servers, databases, caches, queues, containers, cloud).
- Automate backups, updates, deployments, provisioning, and health checks using Ansible, Terraform, Docker, Kubernetes, Jenkins, GitHub Actions, or similar.
- Support AWS, GCP, Azure, or modern managed hosting; set infrastructure standards balancing reliability, security, performance, and cost.
DevOps, Release Engineering & Developer Enablement
- Lead CI/CD design, staging environments, rollback strategies, progressive delivery, and deployment observability.
- Partner with developers to embed reliability, performance, and security into the SDLC; build tooling and runbooks for safer shipping.
Documentation, Collaboration & Technical Leadership
- Maintain runbooks, troubleshooting guides, architecture notes, and operational playbooks (AI-assisted where useful, technically validated).
- Act as senior technical partner to engineering, product, SEO, and business stakeholders; mentor engineers and shape ops standards.
Requirements:
Preferred Experience
- 6+ years in SRE, DevOps, Infrastructure, Platform, or Security Engineering.
- Operating high-traffic web platforms, SaaS, SEO/content-heavy, affiliate, publishing, media, or e-commerce environments.
- Cloudflare, edge caching, WAF, CDN optimization, bot mitigation; AI-assisted ops or agentic engineering workflows.
- Leading high-severity incident response; defining SLOs, postmortems, runbooks; FinOps / cloud cost optimization.
- Certifications in AWS, GCP, Azure, Linux, Kubernetes, or security are a plus.
Required
- Senior-level experience in SRE, DevOps, infrastructure, platform engineering, or production operations.
- Proficiency in Python, Bash, PHP, JavaScript/TypeScript, Go, or similar; strong Linux server administration.
- Experience with web/app servers, databases, caches, queues, CDNs, cloud (AWS/GCP/Azure), and production traffic flows.
- Strong Git, CI/CD, deployment automation, rollback, and release management; solid DNS, SSL, networking, and load balancing fundamentals.
- Proven ability to troubleshoot complex production issues using logs, metrics, traces, and profiling, and to own systems without close supervision.
AI-Native Skills
- Practical use of AI for debugging, documentation, scripting, analysis, and workflow automation, with strong judgment on validation.
- Ability to design safe, human-in-the-loop AI workflows and reusable prompts/playbooks; sound judgment on privacy, access, and data sensitivity.
Performance & Observability
- Hands-on with Datadog, New Relic, Grafana, Prometheus, Cloudflare Analytics, OpenTelemetry, Lighthouse, WebPageTest, or similar.
- Strong grasp of caching, DB tuning, asset optimization, front-end and back-end performance, edge delivery, and SLOs/SLIs.
Security & Resilience
- Production security practices: access control, WAF, vulnerability management, secrets, patching, incident response.
- Backup strategy, recovery testing, DR planning; bot mitigation, dependency risk, malware detection, threat monitoring.
Automation & DevOps
- Ansible, Terraform, Jenkins, Docker, Kubernetes, GitHub Actions; IaC, containerization, orchestration, configuration management.
Communication & Leadership
- Calm incident leadership; clear technical communication to technical and non-technical stakeholders; mentoring and knowledge sharing.
Success in This Role Looks Like
- Platforms are faster, more reliable, and more secure; monitoring is actionable and incidents are managed calmly with meaningful follow-up.
- Manual work shrinks through automation and AI-assisted workflows; developers ship more safely; risks are caught before they impact the business.
Our Benefits:
We offer a competitive salary and the opportunity to work with a talented and passionate team in a fast-paced, dynamic environment.