SysOps Engineer – Monitoring & Cloud Operations in India at Jobgether
Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a SysOps Engineer – Monitoring & Cloud Operations in India.
This role sits at the core of mission-critical infrastructure operations, ensuring the stability, performance, and resilience of large-scale cloud and hybrid systems. You will be responsible for continuously monitoring production environments, identifying and resolving incidents, and maintaining high availability across distributed services. Working within a fast-paced engineering organization, you will collaborate closely with cloud, DevOps, and DataOps teams to safeguard system health and optimize performance. The environment is highly production-driven, requiring strong operational discipline, rapid troubleshooting skills, and a proactive mindset toward risk prevention. You will play a key role in designing and maintaining observability frameworks, ensuring that alerts, dashboards, and monitoring tools provide actionable insights. This is a high-impact position where your work directly supports system uptime, service reliability, and business continuity.
Responsibilities
- Monitor infrastructure and production systems using observability tools such as New Relic, Prometheus, Grafana, or similar platforms, ensuring full visibility into system health.
- Configure and maintain alerts, dashboards, and service-level monitoring to proactively detect anomalies and prevent incidents.
- Lead incident management activities including troubleshooting, root cause analysis (RCA), and post-incident reporting.
- Ensure system uptime, performance, and SLA compliance across cloud and on-premises environments.
- Manage operating system-level tasks (Linux and Windows), including patching, tuning, and service management.
- Oversee backup processes and regularly validate restoration procedures to ensure data reliability.
- Execute and support disaster recovery (DR) plans, including failover/failback testing and DR drills across environments.
- Collaborate with DataOps and infrastructure teams to ensure replication integrity, system resilience, and business continuity readiness.
- Perform capacity planning, performance optimization, and infrastructure health assessments.
- Maintain operational documentation, including runbooks, monitoring guidelines, and incident playbooks.
Requirements
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
- Proven experience in SysOps, Cloud Operations, SRE, or Infrastructure Support roles in production environments.
- Strong hands-on experience with Linux and Windows system administration.
- Experience using monitoring and observability tools such as New Relic, Prometheus, Grafana, Datadog, or equivalent solutions.
- Solid understanding of incident management, problem management, and root cause analysis methodologies.
- Experience working with cloud platforms such as AWS, Azure, or Google Cloud Platform.
- Strong knowledge of disaster recovery, backup strategies, and business continuity planning.
- Familiarity with infrastructure components such as virtual machines, compute instances, and physical servers.
- Understanding of web and system services such as Nginx, IIS, and systemd.
- Strong analytical and troubleshooting skills with the ability to resolve complex production issues under pressure.
- Excellent communication and collaboration skills for cross-functional coordination.
- Experience in high-availability, mission-critical environments is highly preferred.
Benefits
- Competitive compensation package aligned with experience and market standards.
- Fully remote work environment with flexible arrangements.
- Opportunity to work on large-scale, mission-critical infrastructure systems.
- Exposure to modern cloud technologies and advanced observability platforms.
- Professional growth in a fast-paced, high-impact engineering organization.
- Collaborative and cross-functional team culture.
- Involvement in disaster recovery planning, system resilience design, and cloud operations at scale.