Lead Platform/Site Reliability Engineer

IO TECH SOLUTIONS LIMITED

Hong Kong
Permanent
Full-time


What You'll Do:

As a Lead SRE, you'll be instrumental in shaping our systems' future. Your responsibilities will include:

System Reliability Leadership: Develop and execute strategies to achieve unparalleled service reliability and availability. You'll implement cutting-edge best practices, design resilient monitoring solutions, and conduct comprehensive failure injection and failover testing.
Advanced Automation: Spearhead automation initiatives to streamline complex operational tasks, enhancing efficiency and reducing manual interventions. You'll advocate for treating "operations as a software problem" throughout the organization.
Comprehensive Monitoring & Performance: Design and maintain advanced monitoring and alerting systems to assess system health, performance, and user experience. You'll conduct in-depth analysis of metrics and logs to proactively identify and resolve complex issues.
Incident Management & Prevention: Lead during critical incidents, ensuring rapid resolution and clear communication. You'll conduct thorough post-mortem analyses, implement sustainable solutions, and share insights to prevent recurrence. Expect to participate in on-call rotations as a primary escalation point.
Strategic Collaboration: Work closely with development and operations teams to embed reliability principles throughout the software development lifecycle. You'll provide expert guidance, promote SRE best practices, and foster a culture of shared ownership for system reliability.
Capacity Planning & Optimization: Monitor and analyze system capacity and performance data, forecast future demands, and lead efforts to scale infrastructure efficiently to meet growth.
Continuous Improvement & Innovation: Identify areas for systemic improvement in systems, tools, and processes. You'll lead the design and implementation of innovative solutions to enhance reliability, performance, and operational efficiency.
Mentorship & Leadership: Provide technical leadership and mentorship to SREs and other team members, fostering growth and skill development. You'll also contribute to hiring and onboarding processes for new team members.

What You'll Bring:

We're looking for a highly experienced and passionate SRE leader with:
12+ years of experience in Site Reliability Engineering, DevOps, or a related critical operations role, with a proven track record of leading significant reliability initiatives.
A Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent extensive practical experience.
Exceptional proficiency in scripting and programming languages (e.g., Python, Go, Java, Ruby, Bash) for developing advanced automation, tooling, and system integrations.
Extensive hands-on experience with major cloud platforms (e.g., AWS, Google Cloud Platform, Azure) and deep expertise in containerization technologies (Docker, Kubernetes).
Profound understanding of Linux/Unix systems internals, networking protocols, and distributed system architectures.
Expertise in designing and managing CI/CD pipelines and robust version control systems (e.g., Git), advocating for GitOps principles.
Mastery of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, OpenTelemetry).
Superior problem-solving skills, critical thinking, and meticulous attention to detail, especially under pressure.
Outstanding communication, interpersonal, and collaboration skills, with the ability to influence and lead cross-functional teams.
Proven ability to thrive and lead in a fast-paced, highly dynamic, and complex technical environment.
Expert-level debugging and root cause analysis capabilities across complex distributed systems.

Bonus Points For:

Extensive experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible, Pulumi).
Deep knowledge of various database systems (relational and NoSQL) and advanced data management strategies.
Significant experience designing, implementing, and operating microservices architectures.
Contributions to open-source projects related to SRE, operations, or cloud-native technologies.

This role offers a unique opportunity to make a significant impact on our core services and directly influence our engineering culture around reliability.
