Lead Platform/Site Reliability Engineer
IO TECH SOLUTIONS LIMITED
- Hong Kong
- Permanent
- Full-time
- System Reliability Leadership: Develop and execute strategies to achieve unparalleled service reliability and availability. You'll implement cutting-edge best practices, design resilient monitoring solutions, and conduct comprehensive failure injection and failover testing.
- Advanced Automation: Spearhead automation initiatives to streamline complex operational tasks, enhancing efficiency and reducing manual interventions. You'll advocate for treating "operations as a software problem" throughout the organization.
- Comprehensive Monitoring & Performance: Design and maintain advanced monitoring and alerting systems to assess system health, performance, and user experience. You'll conduct in-depth analysis of metrics and logs to proactively identify and resolve complex issues.
- Incident Management & Prevention: Lead during critical incidents, ensuring rapid resolution and clear communication. You'll conduct thorough post-mortem analyses, implement sustainable solutions, and share insights to prevent recurrence. Expect to participate in on-call rotations as a primary escalation point.
- Strategic Collaboration: Work closely with development and operations teams to embed reliability principles throughout the software development lifecycle. You'll provide expert guidance, promote SRE best practices, and foster a culture of shared ownership for system reliability.
- Capacity Planning & Optimization: Monitor and analyze system capacity and performance data, forecast future demands, and lead efforts to scale infrastructure efficiently to meet growth.
- Continuous Improvement & Innovation: Identify areas for systemic improvement in systems, tools, and processes. You'll lead the design and implementation of innovative solutions to enhance reliability, performance, and operational efficiency.
- Mentorship & Leadership: Provide technical leadership and mentorship to SREs and other team members, fostering growth and skill development. You'll also contribute to hiring and onboarding processes for new team members.
We're looking for a highly experienced and passionate SRE leader with:
- 12+ years of experience in Site Reliability Engineering, DevOps, or a related critical operations role, with a proven track record of leading significant reliability initiatives.
- A Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent extensive practical experience.
- Exceptional proficiency in scripting and programming languages (e.g., Python, Go, Java, Ruby, Bash) for developing advanced automation, tooling, and system integrations.
- Extensive hands-on experience with major cloud platforms (e.g., AWS, Google Cloud Platform, Azure) and deep expertise in containerization technologies (Docker, Kubernetes).
- Profound understanding of Linux/Unix systems internals, networking protocols, and distributed system architectures.
- Expertise in designing and managing CI/CD pipelines and robust version control systems (e.g., Git), advocating for GitOps principles.
- Mastery of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, OpenTelemetry).
- Superior problem-solving skills, critical thinking, and meticulous attention to detail, especially under pressure.
- Outstanding communication, interpersonal, and collaboration skills, with the ability to influence and lead cross-functional teams.
- Proven ability to thrive and lead in a fast-paced, highly dynamic, and complex technical environment.
- Expert-level debugging and root cause analysis capabilities across complex distributed systems.
- Extensive experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible, Pulumi).
- Deep knowledge of various database systems (relational and NoSQL) and advanced data management strategies.
- Significant experience designing, implementing, and operating microservices architectures.
- Contributions to open-source projects related to SRE, operations, or cloud-native technologies.
This role offers a unique opportunity to make a significant impact on our core services and directly influence our engineering culture around reliability.