
Site Reliability Engineer III, Margin Services
- Hong Kong
- Permanent
- Full-time
As a Site Reliability Engineer III team member in Securities Services Technology, you will ensure the operational stability, availability, and performance of our production application flows. Encourage a culture of continuous improvement as you troubleshoot, maintain, identify, escalate, and resolve production service interruptions for all internally and externally developed systems, leading to a seamless user experience.
Job responsibilities
- Provide business-facing technology support to business and operations groups across the Asia Pacific region.
- Work within a follow-the-sun support model with global counterparts.
- Manage production technology incidents to resolution, ensuring timely engagement, escalation and effective communication to business, technology, and vendor partners.
- Perform post-incident analysis, identifying, tracking, and implementing preventative measures.
- Act as Subject Matter Expert (SME) for key applications, responsible for maintaining global best practice and hygiene standards.
- Assist in the monitoring of production environments for anomalies and address issues utilizing standard observability tools.
- Act as a key contributor in the continued development of tools, frameworks & techniques to improve productivity and quality of the production support, adopting SRE principles to manage and support the environment.
- Analyze complex situations and trends to anticipate and solve incident, problem, and change management issues in support of full stack technology systems, applications, or infrastructure.
Qualifications and skills
- 6+ years of experience in technology for the financial and banking sector, with expertise in troubleshooting, resolving, and maintaining information technology services.
- Bachelor's degree in Engineering, Computer Science, or Information Technology.
- Proven track record in Production Support and Site Reliability Engineering (SRE), with a clear understanding of SRE protocols and methodologies.
- Familiar with observability, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others.
- Support management skills: designing and using monitoring dashboards for day-to-day support, generating service KPIs, reporting on service stability and performance, and monitoring logs.
- Proven track record of running Incident and Problem Management calls for business-impacting outages, performing post-incident analysis, and identifying and implementing preventative measures and lessons learned following outages.
- Excellent interpersonal and communication skills, along with strong analytical and problem-solving skills.
- Able to drive issue resolution across different support teams.
- Experience in debugging and maintaining applications in a large corporate environment with one or more modern programming languages and database querying languages.
- Passion for learning new technologies and driving innovative solutions.
- Experience with Kubernetes for container orchestration.
- Scripting languages (Perl, Python, Linux/UNIX shell) and databases (Oracle, MS-SQL, PostgreSQL, and NoSQL databases such as Cassandra).
- Telemetry and application performance monitoring tools such as Splunk, AppDynamics, Dynatrace, Grafana, and ITRS Geneos.
- Experience with core AWS services such as EC2, S3, EKS, RDS, and CloudWatch, as well as Datadog.
- Exposure to agile methodologies and to Continuous Integration (CI) and Continuous Delivery (CD) tooling such as Jenkins and Terraform.
- Experience in technology disaster recovery planning and test execution; prior experience in Java development is a plus.