Define and lead the implementation of SRE best practices, including SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets, to ensure service reliability and performance.
Contribute to the design and operation of scalable, robust, and highly available infrastructure across on-premises, cloud, or hybrid environments (GCP, Kubernetes).
Oversee incident management and root cause analysis processes, driving swift resolution and fostering a blameless post-mortem culture.
Develop and refine CI/CD pipelines and automated deployment strategies to achieve faster, more reliable releases.
Implement and maintain monitoring, observability, and alerting solutions (e.g., Prometheus, Grafana, Splunk) to provide actionable insights and proactive issue detection.
Collaborate with cross-functional teams (Development, QA, Product, and Security) to ensure new features and services meet reliability and performance standards from design through production.
Mentor and coach junior engineers, promoting a culture of knowledge sharing, continuous improvement, and technical excellence.
Continuously evaluate and adopt new tools and technologies that can enhance system reliability, scalability, and developer productivity.