Design and implement highly automated systems and services that ensure the availability, reliability, and scalability of infrastructure and applications.
Build and maintain monitoring and alerting to provide timely feedback on the performance and health of systems, networks, and applications.
Design and implement automation tools to reduce manual toil, streamline repetitive tasks, and enhance overall operational efficiency.
Define and build Service Level Indicators (SLIs), along with the Service Level Objectives (SLOs), error budgets, and burn-rate alerts derived from them.
Work closely with development teams to embed reliability best practices into the software development process. Provide mentorship and training to cross-functional teams on SRE principles, encouraging a shared responsibility for the reliability of our services.
Collaborate with our support, operations, and engineering teams to investigate and troubleshoot complex problems.
Observe and monitor systems to gain insight into their performance, health, availability, and internal state.
Understand what to monitor for the systems you manage, how the monitoring data is stored, and how to interpret that data to decide on future actions.
Participate in continuous improvement efforts that span multiple cross-functional domains and inform the creation of new standards.
Take part in an on-call rotation, continuously improve automation and documentation, and mentor others on infrastructure automation best practices to encourage adoption.
Overcome differences of opinion and drive team alignment around a specific goal or solution.
Hold associates and teams accountable for adhering to practices and policies.
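
To illustrate how the SLO, error budget, and burn-rate alert concepts named above relate to one another, here is a minimal Python sketch. The function names are hypothetical, and the default threshold of 14.4 is an assumption: it is a commonly cited value for a one-hour window against a 30-day SLO period, not something this role prescribes.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly on schedule;
    higher values mean it will run out before the SLO period ends.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (failed / total) / error_budget(slo_target)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: alert only if both windows exceed the threshold.

    The threshold default is an assumed example value, not a fixed standard.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold
```

Requiring both a long and a short window to exceed the threshold is one way to avoid paging on a spike that has already subsided.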