- 5+ years’ experience managing and monitoring incident/crisis response
- 3+ years’ experience monitoring with tools such as Grafana and New Relic
- 1+ years’ experience programming in a language such as Python or Go
- Experience with Infrastructure as Code, including Terraform
- On-call experience
- Lead the on-call teams and processes to improve site reliability
- Focus on managing large-scale systems with high loads 24/7
- Support our SRE and engineering teams in their day-to-day work
- Build, enhance, and maintain runbooks, working cross-functionally with various teams
- Thrive on automating processes as much as possible
- Observability and monitoring with services like Prometheus, Grafana, and New Relic
- Other duties and responsibilities as assigned
- Lead the NOC tools, runbooks, processes and teams
- Automation of runbooks as necessary
- Work with our development teams on improving the system
- Attention to detail and ability to manage multiple projects
- Strong analytical skills and ability to present complex data on site reliability and other factors
- Demonstrated ability to work with third parties and collaborate on solutions
- Experience in monitoring using New Relic/Grafana/Prometheus
- Experience scripting in Python/Go
Required Skills:
1. Cloud Concepts: Extensive hands-on experience in AWS/GCP, Kubernetes
2. Infrastructure as Code: Terraform; CI/CD: GitHub, Jenkins
3. Monitoring Tools: Experience in using New Relic, Grafana, Prometheus.
Required Education: At least a Bachelor’s degree (or equivalent experience) in Computer Science, Software/Electronics/Electrical Engineering, Information Systems, or a closely related field.
Necessary Skills: AWS, GCP, Kubernetes, Terraform