Required Skills

Site Reliability Engineer

Work Authorization

  • US Citizen

  • Green Card

Preferred Employment

  • Corp-Corp

  • W2-Permanent

  • W2-Contract

  • Contract to Hire

Employment Type

  • Consulting/Contract

Education Qualification

  • UG: Not Required

  • PG: Not Required

Other Information

  • No. of positions: 1

  • Posted: 21st Nov 2024

Job Details

1. System Reliability and Optimization

* Design, implement, and maintain reliable systems and applications, ensuring optimal performance and minimizing downtime.

* Develop and enforce SLOs, SLIs, and SLAs to meet availability targets for critical systems.


2. Incident Management and Troubleshooting

* Proactively monitor systems using tools such as Splunk, Prometheus, and Grafana, and respond to incidents swiftly.

* Conduct thorough root cause analysis (RCA) and root cause corrective action (RCCA) to identify the causes of incidents, resolve them, and prevent recurrence.


3. Automation and Infrastructure as Code (IaC)

* Automate repetitive tasks and processes to increase efficiency, reduce manual intervention, and enhance system reliability.


4. Capacity Planning and Scalability

* Monitor system performance and plan for capacity and scalability, ensuring resources are effectively managed as the organization grows.

* Analyze system metrics to detect trends and anticipate future scaling needs.


5. Disaster Recovery and Business Continuity

* Create and test disaster recovery plans, coordinating backup and restoration processes to minimize downtime.

* Work with stakeholders to ensure systems are prepared for unexpected events, with clear, documented procedures.


6. Collaboration and Documentation

* Collaborate with development teams to align reliability with application design, deployment, and maintenance.

* Document key processes, troubleshooting guidelines, and best practices to support knowledge sharing and onboarding.


7. Data-Driven Analysis and Reporting

* Use SQL, BigQuery, and data analytics skills to generate reports, track metrics, and drive data-informed decisions that improve system reliability.

* Conduct regular performance reviews and create reports on system health and incident trends for continuous improvement.

Company Information