US Citizen
Green Card
Corp-Corp
W2-Permanent
W2-Contract
Contract to Hire
Consulting/Contract
UG: Not Required
PG: Not Required
No. of positions: 1
Posted: 13th Sep 2025
We are seeking a skilled Site Reliability Engineer (SRE) to join our team, ensuring the stability, scalability, and reliability of our systems and applications. The ideal candidate will have a blend of software engineering and operations expertise, dedicated to automating infrastructure, managing complex distributed systems, and enhancing the overall performance and reliability of our platforms.
Key Responsibilities:
1. System Reliability and Optimization
* Design, implement, and maintain reliable systems and applications, ensuring optimal performance and minimizing downtime.
* Develop and enforce SLOs, SLIs, and SLAs to meet availability targets for critical systems.
2. Incident Management and Troubleshooting
* Proactively monitor systems using tools like Splunk, Prometheus, and Grafana, and respond to incidents swiftly.
* Conduct thorough root cause analysis (RCA) and root cause corrective action (RCCA) to identify, resolve, and prevent recurrence of incidents.
3. Automation and Infrastructure as Code (IaC)
* Automate repetitive tasks and processes to increase efficiency, reduce manual intervention, and enhance system reliability.
4. Capacity Planning and Scalability
* Monitor system performance and plan for capacity and scalability, ensuring resources are effectively managed as the organization grows.
* Analyze system metrics to detect trends and anticipate future scaling needs.
5. Disaster Recovery and Business Continuity
* Create and test disaster recovery plans, coordinating backup and restoration processes to minimize downtime.
* Work with stakeholders to ensure systems are prepared for unexpected events, with clear, documented procedures.
6. Collaboration and Documentation
* Collaborate with development teams to align reliability with application design, deployment, and maintenance.
* Document key processes, troubleshooting guidelines, and best practices to support knowledge sharing and onboarding.
7. Data-Driven Analysis and Reporting
* Utilize SQL, BigQuery, and data analytics skills to generate reports, track metrics, and drive data-informed decisions to improve system reliability.
* Conduct regular performance reviews and create reports on system health and incident trends for continuous improvement.
Preferred Qualifications:
* Experience with cloud platforms (AWS, GCP, or Azure)
* Proficiency with monitoring and observability tools
* Strong analytical skills and experience with automation tools and scripting languages (Python, Bash, etc.)