Required Skills

Site Reliability Engineer, Splunk, ServiceNow, Azure

Work Authorization

  • US Citizen

  • Green Card

Preferred Employment

  • Corp-Corp

  • W2-Permanent

  • W2-Contract

  • Contract to Hire

Employment Type

  • Consulting/Contract

Education Qualification

  • UG: Not Required

  • PG: Not Required

Other Information

  • No. of positions: 1

  • Posted: 22nd Sep 2025

Job Details

Cloud Strategy:

  • Provide thought leadership, mentorship, and technical vision related to site reliability, DevOps, and a "cloud-first" culture.

  • Analyze and implement cloud services to meet business goals, focusing on cost optimization, efficiency, and scalability.

  • Drive orchestration efforts for cloud services, design self-service capabilities, and stay current with emerging cloud technologies.

Infrastructure Automation and Design:

  • Collaborate on designing, building, and maintaining scalable infrastructure across cloud and on-prem environments.

  • Automate provisioning and configuration using tools like Terraform, Terragrunt, and Puppet.

  • Develop automation scripts, maintain CI/CD pipelines, plan for scalability and capacity, and conduct load testing as needed.

Reliability and Performance Engineering:

  • Ensure system reliability, availability, and performance through monitoring, alerting, and incident response.

  • Implement and manage SLOs/SLIs to meet reliability standards.

  • Identify and address performance bottlenecks across the infrastructure and application stack.

  • Build and maintain observability solutions (e.g., monitoring, logging, and tracing) and improve system health dashboards.

Security and Compliance:

  • Implement security measures for cloud-native applications and ensure compliance with industry standards (SOC 2, PCI, etc.).

  • Collaborate with security teams to audit and monitor systems, continuously updating security configurations and dashboards.

Incident Management and Root Cause Analysis:

  • Participate in on-call rotations to provide 24/7 support for the production environment.

  • Lead incident response activities and perform root cause analysis to prevent recurring incidents.

  • Conduct and document post-incident retrospectives (postmortems) to drive continuous improvement.

  • Create and maintain runbooks and operational documentation to support continuous improvement.

  • Proactively test system resilience through chaos engineering experiments and failure injection.

Disaster Recovery and Business Continuity:

  • Design and test disaster recovery (DR) and business continuity strategies, ensuring backup and failover mechanisms are effective.

Cost Management and Financial Optimization:

  • Monitor cloud usage and implement financial optimization (FinOps) practices to control infrastructure costs.

  • Collaborate with stakeholders to drive financial efficiency.

Collaboration, Knowledge Sharing, and Communication:

  • Collaborate across teams to ensure alignment and effective project implementation.

  • Communicate clearly during incidents and changes, providing transparency to stakeholders.

  • Mentor and share knowledge with team members to foster a collaborative, continuous-learning environment.

  • Maintain comprehensive documentation of system configurations, processes, and best practices.

Company Information