Tagcor

Required Skills

Site Reliability Engineer

Work Authorization

US Citizen
Green Card
EAD (OPT/CPT/GC/H4)
H1B Work Permit

Preferred Employment

Corp-Corp
W2-Permanent
W2-Contract
Contract to Hire

Employment Type

Consulting/Contract

Education Qualification

UG :- Not Required
PG :- Not Required

Other Information

No. of positions :- 1
Posted :- 3rd Oct 2023

JOB DETAIL

· Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.

· Manage and improve the end-to-end availability and performance of mission-critical services, and build automation to prevent problem recurrence.

· Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage the reliability of infrastructure and applications.

· Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency.

· Maintain infrastructure (as infrastructure as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments.

· Practice sustainable incident response and blameless postmortems.

· Build infrastructure and drive projects that deliberately break things with the aim of improving the robustness of production systems.

· Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.

· Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems.

· Monitoring service-level indicators (SLIs). An SLI could be the proportion of successful requests out of total requests; in that case, a high SLI is the target. SREs also track metrics such as availability, uptime, performance, latency, error count, and throughput. Regular monitoring is also essential to ensure proper resource utilization of containers and to avoid out-of-memory (OOM) errors.

· Setting SLOs and SLAs and determining error budgets. Once you have determined baseline system performance, you can set service-level objectives (SLOs). These are typically internal targets, such as 99.99% availability, and the error budget is the amount of unreliability the target still permits (0.01% in this example). While SREs typically oversee functional metrics, some teams set goals for non-functional metrics as well. SREs also help determine service-level agreements (SLAs), which are legally binding and typically partner-facing.

· Responding to incidents. On-call SREs will be tasked with finding the root cause of issues as they arise. When triaging an incident, it’s helpful to have all the necessary logs and tools immediately at hand. This is one area where automation can assist by pulling relevant details to instantly build a case.

· Writing postmortems. After an incident has been dealt with, it’s important to learn from it. Postmortems are common in cybersecurity practice and often fall under the responsibility of an SRE. These reviews work through a set list of questions to get to the heart of an incident and identify the root cause(s), so that it does not happen again.
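
To make the SLI and error-budget arithmetic in the bullets above concrete, here is a minimal Python sketch. The function names, the request counts, and the choice to apply the 99.99% target to request success are illustrative assumptions, not a description of the tooling used in this role.

```python
def sli(successful: int, total: int) -> float:
    """Request-success SLI: successful requests divided by total requests."""
    if total == 0:
        # Assumption: a window with no traffic is treated as fully healthy.
        return 1.0
    return successful / total


def error_budget_remaining(successful: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the share of requests the SLO allows to fail,
    i.e. (1 - slo) * total.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # nothing served, so nothing spent
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 50 of which failed.
    ok, total = 999_950, 1_000_000
    print(f"SLI: {sli(ok, total):.5f}")
    print(f"Budget left: {error_budget_remaining(ok, total, 0.9999):.0%}")
```

With a 99.99% objective, the hypothetical window above allows about 100 failed requests, so 50 failures spend roughly half of the budget.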

Qualifications

· Bachelor’s degree in design, computer science, or a related technical field.

· Strong debugging, troubleshooting, and problem-solving skills

· Proficient in Node.js; familiarity with other scripting languages and tools such as JavaScript, Python, Maven, Ansible, and Bash is a plus.

· Experience with monitoring and alerting systems like Dynatrace, Prometheus, Grafana.

· Experience with log and metrics analytics platforms like Sumo Logic and Splunk.

· Experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS native components, CloudWatch, and Dynatrace.

· Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, and Ansible.

· Proven history of leveraging automation

· Experience using tools like PagerDuty for managing incidents.

· Understanding of standard networking protocols and concepts such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load-balancing strategies.

· Experience in Serverless Application Framework

· Experience in containerized workloads and management platforms such as Docker or Kubernetes

· Familiarity with distributed systems, including microservices, is a plus.

· Experience in Infrastructure automation tools such as CDK

· Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, and Bamboo.

· Effective communication, collaboration & negotiation skills with the ability to interface with various business units and vendors.

· Experience liaising with developers, operations engineers, and third-party resources.

· Experience consuming APIs.

Soft Skills

· Ability to work in a team and independently.

· Excellent verbal and written communication skills

· Multitasking

· Time management

Senior Site Reliability Engineer