Required Skills

Prometheus, Grafana, ELK stack, Aternity

Work Authorization

  • US Citizen

  • Green Card

  • EAD (OPT/CPT/GC/H4)

  • H1B Work Permit

Preferred Employment

  • Corp-Corp

  • W2-Permanent

  • W2-Contract

  • Contract to Hire

Employment Type

  • Consulting/Contract

Education Qualification

  • UG: Not Required

  • PG: Not Required

Other Information

  • No. of positions: 1

  • Posted: 15th Dec 2023

JOB DETAIL

  1. Observability and Monitoring:

  • Develop and implement robust observability strategies, including logging, metrics, and tracing, to gain deep insights into the performance and health of our systems.

  • Collaborate with cross-functional teams to establish and enforce best practices for instrumentation, logging, and monitoring throughout the software development lifecycle.

  2. Site Reliability Engineering:

  • Lead initiatives to improve the reliability, availability, and scalability of our applications and infrastructure.

  • Collaborate with development teams to design and implement systems that are resilient to failures and capable of quick recovery.

  • Drive the adoption of SRE principles and practices across the organization.

  3. Incident Management:

  • Develop and refine incident response processes, ensuring timely detection, analysis, and resolution of incidents.

  • Collaborate with teams to conduct post-incident reviews, identify root causes, and implement preventive measures.

  4. Automation and Tooling:

  • Build and maintain automation tools for deployment, monitoring, and incident response to streamline operational processes.

  • Evaluate and integrate third-party tools to enhance observability and SRE capabilities.

  5. Collaboration and Leadership:

  • Provide technical leadership and mentorship to the engineering team. 

  • Collaborate with product managers, architects, and other stakeholders to align observability and SRE initiatives with business goals. 

Qualifications: 

  • Bachelor's or higher degree in Computer Science, Software Engineering, or a related field. 

  • Extensive experience in software engineering with a focus on observability, monitoring, and SRE. 

  • Strong expertise in designing and implementing distributed systems for high availability and reliability. 

  • Proficiency in APM (application performance monitoring), RUM (real user monitoring), synthetic monitoring, event correlation, and alert and incident management (e.g., OpenTelemetry (OTel), Jaeger, Kloudfuse, ServiceNow).

  • Proficiency in one or more programming languages (e.g., Java, Python, Go). 

  • Experience with cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes). 

  • In-depth knowledge of observability tools and frameworks (e.g., Prometheus, Grafana, ELK stack, Datadog, Aternity) and incident management processes. 

  • In-depth knowledge of ML and AI techniques (e.g., anomaly detection, outlier detection, AIOps, LLMs).

  • Excellent communication and collaboration skills. 

  • Demonstrated ability to lead technical initiatives and mentor team members. 

Preferred Qualifications: 

  • Certifications in relevant areas such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or equivalent. 

  • Previous experience in a leadership or management role. 

  • Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Packer, and Crossplane.

Company Information