Tagcor

Required Skills

Site Reliability Engineer

Work Authorization

US Citizen
Green Card
EAD (OPT/CPT/GC/H4)
H1B Work Permit

Preferred Employment

Corp-Corp
W2-Permanent
W2-Contract
Contract to Hire

Employment Type

Consulting/Contract

Education Qualification

UG: Not Required
PG: Not Required

Other Information

No. of positions: 1
Posted: 16th Oct 2025

Job Details

1. Advanced Kubernetes - Must have strong skills running Kubernetes at scale on one of GKE, AKS, EKS, or RKE. Experience with kubectl and Helm.
2. Containers - Experience deploying Java (Spring Boot) microservices in Dockerized environments.
3. Observability - Experience setting up tools such as Prometheus/Grafana, Datadog, AppDynamics, and Splunk to provide actionable insight into a microservice environment, including but not limited to synthetics, application performance monitoring, logging, and alerting (PagerDuty/Opsgenie integrations).
4. Good CI/CD expertise - Jenkins, Azure DevOps, GitHub Actions, Argo CD, Artifactory, Azure Container Registry, Google Container Registry, and other similar tooling.
5. SCM - Working with tools like GitHub/GitLab for source code management, as well as experience with branching strategies like GitFlow and trunk-based development.

Job Summary:
We are looking for a seasoned Site Reliability Engineer to augment our team and support its strategy of driving products and technology into everything it delivers to accelerate business growth. As an SRE, you'll work as part of a team of problem solvers, helping to solve complex business issues from strategy to execution.

The team covers a variety of responsibilities that are executed by DevSecOps, Site Reliability and ML Ops Engineers, including:
· Defining reliability and resilience standards for infrastructure and application components.
· Proactively optimizing redundancies and monitoring and alerting practices and patterns.
· Developing resilient and highly available distributed systems.
· Infrastructure as Code development for building cloud tools.
· Secrets and configuration management.
· Monitoring systems and services, and providing incident and emergency response to triage and resolve system or client issues.
· Managing the application ecosystem and improving platform infrastructure and applications for high reliability, resiliency, performance, and quality.
· Supporting documentation, knowledge articles, and runbooks.
· Designing, building, and implementing SRE patterns that adhere to our client’s security