Conceptualize and implement machine-learning-driven Site Reliability Engineering (SRE) frameworks and components to improve predictive monitoring and drive the SRE team’s journey toward an “Automation First” approach
Research the latest technologies and concepts, conceptualize solutions, and develop proofs of concept that improve the resiliency and performance of the production infrastructure. Design and implement innovative solutions and frameworks that improve software engineering velocity, infrastructure resiliency and security, and data availability
Develop common observability framework components (to be leveraged by enterprise applications) and define standards for configuration, monitoring, reliability, and performance engineering
Work with the operations team to resolve major incidents related to observability tools
Continuously improve automated remediation tasks to ensure the highest levels of availability
5+ years of experience building REST APIs, API integrations, and web services
Knowledge of server-side technologies such as WebSphere, JBoss, and Node.js
5+ years of experience in Python, with an emphasis on machine learning
Hands-on experience with Spark, Splunk, pandas, NumPy, and scikit-learn
Experience designing mission-critical, highly available enterprise applications
Hands-on experience with performance testing framework design and tuning Java applications
Experience managing NoSQL databases such as Couchbase and MongoDB
Strong knowledge of Linux internals and experience managing Linux systems in high-traffic environments
Strong knowledge of machine learning, mathematical modeling, and statistics
Strong interpersonal communication skills and the ability to work well in a diverse, team-focused environment