Monitor and resolve system errors and disruptions, and document each resolution. Manage incidents according to the ITIL lifecycle. Liaise with upstream data providers to resolve issues. Respond to and resolve user inquiries and operational requests. Document and review handling and resolution steps for support scenarios.
Prepare and present stability reports. Analyze alert and stability trends and make recommendations. Investigate the root cause of issues, and inform and educate developers about the cause so that they can mitigate it.
Automate (1) resolution of common problems, (2) routine investigations, and (3) routine user requests, using scripts or an available programming platform. Lead reliability or business-driven projects. Perform reliability engineering.
You will work closely with engineering/development teams to design, build, and maintain systems, and advise them on product selection, schema design, and query tuning.
You will troubleshoot issues across the entire stack: hardware, software, application, and network.
You will mentor other SREs on standard methodology for everything from monitoring to troubleshooting complex code and database issues.
You will identify and drive opportunities to improve automation for the company; scope and create automation for deployment, management, and visibility of our services.
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services.
Participate in the on-call rotation and in periodic conference calls with specialists in other time zones.