Maintain infrastructure and ensure resource availability across these clusters.
Provide monitoring tools that verify processing jobs run on schedule, auto-mitigate known issues, and raise ICM alerts.
Ensure the quality of issues reported to the engineering team.
Automate the initial diagnostics for new incident tickets routed to the team in ICM, and perform mitigation steps for common issues using tools such as ACIS.
Own all aspects of the current toolsets, from monitoring for unexpected performance, scale, and reliability issues to delivering improvements requested by engineering teams.
Provide self-service tools that let partners unblock themselves, and help resolve incidents by troubleshooting problems.