Leading the AMS card operations team toward SRE and automation practices for the Fraud Authorization and Authentication area.
Responsible for creating observability solutions by building AppDynamics single-pane-of-glass dashboards for both the Fraud Authorization and Authentication Area (FAA) and the Collections Value Stream (CVS).
Built single-pane-of-glass dashboards in Datadog for the Collections Value Stream.
Worked on migrating Collections Value Stream alerts from AppDynamics to Datadog.
Worked on migrating dashboards from AppDynamics to Datadog and created monitors for the Recovery application in Datadog.
Created ServiceNow reports on incidents, RRTs, problem tickets, and deployments for tracking and leadership visibility across the FAA and CVS areas.
Created an aggregator framework for incident monitoring to reduce false positives and noise-generated incidents, eliminating 1,500 false-positive tickets.
Worked on automating the IRIS health check and designed a framework for automatically reporting database issues to the Recovery DBA team.
Created runbooks and postmortem reports for all alerts created and for RRTs.
Issue triage previously took 30-40 minutes; after the observability framework was implemented, AIL MTTD improved by 85%, bringing issue detection down to 5-10 minutes.
Created an enterprise-level SRE roadmap, introduced SLI/SLO concepts at the component level, and introduced Gremlin for chaos engineering.
Scrum Master for SRE sprint planning, risk identification and mitigation, and capacity and velocity planning.
Working on creating chaos engineering scenarios and assisting the team in getting Gremlin agents installed on pre-prod servers.
Guided the team in implementing automation for critical processes such as SAS rule validation and deployment.