The first 6 months will likely be heavy on the overall design, laying out design/code patterns, and the testing strategy. As we get into building, the work shifts to code reviews and oversight. When testing starts, this person will ensure that the monitoring and alerting built by the CDS SRE team are adequate and that the results meet the design goals. This role does not include performance testing. This person will not be building dashboards or doing anything that conflicts with the CDS SRE team.
I pushed for the Gremlin chaos testing tool, and we're doing a POC now. Have they used chaos tools to validate a resilience design? Trading sees this stuff all day, where there's a slowdown or a drop. All my new designs are intended to handle anything and automatically switch, recover, or circuit break. These folks should be looking at every box and line in those designs and driving discussion and coding to address resilience at every step.
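For a sense of what "automatically switch/recover/circuit break" can look like in code, here is a minimal sketch of a circuit breaker with a fallback path. It is only an illustration and not taken from our designs: the class name `CircuitBreaker`, the parameters `max_failures` and `reset_after`, and the `primary`/`fallback` callables are all hypothetical.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker; all names and defaults are assumptions."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, primary, fallback):
        # While open and still cooling down, skip the primary entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            # Open (or re-open after a failed trial call) and switch to fallback.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success: recover by closing the breaker and resetting the count.
        self.failures = 0
        self.opened_at = None
        return result
```

A chaos test would then inject a slowdown or drop on the primary path and check that calls switch to the fallback, and that the primary is picked up again once it recovers.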
· We're doing a migration from MQ to Kafka, so Kafka is going to be a requirement