Design, build and maintain scalable and robust infrastructure for AI/ML systems, including cloud-based environments, containerization and orchestration platforms.
Develop and implement CI/CD pipelines to automate the deployment, testing and monitoring of AI/ML models and applications.
Collaborate with data scientists, data engineers and software engineers to optimize model training, deployment and inference pipelines.
Monitor and troubleshoot AI/ML systems to ensure high availability, performance and reliability.
Maintain and monitor model training and inference pipelines across multi-cloud tenants, with a particular focus on Large Language Models (LLMs).
Maintain Kubernetes pods, the container registry, the virtual machine image library and the model registry.
Monitor infrastructure utilization and costs related to model training, inference and GPU usage.
Implement best practices for security, data privacy and compliance in AI/ML workflows and infrastructure.
Evaluate and integrate new tools, technologies and frameworks to improve the efficiency and effectiveness of our MLOps processes.
Mentor and provide technical guidance to junior members of the organization.
Stay up-to-date with the latest advancements and trends in MLOps, DevOps and cloud technologies and share them with the team.
Education Qualifications
Bachelor’s or higher degree in Computer Science, Engineering or a related field.