-Fleet monitoring & recovery of assets in our private cloud environment that houses several compute servers with NVIDIA GPUs.
-Specific focus on building and stabilizing our virtualization infrastructure of ESXi, KVM and Hyper-V.
-Deploy and maintain a large farm of machines using the latest Configuration Management & Infrastructure Automation tools (Chef, Ansible, Terraform).
-Participate in on-call and rotational L1 support for round-the-clock monitoring and remediation of infrastructure issues (PagerDuty).
-Analyze and debug operating system, networking, configuration and performance problems.
-Assist in the roll-out and deployment of infrastructure configurations to support the latest hardware and technologies.
-Contribute to the development of monitoring systems that provide a fast, reliable, real-time pulse of the various infrastructure subsystems (Zabbix, BigPanda, Grafana).
-Bachelor’s or Master’s Degree in Computer Science or Software Engineering, or equivalent experience.
-Strong system and platform debugging skills.
-Virtualization experience is a key match (vSphere, Hyper-V, KVM, XenServer).
-Familiarity with configuration management tools (Chef preferred, Ansible).
-Experience working in large-scale enterprise production systems.
-5+ years of professional experience required.
-Ability to debug and analyze system issues and code in order to triage, root-cause and resolve infrastructure issues. Ability to work closely with the platform engineering team to understand hardware setups.
-Familiarity with the setup and maintenance of Linux and Windows hosts.
-Scripting experience in Python or Go, plus Unix shell proficiency.
-Experience with version control systems such as Perforce and Git.
Preferred:
-Familiar with private cloud setups (VMware, Dell, Apple)
-Scripting (Bash, Python, Go).
-Experience with virtualization and container technologies such as VMware, KVM, Hyper-V, Docker and Kubernetes.
-Background in automating bare metal and VM provisioning.
-Experience supporting GPUs, embedded device development, driver development and CUDA/TensorRT applications.
-Development experience in Chef, Ansible and infrastructure orchestration.