- Support the operation and maintenance of Linux servers: ensure operational availability and performance, conduct health checks, manage software upgrades and patching (including testing and implementation), and perform system optimization and administration
- Monitor server health and performance to identify issues, bugs, or potential improvements
- Adhere strictly to change management processes to ensure changes are properly planned, documented, and deployed
- Develop, review, and update operational documentation (SOPs, application checklists, playbooks, etc.)
- Provide after-hours on-call technical support
- Collaborate with the Security Operations Center (SOC) team on process optimization, tool tuning & integration, information sharing, playbook development, and incident response
- Implement automated near real-time monitoring of all tools to ensure proper operation and collection of pertinent data
- Carry out incident and problem management, both during and after incidents, including root cause analysis
- Provide application support, issue management, and escalation
- Perform incident investigation, diagnosis, and resolution
- Perform system monitoring and remediation
The successful candidate will meet the following qualifications:
- 7+ years of experience installing, administering, and maintaining Oracle or Red Hat Linux-based servers
- 5+ years of experience designing and implementing redundant systems including data backups/recoveries, high availability, load balancing, and disaster recovery
- 5+ years of experience designing, analyzing, and repairing large-scale distributed systems
- Experience with deploying and maintaining AWS and on-premises Linux servers
- Experience in application deployment automation, modern DevOps practices, and infrastructure as code
- Experience with IT automation tools such as Ansible Automation Platform, Chef, Puppet, or Terraform
- Knowledge of core IT infrastructure technologies including virtualization, networking, and storage management
- Technical documentation skills
- Comfortable interacting with management at various levels in a professional manner
- Takes ownership of areas of responsibility and makes recommendations and decisions on the improvement and operation of those areas
- High level of organizational skills
- Knowledge of and experience with Security Design and Implementation
- Ability to participate in after-hours on-call rotation
- Knowledge of backup and recovery methods and verification
- Knowledge of EMC PowerMax and Isilon storage, including snapshots
- Excellent written and verbal communication skills
- Ability to work in a fast-paced, schedule-driven, and customer-oriented environment
- Experience with Bash, Perl, and Python scripting
- Experience with LVM including online expansion of file systems
Preferred Qualifications:
- Experience supporting container-based platforms
- Experience with SUSE Manager for patching Linux servers
- Experience with Red Hat Satellite for patching Linux servers
- Experience with Prometheus and Grafana for system performance monitoring