Job Description
Job Title:
HPC System Engineer, System, NSCC
Requisition ID:
603
Posting Start Date:
01/04/2026
Job Summary
The HPC System Engineer will design, optimize, and maintain HPC system architecture, including compute, interconnect, and storage components. This role involves advanced performance tuning, resource planning, and technology evaluation to ensure scalability, reliability, and security of NSCC's supercomputing infrastructure.
Roles and Responsibilities
- System Engineering & Optimization
  - Evaluate HPC system architecture, including compute, interconnect, and storage components.
  - Collaborate with HPC System Administrators to ensure system reliability and performance.
  - Assist in performance tuning and root-cause analysis for complex system-level issues.
  - Develop and maintain utility tools for system diagnostics and performance profiling.
- Resource & Workload Management
  - Configure and optimize job schedulers (e.g., Slurm, PBS Pro) to maximize resource utilization and throughput.
  - Develop and enforce policies for resource allocation and workload prioritization.
- Design & Planning
  - Assess future computational requirements and contribute to HPC system architecture design.
  - Evaluate emerging technologies (processors, accelerators, interconnects, storage solutions, programming models).
- Compliance & Risk Management
  - Define and implement security policies in collaboration with administrators.
  - Conduct regular security checks and ensure compliance with organizational standards.
- Collaboration & Documentation
  - Work closely with Middleware and Storage Engineers to ensure system compatibility.
  - Document system architecture, configurations, and engineering decisions.
Qualifications:
- Degree in Computer Science, Engineering, IT, or another relevant area.
- At least 3 years of experience in managing HPC systems.
- Highly proficient in UNIX/Linux environments and the command-line interface (CLI).
- Experience with cluster management software (xCAT, BCM, PHPC, HPCM).
- Experience with job scheduling and workload management software (Slurm or PBS Pro).
- Strong knowledge of HPC storage principles and experience in managing parallel file systems (Lustre, GPFS, BeeGFS).
- Strong knowledge of RDMA-based interconnects (InfiniBand, RoCE).
- Understanding of basic network protocols such as DHCP, DNS, TFTP, and SMTP.
- Good knowledge of scripting languages such as Python, Bash, or Perl.
- Demonstrated ability to analyse complex issues and develop effective solutions.