Job Description
Job Title: HPC AI Engineer, Frontier, NSCC
Requisition ID: 591
Posting Start Date: 01/04/2026
ABOUT THE ROLE
As our HPC AI Engineer, you will be a key expert supporting researchers in leveraging our new supercomputer system for large-scale artificial intelligence. You will support and optimise massive AI application workloads, working with performance engineers to profile AI applications and establish best practices. Your work will directly enable national-scale projects in multimodal AI, healthcare, and AI for Science.
RESPONSIBILITIES
- Provide HPC and scientific domain advice to users of NSCC systems.
- Engage and collaborate with new researchers, communities, and disciplines with computationally intensive requirements.
- Support and optimise large-scale AI application workloads.
- Work with HPC performance engineers to profile and build performance models of the AI applications and workflows.
- Design, develop and implement HPC software best practices for AI applications and workflows.
- Assist in the planning and design of future HPC systems, including benchmarking AI workloads on various platforms and recommending the most suitable architecture for the research community.
- Analyse system and user job data for efficient resource allocation and management.
- Develop HPC utilities, dashboards and automated testing tools for NSCC HPC systems.
- Develop HPC user and best practice guides for NSCC HPC systems.
- Keep up to date with developments in scientific domain research, HPC systems, and software technology.
QUALIFICATIONS
- Bachelor's degree in computer science, computer engineering, or another relevant field.
- Proven working knowledge of models and algorithms in at least one of the following areas: generative models, computer vision, graph neural networks, or AI for Science applications.
- Ideally, 3 years of experience developing code for AI training and inference.
- Experience setting up AI software stacks and familiarity with a diverse range of them.
- Good knowledge of AI application performance optimisation and troubleshooting.
- Strong programming skills in Python; familiarity with C/C++ programming is a plus.
- Familiar with how AI frameworks (e.g. PyTorch, TensorFlow, JAX) work and with using them for research.
- Familiarity with GPU architectures and GPU programming is highly desired.
- Familiar with the Linux environment, scripting languages, and profiling and debugging tools.
- Familiar with HPC job schedulers and container technologies.
- Familiar with object storage (S3); familiarity with HPC storage (Lustre) is a plus.
- Demonstrated team player with strong problem-solving skills.
- Demonstrated effective communication skills, including the ability to articulate technical concepts to a diverse range of audiences.
- Demonstrated ability and willingness to contribute novel ideas and approaches in support of the research community.
- Demonstrated passion for continuous learning and exploring new technologies or domains.