|
ABOUT THE ROLE
As our HPC AI Engineer, you will be a key expert supporting researchers in leveraging our new supercomputer system for large-scale artificial intelligence. You will support and optimise massive AI application workloads, working with performance engineers to profile AI applications and establish best practices. Your work will directly enable national-scale projects in multimodal AI, healthcare, and AI for Science.
RESPONSIBILITIES
- Provide HPC and scientific domain advice to users of NSCC systems.
- Engage and collaborate with new researchers, communities, and disciplines that have computationally intensive requirements.
- Support and optimise large-scale AI application workloads.
- Work with HPC performance engineers to profile AI applications and workflows and to build performance models of them.
- Design, develop and implement HPC software best practices for AI applications and workflows.
- Assist in the planning and design of future HPC systems, including benchmarking AI workloads on various platforms and recommending the most suitable architecture for the research community.
- Analyse system and user job data to support efficient resource allocation and management.
- Develop HPC utilities, dashboards and automated testing tools for NSCC HPC systems.
- Develop user guides and best-practice guides for NSCC HPC systems.
- Stay up to date with developments in scientific domain research, HPC systems and software technology.
QUALIFICATIONS
- Bachelor's degree in computer science, computer engineering, or another relevant field.
- Proven working knowledge of models and algorithms in at least one of the following areas: generative models, computer vision, graph neural networks, or AI for Science applications.
- Ideally, 3 years of experience developing code for AI training and inference.
- Experience setting up AI software stacks, and familiarity with a diverse range of such stacks.
- Good knowledge of AI application performance optimisation and troubleshooting.
- Strong programming skills in Python; familiarity with C/C++ programming is a plus.
- Familiarity with how AI frameworks (e.g. PyTorch, TensorFlow, JAX) work and with using them for research.
- Familiarity with GPU architectures and GPU programming is highly desirable.
- Familiarity with the Linux environment, scripting languages, and profiling and debugging tools.
- Familiarity with HPC job schedulers and container technologies.
- Familiarity with object storage (S3); familiarity with HPC storage (Lustre) is a plus.
- Demonstrated team player with strong problem-solving skills.
- Demonstrated effective communication skills, including the ability to articulate technical concepts to a diverse range of audiences.
- Demonstrated ability and willingness to contribute novel ideas and approaches in support of the research community.
- Demonstrated passion for continuous learning and exploring new technologies or domains. |