HPC Systems Engineer

Employer
Northeastern University
Location
Massachusetts, United States
Salary
Not specified
Date posted
Apr 11, 2024


About the Opportunity

This job description is intended to describe the general nature and level of work being performed by people assigned to this classification. It is not intended to be construed as an exhaustive list of all responsibilities, duties and skills required of personnel so classified.

Job Summary
The Research Computing (RC) systems team at Northeastern University (NU) is seeking a talented individual to fill the role of High-Performance Computing (HPC) Systems Engineer. This critical role will help operate and maintain cutting-edge technologies in support of the university's research computing efforts and assist Northeastern University's researchers in taking full advantage of the HPC resources located at the Massachusetts Green High Performance Computing Center (MGHPCC).

The successful candidate will maximize system uptime and ensure that research needs are met, and also help with the design and integration of new and novel technology solutions to support research, teaching, and learning.

This position is eligible for alternative work locations including remote/hybrid arrangements.

Minimum Qualifications
Requirements
  • Minimum of 3 years post-secondary education or relevant work experience.
  • At least one year of experience in a combination of building, configuring, and administering large Linux clusters (e.g. storage, cluster computing, network, database, virtualized systems).
  • Experience with configuration management (e.g. Ansible) and version control (Git).
  • Experience diagnosing system and application software problems.
  • Knowledge of Linux kernel internals and kernel modules.
  • Knowledge of or experience in networking systems, including DNS, HTTP, and TCP/IP.
  • Familiarity with cluster configuration and management tools (e.g. Torque, SLURM, OGE).
  • Familiarity with data security standards (e.g. NIST SP 800-171).
  • Experience with confidential computing using trusted execution environments (e.g., Intel SGX and other secure enclaves on on-premise clusters or on cloud platforms).
  • Experience with secure data processing and storage (e.g., health records, medical data) on shared HPC systems.
  • Experience with HPC systems that compute, analyze and store protected health information (PHI), personally identifiable information (PII), data protected under International Traffic in Arms Regulations, and other types of data that require privacy.
  • Demonstrated experience working in an environment with rapidly changing underlying technologies and job priorities.
  • Knowledge of or experience administering computer security software and hardware requirements.
  • Demonstrated team performance skills, service mindset approach, and the ability to act as a trusted collaborator.
  • Demonstrated strong writing skills with an ability to document and communicate solutions to users and team members clearly.
  • Ability and willingness to learn new technologies and remain current in developing trends in the HPC community.
  • A strong desire and commitment to push the envelope of new technologies and opportunities and be able to communicate the potential benefits to other team members.


Preferred
  • Experience with HPC systems, in particular HPC clusters.
  • Experience with parallel file systems (e.g. GPFS, BeeGFS, Ceph).
  • Experience with compilers, e.g. C/C++.
  • Experience with parallel computing software (MPI, OpenMP).
  • Experience with scripting languages, e.g. Bash, Python, Perl.
  • Experience working with Agile methodologies.
  • Experience using Public Cloud Services in AWS and Azure.
  • Experience with virtualization tools, container development, and deployment/orchestration, e.g. Docker, Kubernetes, Terraform, Vagrant.
  • Experience with automating IT infrastructure provisioning, Infrastructure as Code (IaC).
  • Experience with the use of configuration management and orchestration tools.
  • Experience with system management, monitoring tools (e.g., Ganglia, Nagios).
  • Experience managing systems utilizing GPU (NVIDIA and AMD) clusters for AI/ML jobs.
  • Knowledge of networking fundamentals including TCP/IP, traffic analysis, common protocols, and network diagnostics.
  • Understanding of infrastructure technologies including server, storage, network, database, and virtualization.
  • Demonstrated ability to quantify, analyze, determine root cause, and resolve system and communication network issues, and develop preventive actions.


Key Responsibilities & Accountabilities
  • Help administer the RC HPC cluster, storage systems and other RC infrastructure, including hardware maintenance.
  • NU HPC systems are operational 24x7 and may require work beyond standard work hours.
  • Implement and validate data security measures, backups, and data retention policies in compliance with funding agency requirements and relevant regulations.
  • Diagnose, solve, and implement solutions for the HPC cluster which may include hardware repairs (break/fix), operating system configuration, system software updates, and procedure automation.
  • Proactively monitor and maintain the health and integrity of the RC systems including upgrading and patching.
  • Use and develop additional monitoring scripts and/or platforms as needed.
  • Take part in collaborative efforts defining and tracking performance metrics to ensure efficient current and future use of RC resources.
  • Assist end-users through the RC's ticket queue system.
  • Assist the RC systems team with network hardware and network service maintenance and configuration.
  • Build infrastructure that computes, analyzes, and stores protected health information (PHI), personally identifiable information (PII), data protected under International Traffic in Arms Regulations, and other types of data that require privacy.
  • Build infrastructure to migrate jobs between on-premise clusters and remote/cloud computing platforms.
  • Communicate progress and participate in reviews with the Senior HPC Systems Administrator, technical staff and senior management.
  • Work in collaboration with RC's Documentation Specialist to create new or update existing internal documentation in support of the RC HPC infrastructure.
  • Build and maintain relationships with external vendor technicians, engineers and support teams.
  • Participate in external collaborations (locally/regionally) such as NESE, NERC, MOC, etc.
  • Attend conferences and workshops relevant to HPC technologies to advance skills.
  • Participate in regional/national/international collaborations to advance skills and expand the NU RC solution/service catalog.
  • Promote diversity, equity, inclusion, and accessibility by fostering a healthy workplace culture.


Position Type

Research

Additional Information

Northeastern University considers factors such as candidate work experience, education and skills when extending an offer.

Northeastern has a comprehensive benefits package for benefit-eligible employees. This includes medical, vision, dental, paid time off, tuition assistance, wellness & life, and retirement, as well as commuting & transportation. Visit https://hr.northeastern.edu/benefits/ for more information.

Northeastern University is an equal opportunity employer, seeking to recruit and support a broadly diverse community of faculty and staff. Northeastern values and celebrates diversity in all its forms and strives to foster an inclusive culture built on respect that affirms inter-group relations and builds cohesion.

All qualified applicants are encouraged to apply and will receive consideration for employment without regard to race, religion, color, national origin, age, sex, sexual orientation, disability status, or any other characteristic protected by applicable law.

To learn more about Northeastern University's commitment and support of diversity and inclusion, please see www.northeastern.edu/diversity.


To apply, visit https://northeastern.wd1.myworkdayjobs.com/en-US/careers/job/Boston-MA-Main-Campus/HPC-Systems-Engineer_R124254-1


