HPCE User Policies

The High Performance Computing (HPC) system is managed by the HPCE team of the Computer Centre. The user policies here are developed by the Cluster Committee.

Summary

  • Home Quota = 50 GB, permanent
  • Scratch Quota = 500 GB, deleted after two weeks

| Group  | No. of Nodes | Queue Name | No. of Cores | Duration |
|--------|--------------|------------|--------------|----------|
| SMALL  | 74           | small8     | 1–8          | 2 hours  |
|        |              | small20    | 1–20         | 48 hours |
|        |              | small40    | 1–40         | 1 week   |
| MEDIUM | 80           | medium     | 21–40        | 48 hours |
|        |              | long       | 41–80        | 48 hours |
|        |              | verylong   | 41–80        | 1 week   |
| LARGE  | 46           | large      | 81–250       | 48 hours |
| GPU    | 12           | gpuq       | Default      | 48 hours |

If a faculty member or a centre wants to use more compute power and storage than permitted by default, HPCE offers simple pricing options.

Detail

The user policies given below are intended to:

  • help make system usage better and more effective for users,
  • decide machine time, number of cores, and file system allocation for users,
  • plan backups,
  • ensure proper maintenance of the system, and
  • enable researchers to contribute to the system.

User policies are continually reviewed and updated by the Cluster Committee. Questions or concerns should be e-mailed to hpce@iitm.ac.in.

It is important that all users have access to the HPC system and that the integrity of all users’ code is maintained. Following the HPC usage policies requires that users understand the nature of HPC computing and the configuration of the hardware and software on the system. Please contact hpce@iitm.ac.in to ask questions or to report problems. If a user does not abide by the HPC usage policies, the Cluster Committee has the right to take disciplinary action against the user. All commercial applications have only academic licenses, and they should be used only for research and educational purposes.

Users must use the batch submission system of the scheduler:

  • All users are requested to go through the HPCE web site before using the system, as it has detailed information about the system and how to use it.
  • Users must log in to the login node and use the scheduler to submit their jobs on the HPC cluster (a sample job script is sketched after this list).
  • Users should not run their applications on the head nodes. All such interactive jobs will be killed after 5 minutes of CPU utilization.
  • All unused files in the scratch area must be backed up by the user; system staff will delete such files from scratch without taking a backup.
  • Users must not log into the compute nodes for the purpose of running a job directly without the scheduler.
  • If you believe interactive use of a node is necessary for your research, please contact the HPCE team.
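
For illustration only, a minimal PBS job script might look like the sketch below. The queue name, core count, walltime, and program name are assumptions chosen to fit the SMALL group in the table above; the official job submission scripts at https://hpce.iitm.ac.in/ are the authoritative reference.

```bash
#!/bin/bash
#PBS -N sample_job
#PBS -q small8
#PBS -l select=1:ncpus=8
#PBS -l walltime=02:00:00
#PBS -j oe

# Minimal illustrative PBS job script: queue "small8", 1 node x 8 cores,
# 2-hour walltime. These values are assumptions; take the authoritative
# templates from https://hpce.iitm.ac.in/.

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_program input.dat > output.log
```

The script is submitted from the login node with qsub job.sh; qstat -u $USER shows its status, and qdel <jobid> removes it.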

Correspondence from the Cluster Administrative Committee:

  • Users will be notified by e-mail about issues related to the system, such as scheduled downtime, upgrades, system crashes etc.
  • Requests for information and feedback about the system and access should be by e-mail.

OBTAINING AN ACCOUNT:

To get an account on AQUA, a user has to fill in the form available at https://hpce.iitm.ac.in and give a proper justification for using the HPC cluster. If the justification is valid, they will be given an account on AQUA.

Preference will be given to Researchers/Students/Faculty of IITM.

ACCESSING A USER ACCOUNT:

A user having an account on AQUA can access the cluster with the command:

**ssh -X <username>@aqua.iitm.ac.in**

where “-X” is an optional flag for forwarding graphics (X11) over ssh.

Users are strictly advised against sharing their access details with other users. The scheduler reserves available compute nodes and other resources on a first-come-first-served basis among users with equal priority, but this does not apply to system administration and testing of the machine.

COMPUTING RESOURCE USAGE POLICY:

The HPC Cluster “AQUA” has two master nodes, one login node and 272 compute nodes (CPU and GPU). All the compute nodes are of the same configuration.

1. All computing resources including CPUs, RAM, disk space, internet bandwidth and networking are to be used for research purposes only.
2. Users are responsible for using HPC clusters, resources and facilities in an efficient, effective, ethical and lawful manner.
3. Each compute node has 40 cores and 192 GB of memory.
4. Each compute node has a scratch space (/scratch).
5. Files and directories in /scratch would be automatically deleted after one week.
6. Access to the HPC cluster is to be via a secure communication channel (e.g. ssh) to the login node. Compute nodes are intended for handling heavy computational work, and must be accessed via the resource management system (PBS) only.
7. A user can log in to their account on the HPC cluster using the command ssh <username>@aqua.iitm.ac.in or ssh -X <username>@aqua.iitm.ac.in, where the “-X” flag is used to export the display over secure shell.
8. Do not run jobs on the login nodes. The login nodes are provided for people to access the HPC clusters; they are intended for setting up and submitting jobs, accessing results from jobs, and transferring data to/from the cluster.
9. Each individual user is assigned a standard storage allocation or quota on /home. Each user on AQUA has a default quota of 50GB on the home directory.
10. Users who use more than their allocated space may not be able to submit jobs from their home directory until they clean up and reduce their usage. They can also request additional storage with proper justification, which may be allocated subject to the availability of space.
11. All computational jobs run on the cluster systems must be submitted via the resource management system (PBS). This enables resources to be sensibly allocated and most efficiently used. The job submission scripts are available at: https://hpce.iitm.ac.in/
12. Users are advised to use /scratch on the compute nodes for storing intermediate results: writing results to the home directory while jobs are running can easily fill it up and may lead to termination of the job. Specify the write location in the job submission script or in the executable wherever possible, and copy the final result to the home directory at the end of the job submission script (see the sketch after this list).
13. There is no system backup of data from /scratch or from any other partition except the home directory (/home). It is the user's responsibility to back up, on a regular basis, any data not stored in the home directory. System staff cannot recover data from any location other than /home, including files lost to system crashes or hardware failure, so it is important to make copies of your important data regularly.
14. Users are requested to report any weaknesses in computer security and incidents of possible misuse or violation of the account policies to the HPC administrators, or to write regarding the same to hpce@iitm.ac.in.
15. User accounts may not be shared.
16. No proxy work: you may not submit jobs on behalf of someone else.
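
Point 12 above recommends writing intermediate results to /scratch and copying only the final output back to /home at the end of the job script. The sketch below shows that pattern, assuming a per-user /scratch/$USER layout; the layout, queue name, and file names are assumptions.

```bash
#!/bin/bash
#PBS -q medium
#PBS -l select=1:ncpus=40
#PBS -l walltime=48:00:00

# Stage heavy intermediate I/O in scratch so it does not fill up /home.
# The /scratch/$USER layout, queue name, and file names are assumptions.
WORKDIR=/scratch/$USER/$PBS_JOBID
mkdir -p "$WORKDIR"
cd "$WORKDIR"

# Copy inputs from the submission directory and run the solver here
cp "$PBS_O_WORKDIR"/input.dat .
./my_solver input.dat

# Copy only the final result back to the home directory, then clean up scratch
cp results.dat "$PBS_O_WORKDIR"/
rm -rf "$WORKDIR"
```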

Any processes that may create performance or load issues on the head node or interfere with other users’ jobs will be terminated.

Serial queue: Any job requesting a single core will be routed to the serial queue for execution.

Parallel queue: A job requesting more than one core will be steered to the parallel queue.
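
In other words, the routing depends only on the number of cores a job requests. As a hedged illustration (the exact resource-request syntax should be taken from the official job scripts):

```bash
# Requests a single core: such a job is routed to the serial queue
#PBS -l select=1:ncpus=1

# Requests more than one core: such a job is routed to a parallel queue
#PBS -l select=1:ncpus=16
```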

SCRATCH SPACE USAGE POLICY:

Users are provided a scratch space (/scratch) to store their intermediate results. This temporary disk space is provided for running programs instead of using the home directory or project space, which can improve performance depending on the volume of the data set. Any job that causes a heavy load on our PFS (parallel file system) server will be terminated without advance notice. There is no system backup for data in /scratch; it is the user’s responsibility to back up data. We cannot recover any data in /scratch, including files lost to system crashes or hardware failure, so it is important to make copies of your important data regularly.

Administrators reserve the right to clean up the scratch space at any time if it is needed to improve system performance. Any file belonging to any user that has resided in the scratch space for more than two months will be cleared/deleted by the administrator without prior notice. Users are therefore strictly advised to clean up their intermediate files/results in /scratch as soon as their jobs complete. Users are strictly advised against putting important source code, scripts, libraries, or executables in /scratch; these important files should be stored in /home. Do not create soft links in /home pointing to folders in /scratch as a way of accessing /scratch.
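
As a hedged example of keeping /scratch clean, the commands below list files that have not been modified for more than a week and archive a finished run back to /home before deleting it; the /scratch/$USER path and the run directory name are assumptions.

```bash
# List your scratch files that have not been modified for more than 7 days
# (the /scratch/$USER layout is an assumption; adjust to the actual path).
find /scratch/$USER -type f -mtime +7 -ls

# Archive a completed run to /home, then remove it from scratch
mkdir -p $HOME/backups
tar -czf $HOME/backups/run42.tar.gz /scratch/$USER/run42 && rm -rf /scratch/$USER/run42
```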

File Deletion Policy

The scratch service is temporary storage and is not backed up. Data stored on this service is not recoverable if it is lost for any reason, including user error or hardware failure. Data that has not been accessed for a week or more will be removed from the system. Any user found violating this policy will be contacted; further violations may result in the account being locked.

Renewal Policy

User accounts are created for a period of one year, after which they have to be renewed. Account expiration notification e-mails will be sent 30 days prior to expiration, and reminders will also be sent in the last week of the validity period. You can use the https://hpce.iitm.ac.in site to renew your account.

Please note that when an account expires, all files associated with the account in the home directory, and related files, will be deleted three months after the end of the validity period.

Shared Lustre File Systems Usage

This section focuses on ways to avoid causing problems on $HOME and $SCRATCH. The “File Systems” section gives a brief overview of the file systems in the cluster; “Configuring User Account” covers environment variables and aliases that help users navigate the file systems; and “File Operations: I/O Performance” addresses optimization and parallel I/O.

  • Don’t run jobs in $HOME. The $HOME file system is for routine file management, not parallel jobs.

  • Don’t get greedy. If you know or suspect your workflow is I/O intensive, don’t submit a pile of simultaneous jobs. Writing restart/snapshot files can stress the file system; avoid doing so too frequently.

  • Watch your file system quotas. If you’re near your quota in $SCRATCH and your job is repeatedly trying (and failing) to write to $SCRATCH, you will stress the file system. If you’re near your quota in $HOME, jobs run on any file system may fail, because your job may write some data to the $HOME/tmp directory. (A quota-check sketch follows this list.)

  • Avoid opening and closing files repeatedly in tight loops. Every open/close operation involves the metadata server (MDS), which is a potential point of congestion. If possible, open files once at the beginning of your program/workflow, then close them at the end.

  • Avoid storing many small files in a single directory, and avoid workflows that require many small files. A few hundred files in a single directory is probably fine; tens of thousands is almost certainly too many. If you must use many small files, group them in separate directories of manageable size.
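
Since $HOME and $SCRATCH are Lustre file systems, current usage against quota can usually be checked with the standard lfs tool; the mount points and directory name below are assumptions.

```bash
# Check usage and quota on the home and scratch file systems
# (the /home and /scratch mount points are assumptions; adjust as needed).
lfs quota -h -u $USER /home
lfs quota -h -u $USER /scratch

# Count the files in a directory before building a many-small-files workflow
ls -1 /scratch/$USER/my_run | wc -l
```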

Data Transfer

  • Avoid too many simultaneous file transfers. You share the network bandwidth with other users; don’t use more than your fair share. Two or three concurrent scp sessions is probably fine. Twenty is probably not.
  • Avoid recursive file transfers, especially those involving many small files. Create a tar archive before transferring (see the sketch below).
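
A hedged sketch of the recommended pattern: bundle many small files into one archive on the cluster, then move that single file in one transfer session. The remote host and destination path are hypothetical.

```bash
# Bundle a results directory into a single compressed archive
tar -czf results.tar.gz results/

# Transfer one large file instead of many small files
scp results.tar.gz user@my-workstation.example.org:/data/backups/

# Or use rsync, which can resume an interrupted transfer
rsync -av --partial results.tar.gz user@my-workstation.example.org:/data/backups/
```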

JOB SCHEDULING POLICY:

Below is the initial set of job scheduling policy decisions. HPCE would closely monitor the effectiveness of these values on users’ jobs and would finalize the values in a couple of months.

Queues

Instead of partitioning the cluster’s compute power into many groups, we would like to keep the number of groups small and have a larger portion of the compute power available to users. Towards this, we have three groups, SMALL, MEDIUM, and LARGE, containing five queues. The rationale is to encourage more parallel jobs, keep scheduling decisions simple, and give a fair share to all users.

| Group  | No. of Nodes | Queue   | No. of Cores | Duration           |
|--------|--------------|---------|--------------|--------------------|
| SMALL  | 30           | small8  | 1–8          | 2 hours            |
|        |              | small40 | 1–40         | 24 hours           |
| MEDIUM | 130          | medium  | 21–40        | 48 hours           |
|        |              | long    | 21–40        | 48 hours – 1 week  |
| LARGE  | 100          | large   | 41–520       | 48 hours           |

The current policy allows small+short jobs, large+short jobs, and small+long jobs, but disallows large+long jobs, to keep the cluster usable by multiple users and to prevent it from being monopolized by a single user. Towards this, to give all users a fair chance, HPCE would consider restricting the number of jobs per user after a month.

All the GPU nodes (12 in number) would be in the same queue, separate from the CPU queue.

The queue scheduling policy for both the CPU and the GPU nodes would be FIFO and Fair Share (which are quite standard and also used by Virgo).

Separate queues would be created for departmental servers which are part of the cluster. Those would be accessible to only those users as approved by the corresponding faculty member. However, the general pool of compute nodes above would be accessible to all HPCE users.

Restriction Per User

| Group      | Restriction      | Limit  |
|------------|------------------|--------|
| small grp  | Running jobs     | max 10 |
|            | Queued jobs      | max 10 |
|            | Core restriction | max 80 |
| medium grp | Running jobs     | max 10 |
|            | Queued jobs      | max 10 |
|            | Core restriction | max 80 |
| large grp  | Running jobs     | max 10 |
|            | Core restriction | max 200 |
| gpu grp    | Running jobs     | max 4  |

EMAIL POLICY:

The e-mail address provided by the user in the account request form will be automatically subscribed to the HPCE Users system mailing list (hpceusers@lists.iitm.ac.in) for important system announcements. It is the responsibility of the user to ensure that a valid IITM e-mail address is provided.

Building Open Source Software in User Account

Users are welcome to download third-party research software and install it in their own account. In most cases you’ll want to download the source code and build the software compatible with the AQUA software environment. You can’t use yum or any other installation process that requires elevated privileges, but this is almost never necessary. The key is to specify an installation directory for which you have write permissions.
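
As a minimal sketch of such a user-space build, the example below installs a hypothetical autotools package under $HOME; the package name, version, and directory layout are assumptions (CMake projects can use -DCMAKE_INSTALL_PREFIX in the same way).

```bash
# Build and install a (hypothetical) package entirely inside $HOME,
# so no elevated privileges are needed.
mkdir -p $HOME/software
tar -xzf mytool-1.0.tar.gz
cd mytool-1.0
./configure --prefix=$HOME/software/mytool-1.0
make -j4
make install

# Make the installation visible to your shell and job scripts
# (add these lines to ~/.bashrc to make them permanent).
export PATH=$HOME/software/mytool-1.0/bin:$PATH
export LD_LIBRARY_PATH=$HOME/software/mytool-1.0/lib:$LD_LIBRARY_PATH
```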

MAINTAINING ACCESS:

The most important way each user sustains the system is by demonstrating that it contributes to ongoing research.

REQUEST FROM HPCE:

Please acknowledge the use of the HPCE resources in papers and presentations. A recommended acknowledgement is “The computational results reported in this work were performed on the AQUA Cluster at the High Performance Computing Environment of IIT Madras”.

POLICY VIOLATIONS:

If it has been determined that any user has violated any of the HPC resource policies, or any other computing policies, then the user will be liable for strict disciplinary action by the Cluster Committee.

GETTING HELP:

If you have any questions or concerns regarding any of these policies, please send an email to hpce@iitm.ac.in.

Note: This policy would be updated as the new policies are decided by the Cluster Committee.