The ‘large’ queue on the UMass shared GHPCC is now available to shared cluster users for job submission. It is specifically intended to address the difficulty of running large multi-node MPI jobs.
- The queue will eventually have access to 96 Intel nodes, each with 20 cores and 128 GB of memory.
- As of this writing, nine nodes have been transitioned and are available. We will be moving additional nodes over as we are able.
- All nodes available to the large queue have been configured with both InfiniBand and 10G Ethernet connectivity. They will be set up so TCP traffic is only sent over Ethernet, reserving InfiniBand exclusively for job communication.
- The nodes available to the large queue will also eventually be available for use by jobs from the short queue. Because the large queue will have a higher priority, short jobs should not delay large jobs by more than four hours. Large queue jobs will still need to wait for resources occupied by other large queue jobs. Short jobs will be enabled on these nodes after we are comfortable that large queue jobs are running successfully.
Initial restrictions for the queue have been set by the Faculty Advisory Committee and are subject to change as we see how well the queue works:
- Users can have many jobs pending in the queue at the same time, but no user will be able to run jobs on more than 320 cores (16 nodes) in the large queue simultaneously.
- Jobs in the large queue will have a maximum runtime of 48 hours (two days).
- Large queue jobs will automatically be set to exclusive mode; they will be granted all 20 cores on each node they use, and will not share nodes with other jobs.
- Large queue jobs requesting a number of cores not divisible by 20 will automatically have the request rounded up to the next multiple of 20.
- The ‘span’ resource requirement will automatically be set to ‘span[ptile=20]’ to ensure jobs use whole nodes.
- Jobs will still be able to request a range of cores, e.g. “bsub -q large -n 20,100”, but the minimum and maximum will each be rounded up to a multiple of 20.
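
To see ahead of time what a request will become under the rounding rules above, the arithmetic can be sketched in a few lines of shell. This is only an illustration of the stated rule (round up to the next multiple of 20 cores, one node per 20 cores). The function names round_up_cores and nodes_for_cores are hypothetical helpers, not part of LSF or of the cluster's configuration, and the scheduler's actual implementation may differ.

```shell
#!/bin/sh
# Cores per node on the large queue, as stated in the list above.
CORES_PER_NODE=20

# Hypothetical helper: round a requested core count up to the
# next multiple of CORES_PER_NODE (integer arithmetic only).
round_up_cores() {
    echo $(( ( $1 + CORES_PER_NODE - 1 ) / CORES_PER_NODE * CORES_PER_NODE ))
}

# Hypothetical helper: number of whole nodes a request occupies
# once it has been rounded up.
nodes_for_cores() {
    echo $(( $(round_up_cores "$1") / CORES_PER_NODE ))
}

round_up_cores 30     # prints 40
nodes_for_cores 30    # prints 2
round_up_cores 100    # prints 100 (already a multiple of 20)
```

Under the same assumption, a range request such as “bsub -q large -n 30,90” would be treated as “-n 40,100”, and the 320-core per-user limit corresponds to nodes_for_cores 320, i.e. 16 nodes.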