Frequently Asked Questions

From UMass GHPCC User Wiki

How do I get an account on MGHPCC?

Request an account at: https://www.umassrc.org/hpc/

How do I run jobs on the cluster?

All jobs are submitted using the LSF scheduler's bsub command.
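
For example, a minimal submission might look like the following (the queue, limits, output file, and script name are placeholders to replace with your own):

bsub -q short -W 1:00 -R "rusage[mem=1024]" -o myjob.out ./myscript.sh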

How can I find out the status of my jobs and of the cluster?

You can run the bqueues and bhosts commands to see how many jobs are running and which nodes they are executing on; bjobs shows the status of your own jobs.
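
For example:

bjobs       # the status of your own jobs
bqueues     # a summary of each queue and how many jobs are running or pending in it
bhosts      # the state of each compute node and how many job slots are in use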

How can I see a list of resources my job is using?

Running bjobs -l will show the resources requested and information about the running job:

bjobs -l {jobid}

Why did my job exit with a failure, and how can I find out why?

If your job fails, you can run bjobs -dl {jobid} to see the report for the finished job:

bjobs -dl {jobid}

What do the error codes returned from an LSF job mean?

Exit Code 	Meaning
< 127 	Exit code from your script. Anything > 0 is an error; the exact reason will be application dependent.
127 	Command not found or wrong binary architecture. Check your submission script.
> 128 	Job was killed by a signal. Subtract 128 from the code to get the signal number. See "man 7 signal" for the list of signals.
130 	SIGINT (signal 2). LSF will send this if memory or runtime limits are exceeded.
139 	SIGSEGV (signal 11, core dump). The program died due to a segmentation violation.
140 	SIGUSR2 (signal 12). LSF will send this if memory or runtime limits are exceeded. (Sent shortly before SIGINT; you can catch this signal to allow your program to tidy up.)

I don't know Linux. Help?

Linux tutorials are a bit beyond the assistance we're able to provide. You should contact your local research computing or IT organization for assistance. Alternatively, there are a number of free online guides available.

My jobs fail when running on ghpcc-sgi. What's going on?

While this large SGI system with 512 cores runs the Linux operating system, it runs SuSE rather than Red Hat Linux. Many compiled applications will not have a problem, though some may fail with library errors since the two distributions ship slightly different library versions. At release time, only SuSE was supported by SGI, though it now appears that Red Hat is a supported option. While the GHPCC team works on this migration, you can avoid the SGI by including the following in your bsub command:

-m blades

This will ensure that your jobs only go to blade systems and not the SGI.
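
For example, a sketch of a submission that avoids the SGI (the queue, output file, and script name are placeholders):

bsub -q long -m blades -o myjob.out ./myscript.sh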

What kinds of jobs should I submit to MGHPCC?

Any research job that meets the following criteria:

  • Runs or compiles on Linux
  • Uses over 1GB of memory
  • Uses over 15 minutes of CPU time
  • Can be run in parallel (optional)

Jobs that run on Microsoft Windows or Apple OS X are not supported in any way. Fortunately, most commercial and free software used in research organizations is available for Linux.

There are lots of jobs in the queue. Should I wait until the line dies down?

NO! Submit your job as soon as you're ready. If you haven't used many CPU resources in the past few days, your fairshare score may be high enough to allow you to skip to the front of the line. Even if not, you'll still be in line and will have your job run. The only way your job will ever run is if it's submitted to the queue.

Jobs can be pending in the queue for a number of reasons. If a job requires a large amount of resources, it may remain pending for a long time while it waits for those resources to become available. Other jobs with different requirements may run instead.

Where should I submit my jobs?

This depends on the kind of job you have.

  • short: Jobs that use less than 4 hours of run time.
  • long: Jobs that use up to 30 days of run time. May be suspended by jobs in other queues.
  • interactive: Do you need a shell on a compute node or have a job that requires input while it runs? This is the queue for you.
  • gpu: For jobs that use CUDA to interact with GPUs. Nodes with GPUs on them do not run jobs from other queues.
  • The previous parallel queue has been removed.
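
As a rough sketch of how you might target each queue (the script names are placeholders; add your own resource options as usual):

bsub -q short ./myscript.sh       # less than 4 hours of run time
bsub -q long ./myscript.sh        # up to 30 days of run time
bsub -q interactive -Is bash      # an interactive shell on a compute node
bsub -q gpu ./my_cuda_job.sh      # jobs that use CUDA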

What is the difference between run time and CPU time?

Run time is the amount of time a job is in the RUN state (this may also be called wall time since it refers to the clock on the wall). This is the time that the job is actually running. Any time that your job is suspended or waiting to run does NOT count against your run time.

CPU time is the amount of processor time a job consumes. Time spent waiting on disk or network IO does not count against your CPU time, since the CPU is idle while it waits for the IO to complete. For jobs that request only a single job slot, the CPU time will be equal to or less than the run time. If you request more than one job slot, the CPU time limit is set to the run time multiplied by the number of cores requested. So if you request an hour of run time and four job slots, your resulting CPU time limit is four hours. This is important because if you request four job slots but your job really uses eight, it will consume four hours of CPU time in about 30 minutes of run time and be terminated for exceeding the CPU time limit. It's very important to make sure you use only the number of cores you request.
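
As a worked example (the script name is a placeholder), requesting four job slots and one hour of run time gives a CPU time budget of 4 x 1 hour = 4 hours:

bsub -n 4 -W 1:00 -R "span[hosts=1]" ./my_parallel_job.sh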

Is it better to run a job requesting 1000 cores for 12 hours, or 1000 jobs each requesting 1 core for 12 hours?

Requesting a large number of cores for a single job can increase the amount of time your job is pending waiting to start. By splitting your job into smaller chunks that can run and are queued independently, the odds of your jobs starting sooner increase.

Why has my job not started? (Why is my job stuck in the PEND state?)

If your job remains PENDING, your memory or CPU requirements may be too high for it to be dispatched. If you have used a lot of CPU time recently, your fairshare score may be low, which means your jobs are less likely to be dispatched when the cluster is otherwise full. It's best to leave the job in the queue; as your fairshare score rises and the scores of users with running jobs drop, your jobs will be dispatched over time.

When submitting jobs that require multiple cores, remember that the memory requested is multiplied by the number of cores. If your resource requirements look like this:

-n 4 -R "rusage[mem=32768]" -R "hosts[span=1]"

This really requests 32GB*4 = 128GB on a single node. If you really only need 32GB, then request 32GB/4=8GB:

-n 4 -R "rusage[mem=8192]" -R "hosts[span=1]"

Another reason jobs may be stuck in PENDING is that you are asking for too many CPUs on a single system. If two 32-core compute nodes each have 16 job slots in use by other jobs, there are only 16 cores available on each system, so a job requesting 24 cores on a single system will not start. This becomes a trade-off for you: request more job slots so the job spends less time in the RUN state but possibly more time in PEND, or request fewer job slots so the job spends longer in the RUN state but may start sooner.

Finally, the cluster might just be very busy. Even if your resource requirements are modest, if there are a lot of jobs in the other queues, there are not many resources left over for jobs in the long queue. LSF will make sure your job gets the amount of run time you request, but it cannot make guarantees about when the job will dispatch.
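
To see the scheduler's own explanation for why a job is still pending, you can use the standard LSF commands (replace {jobid} with your job's ID):

bjobs -p            # list your pending jobs along with the reason they are pending
bjobs -l {jobid}    # detailed view of one job, including pending reasons and requested resources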

Why do I need to specify memory and time limits for my job?

In order for jobs to be accurately dispatched to the correct nodes, the queueing system needs to know what resources each job needs. Parallel jobs that require a large number of cores can start faster when LSF knows when running jobs are expected to finish, since it can reserve job slots for them in advance.

Memory reservations have a more direct impact on jobs. If a job goes awry or takes more memory than expected, it can cause the entire node to become unavailable or reboot. In the early days of HPC clusters, this usually meant only one other job was impacted; on more modern systems, 63 or more jobs can be impacted. By listing your memory requirements, you ensure that your job is sent to a node with the correct amount of memory and that the memory will be available for the entire time your job runs.
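
For example, a job that needs 4GB of memory and 12 hours of run time might be submitted like this (memory is given in megabytes, as in the other examples on this page; the output file and script name are placeholders):

bsub -W 12:00 -R "rusage[mem=4096]" -o myjob.out ./myscript.sh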

My job is in the UNKNOWN state and I can't kill it. What do I do?

This is usually because the node the job is running on has lost communication with the LSF master server. If this is a transient node or network issue it will clear itself and your job will continue to run. In the event the node has crashed, you will need to wait for someone from the MGHPCC staff to reboot the node. Your job will be terminated after the node restarts.

What is the difference between EXIT and DONE?

DONE means that, as far as LSF knows, the job completed successfully and ended with an exit status of 0. EXIT means that something else happened: LSF terminated the job for exceeding memory or CPU limits, the application returned an exit code other than 0, or you killed the job for some reason. The job report that is sent to you contains information about why a job is in EXIT rather than DONE.

How much does this cost?

MGHPCC shares costs for hardware, rack space, electricity, network, and support staff between the member campuses. How those costs are broken down is on a per-campus basis, so please consult with your local IT or finance staff for more information. Even if you are charged, the resulting cost to you will be less than building your own cluster or even building one in the cloud, and you have access to a set of experts in Linux and Scientific Computing to help you out.

How can I get more disk space?

Each campus has an allotment of disk space assigned to it. Most campuses start with 22TB and your home directories and project space come out of that location. If you have a need for more disk space and the amount for your campus is insufficient, you will need to work with your local IT staff to have more storage added for your campus.

Why did my job die?

There are a few different reasons why this can happen. Please look at the job report that is returned to you for more information.

  • Your job exceeded its memory limit. You may need to increase the amount of memory your job requests.
  • Your job exceeded its wall time or CPU time limits.
  • Your application crashed or died while running. The STDOUT or STDERR files or your job report should contain information from your application with more information.
  • The node your job was running on crashed. We work very hard to make sure that nodes remain available all the time, but occasionally a node will experience a problem and the jobs on it may end. We make an effort to disable nodes if we detect a failure, but please feel free to contact us if you see very erratic behavior.

I'm trying to use bpeek to watch my job output and I'm not seeing anything

LSF doesn't force a flush of standard output (STDOUT), so applications may buffer their output, sometimes until the job finishes. You may need to modify your code to force a flush after each write. In Python, you can add the following to your code with no other modifications:

import sys
import os
# Flush STDOUT continuously
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
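
Alternatively, you can run your script with python -u to avoid buffering without modifying the code; under Python 3, where unbuffered text streams are not allowed, pass flush=True to print() instead.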

I can't log into my account. What's wrong?

There are a number of things you can try to figure out where the problem is.

If your ssh client says it cannot reach the server: Make sure you are on your local campus network or connected via VPN. Check whether you can reach http://wiki.umassrc.org; if you are reading this page, the wiki is reachable. Since the SSH server and the wiki run on the same physical machine, an unreachable wiki suggests a system-level problem affecting both.

If your password is invalid: Make sure you are using the correct account name. Remember that accounts on MGHPCC have the form of your initials, a two-digit number, and the abbreviation for your campus, e.g. awr15w. Please securely store the password you were provided when your account was created, and be sure to change it after you log in. We expect that in late 2015 we will switch to using your password from your local campus; we will provide additional information once this is available.

How do I use screen?

Screen is a utility which keeps your shell session active even when you are disconnected from a Linux server (e.g. ghpcc06).

As an example, to start a screen session we simply type:

$ screen

This will create a new screen session for you.

You can end this session by typing logout or exit, or you can keep it active using the key binding commands below.

Key binding commands: while in screen, type one of the following (where <ctrl> is the Control key).

To create a new window:
<ctrl> + a + c
To switch between windows (replace 0-9 with the window number):
<ctrl> + a + 0-9
To list all of the open windows:
<ctrl> + a + "

To keep a screen session running, but detach from it:
<ctrl> + a + d
To list all of the available commands:
<ctrl> + a + ?

To list the available screen sessions:

$ screen -ls
There is a screen on:
       59850.pts-18.ghpcc06    (Detached)
1 Socket in /var/run/screen/S-awr.

To re-attach to a screen session:

$ screen -r {pid}
$ screen -r 59850.pts-18.ghpcc06

To exit from a screen session:

$ exit

I need to install software. Can I get the root password or sudo access?

Unfortunately, no. As a shared resource, we need to keep the number of users with elevated access to a minimum to maintain security and stability. If there is software you would like installed, you can request that we install it, or you are free to install any software into your home directory.

I'm compiling something and it fails complaining "cannot find -lquadmath"

Chances are you're compiling with the base OS's gcc; try loading the gcc module (currently gcc/4.8.1) and retry.
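
For example, assuming the standard module command on the cluster:

module load gcc/4.8.1
gcc --version       # confirm the newer compiler is now the one in your PATH

Then re-run your build.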

What's the difference between slots, cores, and nodes? How many should I request? What do Max Processes and Max Threads have to do with slots?

For purposes of the cluster, a core and a slot are the same thing: what most people would casually refer to as a CPU (though a CPU is now generally a physical unit that contains multiple cores). A node is a single computer in the cluster used to run jobs. We have nodes that range from 8 cores to 64 cores, and a single host (ghpcc-sgi) which has 512. You can use the “bhosts” command to get a list of the nodes in the cluster and how many cores each has. The “lshosts” command is similar, but gives more detail, including how much memory each node has.

A ‘slot’ is actually a term specific to the scheduler: when you submit a job, the “-n” parameter tells it how many slots the job needs, and the scheduler assigns the job to a node that has at least that many free slots available. In our cluster one slot is always equal to one core (though it’s possible other clusters may be configured to have multiple slots per core). If you don’t specify the “-n” option, the scheduler automatically sets your job to use a single slot.

If you simply specify more than one slot for a job, the scheduler will find available slots wherever it can, which potentially means it might give you slots on multiple nodes. This works fine for applications using MPI or something similar, but most non-MPI applications that use more than one core at a time expect that all the cores are on the same node. To tell the scheduler that all the slots need to be on the same node you’d want to use the flag:

  • -R span[hosts=1]

in addition to the “-n” flag with the number of slots.
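
For example, a sketch of an eight-slot job constrained to a single node (the application name is a placeholder):

bsub -n 8 -R "span[hosts=1]" ./my_threaded_app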

The Max Processes and Max Threads reported by LSF when your job completes are related to the number of slots (cores) you’ll want to request. Each process, or thread, can potentially use one core or slot by itself. Parallel programs will start multiple processes, threads, or both, so they can run on different cores at the same time.

However, each thread or process doesn’t necessarily use up the entire core; the ones doing the actual processing will tend to use an entire core, but there can be multiple coordinating processes and threads that only use small amounts of cpu time, and thus don’t need a core all to themselves. Many parallel applications will have a parameter for specifying how many cores or cpus to use, and usually you’ll want your job to request the same number as is specified for that, possibly adding one more slot for the coordinating tasks.

Telling how many cores or slots your job is actually using can be a little tricky, as the best measure is to see how many cpu seconds the application has used in total divided by the number of seconds that it’s been running. (The bjobs -l command can tell you the number of cpu seconds used, and the bhist command can show you how many seconds a job has been running). We have a utility “/share/bin/jobcpucheck” which will do the calculations for you and tell you how many cores a running job is using compared to the number of slots requested for it.
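
For example, if bjobs -l reports that a job has used 3600 CPU seconds and bhist shows it has been in the RUN state for 900 seconds, it is averaging 3600 / 900 = 4 cores.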

This does mean that a job that uses, say, eight cores for the first 5 minutes and one core for the second 5 minutes will average 4.5 cores if viewed at that point, but you'd want to submit it requesting 8 slots to cover the maximum core usage.

If a program can’t be told to restrict itself to a certain number of cpus, and will always try to use all available cores on the node it runs on, things get messy in terms of trying to schedule the jobs. If your program falls into this category, contact us at hpcc-support@umassmed.edu and we can look at alternate ways of submitting so it won't cause problems.