FAQ

Q: When I request an interactive queue/session, I get an error. How can I get past this?

Interactive queue/session

srun --pty --x11 -t 24:00:00 -n 1 -p ilg2.3 bash -i

Error encountered

srun: error: run_command: xauth poll timeout @ 100 msec
srun: error: Problem running xauth command. Cannot use X11 forwarding
A:
1) On gplogin2/3, delete any leftover .Xauthority files: "rm ~/.Xauthority*"
2) Log out of gplogin2/3.
3) Log back in to gplogin2/3. This should regenerate your ~/.Xauthority file.
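
Step 1 above can be wrapped in a small shell function, as a minimal sketch. The name reset_xauthority is hypothetical and not part of any cluster tooling. It only removes the files; logging out and back in is still up to you.

```shell
# Hypothetical helper: remove leftover X authority files from the home directory.
reset_xauthority() {
    # -f keeps rm quiet when no matching files exist
    rm -f "$HOME"/.Xauthority*
}
```

Run reset_xauthority on gplogin2/3, then log out and log back in before retrying the srun command.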
 

Q: Which partitions do I have access to?

sacctmgr -n -p show associations user=$USER | awk -F\| '{print $4}' | sort -u

OR

sinfo -h | awk '{print $1}' | sort -u

OR

sinfo -h | cut -d' ' -f1 | sort -u
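
The commands above may print a trailing "*" on the default partition (sinfo) or blank lines where an association has no partition set (sacctmgr). The filter below is a sketch that tidies either output. The function name clean_partitions is hypothetical, and it assumes one partition name per input line.

```shell
# Hypothetical filter: normalize a list of partition names read from stdin.
clean_partitions() {
    # strip the "*" that marks the default partition, drop blank lines,
    # then sort and remove duplicates
    sed -e 's/\*$//' -e '/^[[:space:]]*$/d' | sort -u
}
# Usage: sinfo -h | awk '{print $1}' | clean_partitions
```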

Q: What are the attributes/constraints of partitions I have access to?

A: On the new subcluster, everyone has access to all partitions except the private ones (atlas, titanx, moore, randerson and wodarz). The following command lists the attributes of every partition:

scontrol show partition
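
The full scontrol output is long. If you only care about one attribute, the sketch below pulls a single key out of the one-line-per-partition form that "scontrol -o" prints. The function name partition_field is hypothetical, and it assumes the usual space-separated Key=Value layout.

```shell
# Hypothetical filter: print "<partition> <value>" for one key,
# reading one-line scontrol records from stdin.
partition_field() {
    # $1 is the key to extract, e.g. MaxTime
    awk -v key="$1" '{
        name = ""; val = ""
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq == 0) continue
            k = substr($i, 1, eq - 1)   # text before the first "="
            v = substr($i, eq + 1)      # everything after it
            if (k == "PartitionName") name = v
            if (k == key) val = v
        }
        if (name != "") print name, val
    }'
}
# Usage: scontrol -o show partition | partition_field MaxTime
```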

Q: My job got killed for some reason. Any tips on tracking this down?

The following command helps track down possible causes, where <jobid> is your job's ID:

sacct -j <jobid> -o "JobIDRaw%15,JobName%15,NCPUS%5,NNodes%6,NTasks%6,MaxRSS,MaxRSSNode,ReqMem,AllocTRES%25,TotalCPU,Elapsed,Timelimit,ExitCode" 

Another possibility

sacct -j <jobid> -o JobID,JobName%30,Partition,User,Account,AllocCPUS,State%20,ExitCode

Available format fields can be listed with "sacct -e"
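
sacct prints the ExitCode field as two numbers separated by a colon: the exit status, then the signal that terminated the process, if any. For example, 0:9 means the job was killed by signal 9 (SIGKILL). The helper below is a sketch for reading that field; the function name decode_exitcode is hypothetical.

```shell
# Hypothetical helper: explain a sacct ExitCode value such as "0:9".
decode_exitcode() {
    code=${1%%:*}   # part before the colon: exit status
    sig=${1##*:}    # part after the colon: terminating signal
    if [ "$sig" -ne 0 ]; then
        echo "terminated by signal $sig"
    elif [ "$code" -ne 0 ]; then
        echo "exited with status $code"
    else
        echo "exited normally"
    fi
}
# Usage: decode_exitcode 0:9
```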

Q: Is there a way to get squeue to show output like the qstat I am used to?

squeue -o "%15i %.25j %.8u %.10M %.2t %19P"

Alternatively, you can make this the default view for squeue via

export SQUEUE_FORMAT="%15i %.25j %.8u %.10M %.2t %19P"
Credit for this format goes to NASA.
 
Q: What's an easy way to keep track of what is happening with a job?
 
It's probably a good idea to separate errors (stderr) from output (stdout):
#SBATCH --output=jobname-%j.out
#SBATCH --error=jobname-%j.err 

where jobname is some meaningful identifier unique for the job
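
To show where these directives sit in a job, the sketch below prints a minimal batch script for a given job name. The function name make_job_script and the two echo lines are placeholders, not a recommended template.

```shell
# Hypothetical helper: print a minimal batch script for the given job name.
make_job_script() {
    # $1 is the job name; %j is replaced with the job ID by Slurm
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=$1
#SBATCH --output=$1-%j.out
#SBATCH --error=$1-%j.err

echo "progress messages go to stdout"
echo "warnings and errors go to stderr" >&2
EOF
}
# Usage: make_job_script myjob > myjob.sh && sbatch myjob.sh
```

While the job runs, "tail -f" on the .out and .err files lets you follow both streams separately.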

Q: I can't find my jobs via squeue, or my interactive job was killed. How can I find out the status of all my jobs between two dates?
 
All jobs regardless of status

sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12

Completed only

sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12 | grep -Ev "batch|extern" | grep COMPLETED

Failed only

sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12 | grep -Ev "batch|extern" | grep FAILED

Timeout only

sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12 | grep -Ev "batch|extern" | grep TIMEOUT
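
The three filtered commands differ only in the state they match, so the filtering can be factored into one function. This is a sketch; the name jobs_in_state is hypothetical, and it assumes the batch and extern job steps should be hidden, as in the commands above.

```shell
# Hypothetical filter: keep sacct lines for top-level jobs in a given state.
jobs_in_state() {
    # $1 is a sacct state such as COMPLETED, FAILED or TIMEOUT
    grep -Ev 'batch|extern' | grep -w "$1"
}
# Usage:
#   sacct -u "$USER" --format=JobID%15,JobName%25,State%15,ExitCode \
#       -S 2019-01-01 -E 2019-01-12 | jobs_in_state TIMEOUT
```

sacct also accepts a --state option and an -X (--allocations) option, which may make the grep steps unnecessary; check "man sacct" on your system for how they behave in your Slurm version.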