Q: When I request an interactive queue/session, I get an error. How can I get past this?
Interactive queue/session
srun --pty --x11 -t 24:00:00 -n 1 -p ilg2.3 bash -i
Error encountered
srun: error: run_command: xauth poll timeout @ 100 msec
srun: error: Problem running xauth command. Cannot use X11 forwarding
Q: Which partitions do I have access to?
sacctmgr -n -p show associations user=$USER | awk -F\| '{print $4}' | sort -u
OR
sinfo -h | awk '{print $1}' | sort -u
OR
sinfo -h | cut -d' ' -f1 | sort -u
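Note that sinfo marks the default partition with a trailing asterisk, so the last two commands can print a name such as normal* rather than normal. A minimal sketch of a filter that strips the marker follows; the function name partition_names and the partition names in the comment are hypothetical, not part of Slurm or of this cluster.

```shell
# Hypothetical helper: reads sinfo-style lines on stdin and prints the
# unique partition names without the trailing "*" default marker.
partition_names() {
    awk '{ sub(/\*$/, "", $1); print $1 }' | sort -u
}

# Intended use on the cluster (not run here):
#   sinfo -h | partition_names
# A line starting with "normal*" would be listed as "normal".
```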
Q: What are the attributes/constraints of partitions I have access to?
A: On the new subcluster, everyone has access to every partition except the private ones (atlas, titanx, moore, randerson and wodarz). Partition attributes and limits can be listed with
scontrol show partition
Q: My job got killed for some reason. Any tips on tracking this down?
The following command helps track down possible causes, where <jobid> is the ID of your job
sacct -j <jobid> -o "JobIDRaw%15,JobName%15,NCPUS%5,NNodes%6,NTasks%6,MaxRSS,MaxRSSNode,ReqMem,AllocTRES%25,TotalCPU,Elapsed,Timelimit,ExitCode"
Another possibility
sacct -j <jobid> -o JobID,JobName%30,Partition,User,Account,AllocCPUS,State%20,ExitCode
Available format fields can be listed with "sacct -e"
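The ExitCode column printed by sacct has the form status:signal, where the first number is the exit status and the second is the signal that terminated the job, if any. The sketch below turns such a value into a short message; explain_exit_code is a hypothetical helper name, and the three sample values are made up for illustration.

```shell
# Hypothetical helper: split a sacct ExitCode value ("status:signal")
# into a readable message.
explain_exit_code() {
    local status=${1%%:*}   # part before the colon: exit status
    local signal=${1##*:}   # part after the colon: terminating signal
    if [ "$signal" -ne 0 ]; then
        echo "killed by signal $signal"
    elif [ "$status" -ne 0 ]; then
        echo "exited with status $status"
    else
        echo "completed normally"
    fi
}

explain_exit_code "0:9"   # signal 9 is SIGKILL
explain_exit_code "1:0"
explain_exit_code "0:0"
```

A non-zero signal usually means the job was stopped from outside (for example by the scheduler), while a non-zero status with signal 0 points at the program itself.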
Q: Is there a way to get squeue to show output like the qstat I am used to?
squeue -o "%15i %.25j %.8u %.10M %.2t %19P"
Alternatively, you can make this the default view for squeue via
export SQUEUE_FORMAT="%15i %.25j %.8u %.10M %.2t %19P"
#SBATCH --output=jobname-%j.out
#SBATCH --error=jobname-%j.err
where jobname is a meaningful identifier unique to the job; Slurm replaces %j with the job ID
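For context, the two directives above belong at the top of a batch script. The following sketch writes a minimal one to disk; the file name myjob.sh, the job name myjob and the echo line are placeholders chosen for illustration.

```shell
# Write a hypothetical job script; "myjob" and the echo line are placeholders.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err

echo "Job $SLURM_JOB_ID running on $(hostname)"
EOF

# Submit on the cluster with: sbatch myjob.sh
```

With these settings, standard output and standard error of each submission go to separate files named after the job ID, so repeated runs do not overwrite each other.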
All jobs in a date range
sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12
Completed only
sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start%15,End,State,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12 | egrep -v "batch|extern" | grep COMPLETED
Failed only
sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12 | egrep -v "batch|extern" | grep FAILED
Timeout only
sacct -u $USER --format=JobID%15,JobName%25,MaxRSS,Elapsed%15,Start,End,State%15,ExitCode,DerivedExitCode -S 2019-01-01 -E 2019-01-12 | egrep -v "batch|extern" | grep TIMEOUT
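Instead of running one filter per state, the states can be tallied in a single pass. This is a sketch only: count_states is a hypothetical helper, and it assumes sacct is called with -X (job allocations only, which drops the batch and extern steps), -n (no header) and -P (pipe-delimited output) so that each input line looks like JobID|State.

```shell
# Hypothetical helper: reads "JobID|State" lines on stdin and prints
# the number of jobs per state. Only the first word of the state is
# used, so "CANCELLED by <uid>" is counted as CANCELLED.
count_states() {
    awk -F'|' '{ split($2, s, " "); n[s[1]]++ }
               END { for (k in n) print k, n[k] }' | sort
}

# Intended use on the cluster (not run here):
#   sacct -u $USER -X -n -P --format=JobID,State -S 2019-01-01 -E 2019-01-12 | count_states
```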