Scripts to help make PBSPro useful to ALCF users.
Queries various status fields across all Aurora nodes, or all nodes in a specified "partition." The partitions are associated with general or specialized queues. Now that Aurora is in production, almost all nodes are in one partition, still named lustre_scaling for historical reasons. The prod routing queue, prod-large queue, and various debug* queues are in the lustre_scaling partition. If no partition is specified on the pu_nodeStat command line, it reports for all partitions.
Results are summarized in a table with each row showing the number of nodes in a specific status (free, down, etc.).
Currently Aurora-only.
Produces a table of reservations with a more useful set of information than the default PBS commands provide. Results are displayed in a table with user-friendly column headings such as "Reservation", "queue", "nodes", "start", and "end".
pbsn filters and categorizes PBS nodes based on their states and attributes.
pbsn [-h] [-q AT_QUEUE]
-h: Display help information.
-q AT_QUEUE: Filter nodes by resources_available.at_queue.
- Filter by queue lustre_scaling: pbsn -q lustre_scaling
- Process some nodes with your own filter: pbsnodes -a | your_filter_script | pbsn
- Summary: Total nodes and counts by state (e.g., free, down, reserved).
- Details: Node-specific attributes (state, broken, validation, comment).
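The filtering and per-state summary described above can be sketched in Python. This is a minimal illustration, not pbsn's actual implementation. It assumes node data shaped like the JSON emitted by pbsnodes -a -F json (a top-level "nodes" mapping) and that resources_available.at_queue may hold a comma-separated list. The function name summarize_nodes and the node names in the sample are hypothetical.

```python
from collections import Counter


def summarize_nodes(pbsnodes_json, at_queue=None):
    """Count nodes by state, optionally keeping only nodes whose
    resources_available.at_queue includes `at_queue`."""
    counts = Counter()
    for attrs in pbsnodes_json.get("nodes", {}).values():
        # Assumption: at_queue is a string, possibly comma-separated.
        queues = str(attrs.get("resources_available", {}).get("at_queue", ""))
        if at_queue is not None and at_queue not in queues.split(","):
            continue
        counts[attrs.get("state", "unknown")] += 1
    return counts


# Hypothetical sample shaped like `pbsnodes -a -F json` output.
sample = {
    "nodes": {
        "x4305c0s0b0n0": {"state": "free", "resources_available": {"at_queue": "lustre_scaling"}},
        "x4305c0s1b0n0": {"state": "down", "resources_available": {"at_queue": "lustre_scaling"}},
        "x4306c0s0b0n0": {"state": "free", "resources_available": {"at_queue": "other"}},
    }
}

counts = summarize_nodes(sample, at_queue="lustre_scaling")
print("Total nodes:", sum(counts.values()))
for state, n in sorted(counts.items()):
    print(f"{state}: {n}")
```

Calling summarize_nodes(sample) without at_queue counts every node, which mirrors running the tool with no queue filter.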
pbsq filters, sorts, and formats PBS qstat job data.
pbsq [-h] [-f FILTER] [-s HEADER1[,OPT]:HEADER2[,OPT]:..] [-H HEADER1:HEADER2:...]
-h: Show help.
-f FILTER: Show only jobs whose displayed line matches FILTER, a regular expression supported by awk(1).
-s HEADER1[,OPT]:HEADER2[,OPT]:..: Sort by headers (e.g., TimeRemaining,r for reverse). OPT is one or more single-letter ordering options supported by sort(1). Headers not displayed are ignored.
-H HEADER1:HEADER2:...: Display selected headers (e.g., JobId:User:State).
All headers: JobId, User, Account, Score, WallTime, QueuedTime, EstStart, RunTime, TimeRemaining, Nodes, State, Queue, JobName, Location/Comments, WorkDir.
Note: view the displayed table with less -S.
- Show jobs on a rack: pbsq -f x4305 | less -S
- Sort by queued time only: pbsq -s QueuedTime,r | less -S
- Select columns: pbsq -H JobId:State:TimeRemaining:Nodes
- Show past jobs by a user: qstat -xfwu username | pbsq | less -S
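To make the -s sort specification concrete, here is a rough Python sketch of how a string such as User:Nodes,nr could be parsed and applied. It is not pbsq's implementation. The function names are hypothetical, and only the r (reverse) and n (numeric) options of sort(1) are handled as an assumption about the most common cases.

```python
def parse_sort_spec(spec):
    """Split 'H1[,OPT]:H2[,OPT]' into a list of (header, opts) pairs."""
    keys = []
    for part in spec.split(":"):
        header, _, opts = part.partition(",")
        keys.append((header, opts))
    return keys


def sort_rows(rows, spec, displayed):
    """Sort dict rows by the spec; headers that are not displayed are ignored."""
    keys = [(h, o) for h, o in parse_sort_spec(spec) if h in displayed]
    # Apply keys from last to first. Python's sort is stable, so the
    # first key in the spec ends up as the primary ordering.
    for header, opts in reversed(keys):
        numeric = "n" in opts
        rows = sorted(
            rows,
            key=lambda r: float(r[header]) if numeric else str(r[header]),
            reverse="r" in opts,
        )
    return rows


# Hypothetical job rows using header names from the list above.
rows = [
    {"JobId": "1", "User": "alice", "Nodes": "10"},
    {"JobId": "2", "User": "bob", "Nodes": "256"},
    {"JobId": "3", "User": "alice", "Nodes": "2"},
]

by_user = sort_rows(rows, "User:Nodes,nr", displayed={"JobId", "User", "Nodes"})
print([r["JobId"] for r in by_user])
```

Within each user, jobs are ordered by node count from largest to smallest, because the second key carries both n and r.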
The user will need an environment that provides a recent version of Python and contains the following requirements:
- tabulate
- pandas
- numpy
- chardet
This module provides interfaces for running PBS commands and retrieving their output in JSON format. It is used to build the utility scripts below.
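As an illustration of the general pattern, a wrapper of this kind might look like the sketch below. It is not the module's actual API: the name run_pbs_json and the runner parameter are hypothetical, and the sample assumes that qstat -f -F json returns an object with a top-level "Jobs" key. A stub runner stands in for PBS so the example runs anywhere.

```python
import json
import subprocess
from types import SimpleNamespace


def run_pbs_json(cmd, runner=subprocess.run):
    """Run a PBS command that emits JSON and return the parsed result."""
    result = runner(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def fake_runner(cmd, **kwargs):
    """Stand-in for subprocess.run that returns canned JSON (hypothetical data)."""
    return SimpleNamespace(stdout='{"Jobs": {"123.host": {"job_state": "Q"}}}')


# On a system with PBS installed, omit `runner` to call the real command.
jobs = run_pbs_json(["qstat", "-f", "-F", "json"], runner=fake_runner)
print(jobs["Jobs"]["123.host"]["job_state"])
```

Passing the runner as a parameter keeps the parsing logic testable on machines without PBS.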
Run this script to print a summary of current node status. For example:
+----------------+-------+
| Node State     | Count |
+----------------+-------+
| free           |  1661 |
| in-use         |  8802 |
| offline        |   122 |
| in-reservation |    38 |
| Total nodes    | 10624 |
+----------------+-------+
Run this script to print a summary of the node-hours queued on the local system, sorted from largest to smallest. Results are organized by project and can also be organized by user. For example:
+---------------------+------------+-----------+
| project             | node_hours | job_count |
+---------------------+------------+-----------+
| Project-A           |     203333 |         5 |
| Project-B           |        602 |         4 |
| Project-C           |         44 |        11 |
| Project-D           |         32 |         1 |
| Project-E           |          4 |         1 |
| Project-F           |          1 |         1 |
+---------------------+------------+-----------+
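The aggregation behind a table like this can be sketched as follows. This is an assumption-laden illustration rather than the script's own code: it treats node-hours as node count times requested walltime, counts only jobs in state Q as queued, and reads the PBS job attributes project, job_state, Resource_List.nodect, and Resource_List.walltime. The function names and sample jobs are hypothetical.

```python
from collections import defaultdict


def walltime_hours(walltime):
    """Convert an 'HH:MM:SS' walltime string to hours."""
    h, m, s = (int(x) for x in walltime.split(":"))
    return h + m / 60 + s / 3600


def queued_node_hours(jobs, key="project"):
    """Sum queued node-hours and job counts per `key`, largest first."""
    totals = defaultdict(lambda: {"node_hours": 0.0, "job_count": 0})
    for job in jobs.values():
        if job.get("job_state") != "Q":
            continue  # assumption: only state Q counts as queued
        res = job.get("Resource_List", {})
        hours = int(res.get("nodect", 0)) * walltime_hours(res.get("walltime", "00:00:00"))
        entry = totals[job.get(key, "")]
        entry["node_hours"] += hours
        entry["job_count"] += 1
    return sorted(totals.items(), key=lambda kv: kv[1]["node_hours"], reverse=True)


# Hypothetical jobs keyed by job ID.
sample_jobs = {
    "1": {"job_state": "Q", "project": "Project-A", "Resource_List": {"nodect": 10, "walltime": "02:00:00"}},
    "2": {"job_state": "Q", "project": "Project-B", "Resource_List": {"nodect": 4, "walltime": "00:30:00"}},
    "3": {"job_state": "Q", "project": "Project-A", "Resource_List": {"nodect": 1, "walltime": "01:00:00"}},
    "4": {"job_state": "R", "project": "Project-B", "Resource_List": {"nodect": 8, "walltime": "01:00:00"}},
}

for project, stats in queued_node_hours(sample_jobs):
    print(project, stats["node_hours"], stats["job_count"])
```

Grouping by a different job attribute only requires changing the key argument.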
Run this script to print a summary of the queues and how many jobs, nodes, and node-hours are queued or running on each. For example:
+---------------+--------------+---------------+-------------------+--------------------+--------------+---------------+
| queue         | Queued Count | Running Count | Queued Node Hours | Running Node Hours | Queued Nodes | Running Nodes |
+---------------+--------------+---------------+-------------------+--------------------+--------------+---------------+
| R4674464      |            0 |             3 |                 0 |                336 |            0 |            36 |
| R4775834      |            1 |             0 |                 1 |                  0 |            1 |             0 |
| backfill-tiny |            1 |             5 |               768 |               7776 |          256 |          1296 |
| debug         |            0 |            12 |                 0 |                 14 |            0 |            16 |
| debug-scaling |            0 |             2 |                 0 |                  3 |            0 |             3 |
| gpu_hack_prio |            0 |             1 |                 0 |                  1 |            0 |             1 |
| intel_maint   |            4 |             5 |               602 |                208 |           19 |            10 |
| large         |            5 |             0 |            203333 |                  0 |        50000 |             0 |
| medium        |            0 |             1 |                 0 |              24576 |            0 |          2048 |
| nre-priority  |            0 |            11 |                 0 |                266 |            0 |           138 |
| prod          |           13 |             0 |                80 |                  0 |           28 |             0 |
| small         |            0 |             3 |                 0 |              18528 |            0 |          2056 |
| tiny          |            0 |            48 |                 0 |              18600 |            0 |          3766 |
| validation    |            0 |             1 |                 0 |                  2 |            0 |             2 |
| Totals        |           24 |            92 |            204784 |              70310 |        50304 |          9372 |
+---------------+--------------+---------------+-------------------+--------------------+--------------+---------------+
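A per-queue breakdown with a Totals row, as in the table above, could be computed along these lines. This sketch is not the script's implementation. It assumes simplified job records with hypothetical fields queue, state, nodes, and hours, and it ignores every state other than Q and R.

```python
from collections import defaultdict

COLUMNS = ("count", "node_hours", "nodes")
STATES = ("Q", "R")


def empty_row():
    """One zeroed cell of counters per tracked state."""
    return {s: dict.fromkeys(COLUMNS, 0) for s in STATES}


def queue_summary(jobs):
    """Aggregate jobs per queue, split by state, and append a Totals row."""
    table = defaultdict(empty_row)
    for job in jobs:
        if job["state"] not in STATES:
            continue  # this sketch skips held and other states
        cell = table[job["queue"]][job["state"]]
        cell["count"] += 1
        cell["node_hours"] += job["nodes"] * job["hours"]
        cell["nodes"] += job["nodes"]
    totals = empty_row()
    for row in table.values():
        for s in STATES:
            for c in COLUMNS:
                totals[s][c] += row[s][c]
    result = dict(sorted(table.items()))
    result["Totals"] = totals
    return result


# Hypothetical job records.
sample_jobs = [
    {"queue": "debug", "state": "R", "nodes": 2, "hours": 1},
    {"queue": "large", "state": "Q", "nodes": 10000, "hours": 6},
    {"queue": "debug", "state": "R", "nodes": 1, "hours": 1},
    {"queue": "prod", "state": "H", "nodes": 4, "hours": 2},
]

summary = queue_summary(sample_jobs)
for queue, row in summary.items():
    print(queue, row["Q"], row["R"])
```

Computing the Totals row from the per-queue rows, rather than from the raw jobs, keeps the two consistent by construction.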
Run this script to print a summary of the top jobs in the queue, ranked by score. Command-line flags are available to filter by job parameters. For example:
+------------+-----------+-------+--------------+-----------------+------------------+---------------+-------+-----------+
| Job ID     | User      | State | Queue        | Job Name        | Project          | Award Type    | Nodes |     Score |
+------------+-----------+-------+--------------+-----------------+------------------+---------------+-------+-----------+
| 4256058    | user1     | Q     | large        | apr-1-dbu       | QuantMatManufact | INCITE        | 10000 | 2086336.2 |
| 4256086    | user1     | Q     | large        | dec-1-dbu       | QuantMatManufact | INCITE        | 10000 | 2085696.7 |
| 4671183    | user2     | Q     | large        | submit.sh       | QuantMatManufact | INCITE        | 10000 | 1131868.5 |
| 4671184    | user2     | Q     | large        | submit.sh       | QuantMatManufact | INCITE        | 10000 | 1131867.4 |
| 4671185    | user2     | Q     | large        | submit.sh       | QuantMatManufact | INCITE        | 10000 | 1131863.6 |
| 4349862    | user3     | H     | alcf_kmd_val | STDIN           | Intel-Aurora     | Discretionary |     1 |    1049.1 |
| 4683812    | user4     | Q     | intel_maint  | STDIN           | Intel-Punchlist  | Discretionary |     1 |     246.6 |
| 4775256    | user5     | R     | medium       | nekRS_G0p1_5000 | RBC_Conv_2       | INCITE        |  2048 |      79.0 |
| 4775047[0] | user6     | Q     | prod         | tst             |                  | INCITE        |     1 |      56.8 |
| 4775047[1] | user6     | Q     | prod         | tst             |                  | INCITE        |     1 |      56.8 |
+------------+-----------+-------+--------------+-----------------+------------------+---------------+-------+-----------+