
varshith257
Contributor

@varshith257 varshith257 commented Jun 5, 2025

Details

  • Introduced a new rule for detecting when Slurm's accounting daemon (slurmdbd) loses connection to its MySQL database, impacting job scheduling.
  • Added relevant log entries to demonstrate connection issues.
  • Updated categories and tags to include HPC database problems and SLURM-related tags.

Test Environment

Reproducible test setup (maintainers invited): Slurm Cluster
Live CRE link: CRE Playground Link
Check here for Sample Logs

Sample Detected Patterns

[2024-01-15T14:22:10] error: mysql_real_connect failed: 2002 Can't connect to MySQL server
[2024-01-15T14:22:15] error: Processing last message from connection 42 (DB connection lost)
[2024-01-15T14:22:25] error: accounting_storage/slurmdbd: Unable to connect to database
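
For reviewers who want to check a log file by hand, the three sample patterns above can be approximated with a single `grep`. This is only a sketch: the function name `detect_slurmdbd_db_loss` is hypothetical, and the regular expression is an assumption derived from the sample lines, not the pattern used by the rule in this PR.

```shell
# Hypothetical helper (not part of this PR): print the lines of a slurmdbd log
# that resemble the three sample patterns. The regex below is an assumption
# built from the sample lines only.
detect_slurmdbd_db_loss() {
  grep -E 'error: (mysql_real_connect failed|Processing last message from connection [0-9]+ \(DB connection lost\)|accounting_storage/slurmdbd: Unable to connect to database)' "$1"
}
```

Usage would look like `detect_slurmdbd_db_loss /var/log/slurmdbd.log`. Following normal `grep` behavior, the exit status is non-zero when nothing matches.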

Reproduction Steps

# Start test environment
docker-compose up -d

# Simulate failure (MySQL crash)
docker stop mysql_slurm

# Monitor detection (logs update every 5s)
tail -f /var/log/slurmdbd.log
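
As an alternative to watching `tail -f` by eye, the monitoring step could be scripted roughly as follows. This is a sketch under assumptions: `wait_for_db_error` is a hypothetical helper, the matched substrings are taken from the sample patterns above, and the default 5-second interval mirrors the "logs update every 5s" note.

```shell
# Hypothetical helper (not part of this PR): poll a log file until a
# DB-connection error appears or the timeout expires.
# Arguments: log file, timeout in seconds (default 60), poll interval (default 5).
# Returns 0 when an error line is found, 1 on timeout.
wait_for_db_error() {
  log_file="$1"
  timeout="${2:-60}"
  interval="${3:-5}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Substrings are assumptions taken from the sample log lines.
    if grep -q -E 'error: .*(mysql_real_connect failed|DB connection lost|Unable to connect to database)' "$log_file" 2>/dev/null; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

For example, after `docker stop mysql_slurm`, running `wait_for_db_error /var/log/slurmdbd.log 120` would return 0 once one of the error lines shows up, or 1 if none appears within two minutes.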

Closes #43
/claim #43

@tonymeehan tonymeehan self-requested a review June 5, 2025 17:55
@tonymeehan tonymeehan merged commit 279a893 into prequel-dev:main Jun 5, 2025
2 checks passed
@varshith257 varshith257 deleted the cre-slurm branch June 5, 2025 17:57
Development

Successfully merging this pull request may close these issues.

[New Rule] Slurm: Reproduce A High-Severity Failure & Write a Detection Rule