hyperpod-slurm-debugger

HyperPod Slurm Debugger

Diagnostic-only. Identify and classify Slurm scheduler and node-daemon issues on HyperPod Slurm clusters. Do not run, recommend, or print any state-mutating command. For remediation, link to the official AWS or Slurm documentation.

When to invoke

Invoke when the user reports any of the symptoms in the decision table.

When NOT to invoke

- Cluster has Orchestrator.Eks: invoke hyperpod-node-debugger or hyperpod-nccl.
- Single-node hardware fault with healthy Slurm scheduler: invoke hyperpod-node-debugger.
- NCCL training-hang investigation: invoke hyperpod-nccl.
- Node unreachable via SSM: invoke hyperpod-ssm.

Constraints

- Read-only. Do not run, recommend, or print…
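To make the "identify and classify" step concrete, here is a minimal sketch of how one `sinfo` row (state plus reason) might be sorted into a finding without touching cluster state. The function name `classify_row` and the category labels are hypothetical, chosen for illustration only. The matching rules loosely mirror the skill's script, which also lets a single node land in more than one category; this sketch stops at the first match for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill's script): classify a single
# sinfo row from its state and reason fields. Read-only: it only inspects
# the two strings it is given.
classify_row() {
  local state="$1" reason="$2"
  if grep -qi 'fail' <<< "$state"; then
    if [[ "$reason" =~ ^Action:(Reboot|Replace)$ ]]; then
      # Exact, case-sensitive match: the form HyperPod recovery expects.
      echo "recovery-in-progress"; return
    elif grep -qiE 'reboot|replace' <<< "$reason"; then
      # Looks like an Action:* reason but does not match exactly.
      echo "bad-action-reason"; return
    fi
  fi
  if grep -qiE 'down|drain' <<< "$state"; then
    if grep -qi 'unexpectedly rebooted' <<< "$reason"; then
      echo "unexpected-reboot"
    else
      echo "down-or-drain"
    fi
    return
  fi
  echo "no-finding"
}

classify_row "fail"  "Action:Reboot"
classify_row "fail"  "action: reboot"
classify_row "down*" "Node unexpectedly rebooted"
classify_row "idle"  ""
```

The exact-match branch comes first on purpose: a reason that is merely similar to `Action:Reboot` should be reported as a mismatch rather than assumed to be a recovery in progress.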

\\033[0;31m'; GREEN=$'\\033[0;32m'; YELLOW=$'\\033[1;33m'\n CYAN=$'\\033[0;36m'; BOLD=$'\\033[1m'; NC=$'
\\033[0m'\nelse\n RED=''; GREEN=''; YELLOW=''; CYAN=''; BOLD=''; NC=''\nfi\n\n# All status helpers use %s with the message as a separate arg — never embed message\n# text into the format string. Strip ANSI escape sequences from incoming server data\n# so a malicious or buggy upstream cannot rewrite the operator's terminal.\n_sanitize() {\n # Drop ANSI CSI sequences and bell, but leave printable UTF-8 alone.\n sed -e 's/\\x1b\\[[0-9;?]*[a-zA-Z]//g' -e 's/\\x07//g' -e 's/\\r$//' \u003c\u003c\u003c \"${1-}\"\n}\nsection() { printf '\\n%s%s=== %s ===%s\\n' \"$BOLD\" \"$CYAN\" \"$(_sanitize \"$1\")\" \"$NC\"; }\nok() { printf ' %s[PASS]%s %s\\n' \"$GREEN\" \"$NC\" \"$(_sanitize \"$1\")\"; }\nwarn() { printf ' %s[WARN]%s %s\\n' \"$YELLOW\" \"$NC\" \"$(_sanitize \"$1\")\"; }\nbad() { printf ' %s[FAIL]%s %s\\n' \"$RED\" \"$NC\" \"$(_sanitize \"$1\")\"; }\ninfo() { printf ' %s\\n' \"$(_sanitize \"$1\")\"; }\nhint() { printf ' %s[NEXT]%s %s\\n' \"$CYAN\" \"$NC\" \"$(_sanitize \"$1\")\"; }\n\nISSUES=()\nNEXT_STEPS=()\n\n# --- Verify cluster + orchestrator --------------------------------------------\nsection \"1. Cluster identity\"\nDESC=$(aws sagemaker describe-cluster --cluster-name \"$CLUSTER\" --region \"$REGION\" \\\n --output json 2>&1) || { bad \"cannot describe cluster: $DESC\"; exit 1; }\n\nORCH=$(jq -r '.Orchestrator // {} | keys[0] // \"Slurm\"' \u003c\u003c\u003c \"$DESC\")\nif [[ \"$ORCH\" == \"Eks\" ]]; then\n bad \"cluster uses EKS orchestrator - this skill is for Slurm only\"\n info \"use hyperpod-node-debugger or hyperpod-nccl instead\"\n exit 1\nfi\n\n# Managed Slurm vs self-managed Slurm:\n# - Managed: DescribeCluster.Orchestrator.Slurm is present AND the cluster was created\n# with the SlurmConfig API parameter — InstanceGroups[].SlurmConfig.NodeType identifies\n# controllers, login nodes, workers. AWS docs treat this as the authoritative source.\n# - Self-managed: anything else. 
The customer brought their own Slurm setup via the\n# lifecycle scripts and InstanceGroups[].SlurmConfig is empty. The controller-group\n# name lives in /opt/ml/config/provisioning_parameters.json on every node, or the\n# customer can pass --controller-group \u003cNAME>.\nHAS_SLURM_CONFIG=$(jq -r '\n any(.InstanceGroups[]?; (.SlurmConfig // {}) != {})\n' \u003c\u003c\u003c \"$DESC\")\nCLUSTER_NAME=$(jq -r '.ClusterName // \"unknown\"' \u003c\u003c\u003c \"$DESC\")\nCLUSTER_STATUS=$(jq -r '.ClusterStatus // \"unknown\"' \u003c\u003c\u003c \"$DESC\")\nif [[ \"$HAS_SLURM_CONFIG\" == \"true\" ]]; then\n ok \"Managed Slurm cluster: $CLUSTER_NAME status=$CLUSTER_STATUS\"\nelse\n ok \"Self-managed Slurm cluster: $CLUSTER_NAME status=$CLUSTER_STATUS\"\nfi\n\n# Cluster ID from ARN. Validate before it gets embedded into SSM target strings.\nCLUSTER_ID=$(jq -r '.ClusterArn // \"\" | split(\"/\") | last' \u003c\u003c\u003c \"$DESC\")\n[[ -n \"$CLUSTER_ID\" ]] || { bad \"cannot extract cluster ID from ARN\"; exit 1; }\nCLUSTER_ID=$(validate_cluster_id \"$CLUSTER_ID\")\n\n# --- SSM remote-execution helper ----------------------------------------------\n#\n# `ssm_run` runs a command on a HyperPod node via SSM (read-only).\n#\n# Design notes:\n# 1. The remote script is base64-encoded locally and decoded remotely. The agent's\n# command parameter is a fixed `sh -c \"echo \u003cBASE64> | base64 -d | bash\"`; the\n# base64 string contains only [A-Za-z0-9+/=] and is safe inside double quotes.\n# Nothing from the script's caller appears unescaped in the SSM-agent's argv.\n# 2. Server-derived values that need to be visible to the remote script are passed\n# as named environment variables (`VAR=VALUE` trailing args). Each value is run\n# through `jq @sh` (single-quoted shell-safe encoding with `'\\''` escapes) and\n# prepended to the remote script as `export VAR='\u003csafely-quoted>'; ...`. The remote\n# shell reads them as `$NODE`, `$NODELIST`, etc. 
— values never reach a remote\n# shell-eval context as raw interpolated text.\n# 3. `unbuffer` is required to defeat the SSM \"Cannot perform start session: EOF\"\n# race; the prerequisite check above guarantees it's present.\n# 4. Returns the underlying aws-cli exit code so callers can distinguish transport\n# failures from successful empty output.\n#\n# Usage:\n# ssm_run TARGET REMOTE_SCRIPT [VAR=VALUE ...]\nssm_run() {\n local target=\"$1\"; shift\n local script=\"$1\"; shift\n local export_block=\"\" raw_kv key val safe_val\n for raw_kv in \"$@\"; do\n [[ \"$raw_kv\" =~ ^([A-Za-z_][A-Za-z0-9_]*)=(.*)$ ]] || {\n echo \"ssm_run: invalid VAR=VALUE: $raw_kv\" >&2\n return 2\n }\n key=\"${BASH_REMATCH[1]}\"\n val=\"${BASH_REMATCH[2]}\"\n # jq's @sh produces single-quoted shell-safe text with embedded `'\\''` escapes.\n safe_val=$(jq -nr --arg v \"$val\" '$v | @sh')\n export_block+=\"export ${key}=${safe_val}; \"\n done\n local full_script=\"${export_block}${script}\"\n local b64\n if base64 --help 2>&1 | grep -q '\\-w'; then\n b64=$(printf '%s' \"$full_script\" | base64 -w0)\n else\n b64=$(printf '%s' \"$full_script\" | base64 -b0)\n fi\n local wrapper=\"sh -c \\\"echo $b64 | base64 -d | bash\\\"\"\n local params\n params=$(jq -nc --arg c \"$wrapper\" '{command: [$c]}')\n local out rc=0\n out=$(unbuffer aws ssm start-session --region \"$REGION\" --target \"$target\" \\\n --document-name AWS-StartNonInteractiveCommand \\\n --parameters \"$params\" 2>&1) || rc=$?\n # NOTE: do NOT strip 'Cannot perform start session' here — that line is the\n # SSM transport-failure signal that ssm_transport_failed() detects. 
Only filter\n # benign session chrome ('Starting session' / 'Exiting session') and ANSI escapes.\n printf '%s' \"$out\" \\\n | sed -e 's/\\x1b\\[[0-9;?]*[a-zA-Z]//g' \\\n -e '/^Starting session/d' \\\n -e '/^Exiting session/d'\n return \"$rc\"\n}\n\n# Returns 0 if the SSM raw output indicates a transport-layer failure (no command\n# output, session refused, EOF before flush) — distinct from \"command ran and returned\n# nothing.\" Used to bail out early rather than misreport every downstream check.\nssm_transport_failed() {\n local raw=\"${1-}\"\n grep -qiE 'Cannot perform start session|TargetNotConnected|InvalidTarget|AccessDeniedException|UnauthorizedOperation' \u003c\u003c\u003c \"$raw\"\n}\n\n# --- Find controller node -----------------------------------------------------\nNODES_JSON=$(aws sagemaker list-cluster-nodes --cluster-name \"$CLUSTER\" --region \"$REGION\" \\\n --output json 2>&1) || { bad \"list-cluster-nodes failed: $NODES_JSON\"; exit 1; }\n\n# Discovery priority:\n# 1. --controller-group \u003cNAME> (operator override — always wins)\n# 2. InstanceGroups[].SlurmConfig.NodeType == \"Controller\" (managed-Slurm authoritative)\n# 3. /opt/ml/config/provisioning_parameters.json on a probe node (self-managed fallback)\n# 4. 
Refuse to guess — print available groups and exit.\n# We never guess based on instance-group naming — that's a lifecycle-script convention,\n# not a guarantee, and getting it wrong sends every command to a non-controller.\nCONTROLLER_GROUP=\"\"\nCONTROLLER_DISCOVERY_METHOD=\"\"\n\n# (1) Operator override — always wins.\nif [[ -n \"$CONTROLLER_GROUP_OVERRIDE\" ]]; then\n CONTROLLER_GROUP=\"$CONTROLLER_GROUP_OVERRIDE\"\n CONTROLLER_DISCOVERY_METHOD=\"--controller-group flag\"\nfi\n\n# (2) Managed-Slurm authoritative source.\nif [[ -z \"$CONTROLLER_GROUP\" && \"$HAS_SLURM_CONFIG\" == \"true\" ]]; then\n CONTROLLER_GROUP=$(jq -r '\n .InstanceGroups[]?\n | select((.SlurmConfig.NodeType // \"\") == \"Controller\")\n | .InstanceGroupName' \u003c\u003c\u003c \"$DESC\" | head -1)\n if [[ -n \"$CONTROLLER_GROUP\" ]]; then\n CONTROLLER_DISCOVERY_METHOD=\"DescribeCluster.SlurmConfig\"\n fi\nfi\n\n# (3) Self-managed: read provisioning_parameters.json from any node.\n# The lifecycle-script convention is that this file is dropped at the same path on every\n# node, so we pick any node arbitrarily, SSM in, and read the controller_group field.\nif [[ -z \"$CONTROLLER_GROUP\" ]]; then\n PROBE_ID=$(jq -r '.ClusterNodeSummaries[0].InstanceId // \"\"' \u003c\u003c\u003c \"$NODES_JSON\")\n PROBE_GROUP=$(jq -r '.ClusterNodeSummaries[0].InstanceGroupName // \"\"' \u003c\u003c\u003c \"$NODES_JSON\")\n if [[ -n \"$PROBE_ID\" && -n \"$PROBE_GROUP\" ]]; then\n PROBE_ID_V=$(validate_instance_id \"$PROBE_ID\")\n PROBE_GROUP_V=$(validate_group_name \"$PROBE_GROUP\")\n PROBE_TARGET=\"sagemaker-cluster:${CLUSTER_ID}_${PROBE_GROUP_V}-${PROBE_ID_V}\"\n # Field name varies between lifecycle-script generations — try both.\n PROV_GROUP=$(ssm_run \"$PROBE_TARGET\" \\\n 'jq -r \".controller_group // .ControllerGroup // empty\" /opt/ml/config/provisioning_parameters.json 2>/dev/null' \\\n 2>/dev/null | tr -d '\\r\\n' || true)\n if [[ -n \"$PROV_GROUP\" ]]; then\n MATCHED=$(jq -r --arg g \"$PROV_GROUP\" 
\\\n '[.ClusterNodeSummaries[]? | select(.InstanceGroupName == $g)] | length' \u003c\u003c\u003c \"$NODES_JSON\")\n if [[ \"$MATCHED\" -gt 0 ]]; then\n CONTROLLER_GROUP=\"$PROV_GROUP\"\n CONTROLLER_DISCOVERY_METHOD=\"provisioning_parameters.json on $PROBE_ID_V\"\n fi\n fi\n fi\nfi\n\n# (4) Out of options — refuse to guess. Tell the operator how to unblock.\nif [[ -z \"$CONTROLLER_GROUP\" ]]; then\n bad \"cannot identify the Slurm controller instance group\"\n if [[ \"$HAS_SLURM_CONFIG\" == \"true\" ]]; then\n info \"no InstanceGroup has SlurmConfig.NodeType=Controller in DescribeCluster output\"\n info \"this is unexpected for a managed-Slurm cluster — verify the cluster was\"\n info \"created with the SlurmConfig parameter, or pass --controller-group \u003cNAME>.\"\n else\n info \"self-managed Slurm cluster — provisioning_parameters.json was not readable\"\n info \"from a probe node, and no --controller-group flag was provided.\"\n info \"\"\n info \"Resolve by either:\"\n info \" 1. inspecting the head node manually:\"\n info \" aws ssm start-session --target $PROBE_TARGET --region $REGION\"\n info \" cat /opt/ml/config/provisioning_parameters.json | jq .\"\n info \" 2. re-running with the controller group's name:\"\n info \" --controller-group \u003cINSTANCE_GROUP_NAME>\"\n info \"\"\n info \"Available instance groups in this cluster:\"\n jq -r '.ClusterNodeSummaries[] | \" - \" + .InstanceGroupName + \" (\" + .InstanceId + \")\"' \\\n \u003c\u003c\u003c \"$NODES_JSON\" | sort -u\n fi\n exit 1\nfi\nCONTROLLER_GROUP=$(validate_group_name \"$CONTROLLER_GROUP\")\n\n# Pick the first node from the controller group.\nCONTROLLER_ID=$(jq -r --arg g \"$CONTROLLER_GROUP\" \\\n '.ClusterNodeSummaries[]? 
| select(.InstanceGroupName == $g) | .InstanceId' \u003c\u003c\u003c \"$NODES_JSON\" | head -1)\n[[ -n \"$CONTROLLER_ID\" ]] || { bad \"controller group $CONTROLLER_GROUP has no nodes\"; exit 1; }\nCONTROLLER_ID=$(validate_instance_id \"$CONTROLLER_ID\")\n\nok \"controller node: $CONTROLLER_ID (group=$CONTROLLER_GROUP, source=$CONTROLLER_DISCOVERY_METHOD)\"\n\nSSM_HEAD=\"sagemaker-cluster:${CLUSTER_ID}_${CONTROLLER_GROUP}-${CONTROLLER_ID}\"\n\n# --- Collect Slurm state from head node ---------------------------------------\nsection \"2. Slurm cluster state (from head node)\"\nSSM_PROBE=$(ssm_run \"$SSM_HEAD\" 'echo SSM_OK' || true)\nif ! grep -q '^SSM_OK

$'

\u003c\u003c\u003c \"$SSM_PROBE\"; then\n bad \"cannot reach head node via SSM — every downstream check would be unreliable\"\n if ssm_transport_failed \"$SSM_PROBE\"; then\n info \" transport error detected (TargetNotConnected, AccessDenied, or EOF race)\"\n fi\n info \" reproduce manually with the same target and region:\"\n info \" aws ssm start-session --target $SSM_HEAD --region $REGION\"\n info \" if that fails, route to the hyperpod-ssm skill before retrying.\"\n exit 1\nfi\nok \"SSM transport to head node working\"\n\nSINFO_OUT=$(ssm_run \"$SSM_HEAD\" 'sinfo -h -o \"%N|%T|%E\" 2>&1 | head -200' || true)\nif [[ $(printf '%s\\n' \"$SINFO_OUT\" | wc -l) -ge 200 ]]; then\n warn \"sinfo output reached the 200-line cap — node-state results may be truncated on this large cluster\"\nfi\nif grep -qi 'command not found' \u003c\u003c\u003c \"$SINFO_OUT\"; then\n bad \"sinfo not installed on head node — Slurm lifecycle script may not have run\"\n info \"verify on the node: systemctl status slurmctld; ls /opt/slurm*/etc /etc/slurm 2>/dev/null\"\n exit 1\nfi\nif [[ -z \"$SINFO_OUT\" ]]; then\n warn \"sinfo returned no rows — empty cluster, or controller not yet responding\"\nfi\n\n# Parse sinfo lines. Node names from sinfo are server-controlled; validate before they\n# can be embedded into any later command. Values that fail validation are dropped, not\n# trusted; we report the count of skipped entries so the operator notices.\nDOWN_NODES=()\nREBOOT_NODES=()\nFAIL_NODES=()\nBAD_REASON_NODES=()\nSKIPPED_INVALID=0\nwhile IFS='|' read -r node state reason; do\n [[ -z \"$node\" ]] && continue\n if ! 
[[ \"$node\" =~ ^[a-zA-Z0-9._-]{1,253}$ ]]; then\n SKIPPED_INVALID=$((SKIPPED_INVALID+1))\n continue\n fi\n # Reasons can contain spaces and punctuation; allow them but strip ANSI/control chars.\n reason=\"$(_sanitize \"$reason\")\"\n if grep -qi 'fail' \u003c\u003c\u003c \"$state\"; then\n if [[ \"$reason\" =~ ^Action:(Reboot|Replace)$ ]]; then\n FAIL_NODES+=(\"$node|$reason\")\n elif grep -qiE 'action[ :_-]*re(boot|place)|reboot|replace' \u003c\u003c\u003c \"$reason\"; then\n BAD_REASON_NODES+=(\"$node|$reason\")\n fi\n fi\n if grep -qiE 'down|drain' \u003c\u003c\u003c \"$state\"; then\n if grep -qi 'unexpectedly rebooted' \u003c\u003c\u003c \"$reason\"; then\n REBOOT_NODES+=(\"$node\")\n else\n DOWN_NODES+=(\"$node|$reason\")\n fi\n fi\ndone \u003c\u003c\u003c \"$SINFO_OUT\"\n[[ \"$SKIPPED_INVALID\" -gt 0 ]] && warn \"$SKIPPED_INVALID sinfo row(s) had invalid node names and were ignored\"\n\nif [[ ${#DOWN_NODES[@]} -eq 0 && ${#REBOOT_NODES[@]} -eq 0 && ${#FAIL_NODES[@]} -eq 0 && ${#BAD_REASON_NODES[@]} -eq 0 ]]; then\n ok \"all nodes in healthy Slurm states\"\nelse\n [[ ${#DOWN_NODES[@]} -gt 0 ]] && bad \"${#DOWN_NODES[@]} node(s) DOWN/DRAIN (Section A)\"\n [[ ${#REBOOT_NODES[@]} -gt 0 ]] && bad \"${#REBOOT_NODES[@]} node(s) with 'unexpectedly rebooted' (Section B)\"\n [[ ${#FAIL_NODES[@]} -gt 0 ]] && warn \"${#FAIL_NODES[@]} node(s) in fail state with valid Action:* reason (HyperPod recovery in progress)\"\n [[ ${#BAD_REASON_NODES[@]} -gt 0 ]] && bad \"${#BAD_REASON_NODES[@]} node(s) in fail state with non-matching reason (Section D)\"\nfi\n\n# --- Section D: Action:* reason-string validation -----------------------------\nif [[ ${#BAD_REASON_NODES[@]} -gt 0 ]]; then\n section \"D. 
Reason-string mismatch — HyperPod auto-recovery will NOT trigger\"\n for entry in \"${BAD_REASON_NODES[@]}\"; do\n n=\"${entry%%|*}\"; r=\"${entry#*|}\"\n bad \"$n: reason='$r'\"\n done\n info \"the reason field must match exactly: Action:Reboot or Action:Replace\"\n info \"(case-sensitive, no spaces, no trailing punctuation)\"\n hint \"for re-issue procedure, see:\"\n info \" https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html\"\n info \" references/slurm-details.md#action-reason-string-validation\"\n ISSUES+=(\"bad-action-reason\")\n NEXT_STEPS+=(\"see AWS replace-faulty-instance docs (link above)\")\nfi\n\n# --- Detect in-progress HyperPod replacements (informational) -----------------\nif [[ ${#FAIL_NODES[@]} -gt 0 ]]; then\n section \" HyperPod recovery in progress (do not interfere)\"\n for entry in \"${FAIL_NODES[@]}\"; do\n n=\"${entry%%|*}\"; r=\"${entry#*|}\"\n info \"$n ($r)\"\n done\n info \"AWS docs: do NOT change node state or restart slurmctld until this completes.\"\n info \"If a replacement seems stuck > 30 min, see:\"\n info \" https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html\"\nfi\n\n# --- Check controller health --------------------------------------------------\nsection \"3. 
slurmctld health\"\nPING_OUT=$(ssm_run \"$SSM_HEAD\" 'scontrol ping 2>&1' || true)\nPING_FIRST_LINE=$(head -1 \u003c\u003c\u003c \"$PING_OUT\" | tr -d '\\r')\nif grep -qi 'UP' \u003c\u003c\u003c \"$PING_OUT\"; then\n ok \"slurmctld responding: $(tr '\\n' ' ' \u003c\u003c\u003c \"$PING_OUT\")\"\nelif [[ -z \"$PING_OUT\" ]] || ssm_transport_failed \"$PING_OUT\"; then\n warn \"could not get a response from scontrol ping — cannot determine controller health\"\n info \"this is most likely an SSM transport problem, not a hung controller\"\n info \"do NOT restart slurmctld based on this finding alone\"\nelif grep -qi 'DOWN' \u003c\u003c\u003c \"$PING_OUT\"; then\n bad \"slurmctld reports DOWN: $PING_FIRST_LINE\"\n ISSUES+=(\"controller-hung\")\n NEXT_STEPS+=(\"controller restart — see references/slurm-details.md#-c-controller-state--diagnostic-context\")\nelse\n bad \"slurmctld responded with an unrecognized status: $PING_FIRST_LINE\"\n ISSUES+=(\"controller-hung\")\n NEXT_STEPS+=(\"inspect logs first; controller restart only if logs confirm a hang\")\nfi\n\n# --- Section C-1: slurmdbd connectivity (controller-state restart trigger) ---\nsection \"C (slurmdbd): accounting daemon connectivity\"\nDBD_OUT=$(ssm_run \"$SSM_HEAD\" 'sacctmgr -i show stats 2>&1 | head -20' || true)\nif grep -qiE 'unable to contact|connection refused|cannot connect|no slurmdbd' \u003c\u003c\u003c \"$DBD_OUT\"; then\n bad \"slurmctld cannot reach slurmdbd\"\n info \"$(head -3 \u003c\u003c\u003c \"$DBD_OUT\")\"\n hint \"diagnostic and recovery procedure:\"\n info \" https://slurm.schedmd.com/accounting.html\"\n info \" references/slurm-details.md#slurmdbd-connectivity\"\n ISSUES+=(\"slurmdbd-disconnected\")\n NEXT_STEPS+=(\"restore slurmdbd connectivity (see AWS / Slurm docs linked above)\")\nelif grep -qiE 'rollup|rpc' \u003c\u003c\u003c \"$DBD_OUT\"; then\n ok \"slurmdbd reachable\"\nelse\n warn \"could not determine slurmdbd state from sacctmgr output\"\n info \"if accounting is configured, run 
on the head node: sacctmgr show stats\"\nfi\n\n# --- Section C-2: pending slurm.conf reconfiguration (controller-state restart trigger) ---\n# HyperPod's slurm.conf lives at /opt/slurm-\u003cversion>/etc/slurm.conf rather than the\n# upstream /etc/slurm/slurm.conf, so the remote script asks scontrol where the live\n# config is. The output is a `\u003cconf-mtime>|\u003cctld-start>|\u003cconf-path>` line that we\n# match strictly with a regex before parsing.\nsection \"C (config): slurm.conf freshness\"\nread -r -d '' F_REMOTE \u003c\u003c'REMOTE_F' || true\nset -e\n# nosemgrep: bash.lang.correctness.unquoted-expansion.unquoted-variable-expansion-in-command\n_CONF=\"$(scontrol show config 2>/dev/null | awk -F= '/^SLURM_CONF/ {gsub(/ /,\"\",$2); print $2; exit}')\"\nCONF_MTIME=0\nif [ -n \"$_CONF\" ] && [ -r \"$_CONF\" ]; then\n CONF_MTIME=$(stat -c %Y \"$_CONF\" 2>/dev/null || echo 0)\nfi\nCTLD_TS=$(systemctl show slurmctld -p ActiveEnterTimestamp --value 2>/dev/null || true)\nCTLD_START=0\nif [ -n \"$CTLD_TS\" ]; then\n CTLD_START=$(date -d \"$CTLD_TS\" +%s 2>/dev/null || echo 0)\nfi\nprintf 'F_RESULT|%s|%s|%s\\n' \"${CONF_MTIME}\" \"${CTLD_START}\" \"${_CONF}\"\nREMOTE_F\nF_LINE=$(ssm_run \"$SSM_HEAD\" \"$F_REMOTE\" 2>/dev/null | grep -E '^F_RESULT\\|[0-9]+\\|[0-9]+\\|' | head -1 || true)\nif [[ \"$F_LINE\" =~ ^F_RESULT\\|([0-9]+)\\|([0-9]+)\\|(.*)$ ]]; then\n CONF_MTIME=\"${BASH_REMATCH[1]}\"\n CTLD_START=\"${BASH_REMATCH[2]}\"\n CONF_PATH=\"${BASH_REMATCH[3]}\"\n # CONF_PATH must be a real-looking absolute path before we put it into operator-\n # facing recommendations. Reject anything that has shell-active characters.\n if ! 
[[ \"$CONF_PATH\" =~ ^/[A-Za-z0-9._/-]+$ ]]; then\n warn \"slurm.conf path returned by remote did not validate; skipping freshness check\"\n elif [[ \"$CONF_MTIME\" -gt \"$CTLD_START\" && \"$CTLD_START\" -gt 0 ]]; then\n DELTA=$((CONF_MTIME - CTLD_START))\n warn \"$CONF_PATH modified ${DELTA}s after slurmctld last started — config may be stale in memory\"\n hint \"for the reload-vs-restart decision and procedure, see:\"\n info \" https://slurm.schedmd.com/scontrol.html\"\n info \" https://slurm.schedmd.com/slurm.conf.html\"\n info \" references/slurm-details.md#scontrol-reconfigure-vs-restart\"\n ISSUES+=(\"stale-conf\")\n NEXT_STEPS+=(\"review reload procedure in linked docs\")\n else\n ok \"slurm.conf older than slurmctld start time — no pending reconfigure\"\n fi\nelse\n warn \"could not determine slurm.conf vs slurmctld timestamps\"\nfi\n\n# --- Check for stuck jobs -----------------------------------------------------\nsection \"4. Job queue health\"\nSQUEUE_OUT=$(ssm_run \"$SSM_HEAD\" 'squeue -h -o \"%i|%T|%r\" 2>&1 | head -200' || true)\nif [[ $(printf '%s\\n' \"$SQUEUE_OUT\" | wc -l) -ge 200 ]]; then\n warn \"squeue output reached the 200-line cap — stuck-job counts below may underreport on this large cluster\"\nfi\nSTUCK_PENDING=0\nSTUCK_COMPLETING=0\nwhile IFS='|' read -r jobid state reason; do\n [[ -z \"$jobid\" ]] && continue\n [[ \"$state\" == \"PENDING\" && \"$reason\" == \"Resources\" ]] && STUCK_PENDING=$((STUCK_PENDING+1))\n [[ \"$state\" == \"COMPLETING\" ]] && STUCK_COMPLETING=$((STUCK_COMPLETING+1))\ndone \u003c\u003c\u003c \"$SQUEUE_OUT\"\n\nif [[ $STUCK_PENDING -gt 0 ]]; then\n warn \"$STUCK_PENDING job(s) PENDING with Reason=Resources\"\n if [[ ${#DOWN_NODES[@]} -eq 0 ]]; then\n ISSUES+=(\"stuck-pending-with-idle-nodes\")\n NEXT_STEPS+=(\"controller restart — Section C\")\n fi\nfi\nif [[ $STUCK_COMPLETING -gt 0 ]]; then\n bad \"$STUCK_COMPLETING job(s) stuck in COMPLETING\"\n ISSUES+=(\"stuck-completing\")\n NEXT_STEPS+=(\"controller restart 
— Section C\")\nfi\n[[ $STUCK_PENDING -eq 0 && $STUCK_COMPLETING -eq 0 ]] && ok \"no stuck jobs\"\n\n# --- Per-node inspection (read-only) ------------------------------------------\ninspect_node() {\n local slurm_node=\"$1\"\n # Defense-in-depth: validate again at the boundary even though all upstream paths\n # validate. Cheap, and catches future refactors that miss a callsite.\n slurm_node=$(validate_node_name \"$slurm_node\")\n\n local instance_id group ssm_target\n # PrivateDnsName looks like `ip-10-1-2-3.us-west-2.compute.internal`. The strict\n # `\u003cname>.` match handles the default `ip-x-x-x-x` form and rejects the false\n # positive where node `ip-10-1-2-3` would otherwise also match\n # `ip-10-1-2-30.\u003cregion>.compute.internal`.\n instance_id=$(jq -r --arg dns \"$slurm_node\" '\n .ClusterNodeSummaries[]?\n | select((.PrivateDnsName // \"\") | startswith($dns + \".\"))\n | .InstanceId' \u003c\u003c\u003c \"$NODES_JSON\" | head -1)\n if [[ -z \"$instance_id\" ]]; then\n if [[ ! \"$slurm_node\" =~ ^ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+$ ]]; then\n warn \"$slurm_node: not in the default ip-X-X-X-X form — Slurm-node-name → instance-ID auto-mapping needs DNS lookup or scontrol show node, neither cheap from here. 
Pass --target-instance-id \u003ci-xxx> if you have it, or look up via 'scontrol show node $slurm_node | grep NodeAddr' on the controller.\"\n else\n warn \"$slurm_node: cannot map to instance ID (PrivateDnsName mismatch — verify node is in this cluster)\"\n fi\n return\n fi\n instance_id=$(validate_instance_id \"$instance_id\")\n\n group=$(jq -r --arg id \"$instance_id\" \\\n '.ClusterNodeSummaries[] | select(.InstanceId==$id) | .InstanceGroupName // \"\"' \u003c\u003c\u003c \"$NODES_JSON\")\n group=$(validate_group_name \"$group\")\n ssm_target=\"sagemaker-cluster:${CLUSTER_ID}_${group}-${instance_id}\"\n\n local slurmd_status disk mem rpc_check\n slurmd_status=$(ssm_run \"$ssm_target\" 'systemctl is-active slurmd 2>&1' | tr -d '\\r\\n' || true)\n disk=$(ssm_run \"$ssm_target\" 'df -h / | awk \"NR==2 {print \\$5}\"' | tr -d '\\r\\n' || true)\n mem=$(ssm_run \"$ssm_target\" 'free -h | awk \"/Mem:/ {print \\$3\\\"/\\\"\\$2}\"' | tr -d '\\r\\n' || true)\n\n # Slurm-RPC reachability: srun -w \"$NODE\" hostname. 
The remote script reads $NODE\n # from the environment, so the slurm node name is never string-interpolated into\n # the remote shell — it lives in env-var space the whole way.\n rpc_check=$(ssm_run \"$SSM_HEAD\" 'timeout 10 srun --immediate=5 -w \"$NODE\" hostname 2>&1 | tail -1' \\\n \"NODE=$slurm_node\" | tr -d '\\r\\n' || true)\n\n info \"$slurm_node ($instance_id): slurmd=$slurmd_status disk=$disk mem=$mem\"\n info \" srun RPC: ${rpc_check:-\u003cno output>}\"\n\n local disk_num=\"${disk%\\%}\"\n if [[ \"$disk_num\" =~ ^[0-9]+$ && \"$disk_num\" -ge 95 ]]; then\n bad \" $slurm_node: root volume ${disk} — clean up before any restart\"\n info \" HyperPod storage layout: https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md\"\n ISSUES+=(\"disk-full-$slurm_node\")\n NEXT_STEPS+=(\"clean disk on $slurm_node before recovery\")\n fi\n if [[ \"$slurmd_status\" != \"active\" ]]; then\n bad \" $slurm_node: slurmd is '$slurmd_status'\"\n info \" for recovery procedure, see:\"\n info \" https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md\"\n fi\n if [[ -n \"$rpc_check\" ]] && grep -qiE 'auth|munge|invalid' \u003c\u003c\u003c \"$rpc_check\"; then\n bad \" $slurm_node: srun reports auth/munge error — slurmd-controller trust broken\"\n info \" for munge troubleshooting, see Slurm authentication docs:\"\n info \" https://slurm.schedmd.com/authentication.html\"\n fi\n}\n\nif [[ -n \"$TARGET_NODE\" ]]; then\n section \"5. Inspecting node: $TARGET_NODE\"\n inspect_node \"$TARGET_NODE\"\nelif [[ ${#DOWN_NODES[@]} -gt 0 || ${#REBOOT_NODES[@]} -gt 0 ]]; then\n section \"5. 
Inspecting affected nodes\"\n for entry in \"${DOWN_NODES[@]-}\"; do\n [[ -z \"$entry\" ]] && continue\n inspect_node \"${entry%%|*}\"\n done\n for n in \"${REBOOT_NODES[@]-}\"; do\n [[ -z \"$n\" ]] && continue\n inspect_node \"$n\"\n done\nfi\n\n# --- Section E: HyperPod auto-resume support + recent missed-resume detection ---\nsection \"E. Auto-resume support\"\n\nAR_HELP=$(ssm_run \"$SSM_HEAD\" 'srun --help 2>&1 | grep -i auto-resume | head -3' || true)\nif [[ -n \"$AR_HELP\" ]]; then\n ok \"srun --auto-resume is available on this cluster\"\nelse\n warn \"srun --auto-resume not found in srun --help output\"\n info \"this AMI / Slurm build may predate HyperPod auto-resume support\"\n info \"see: references/slurm-details.md#hyperpod-auto-resume\"\n ISSUES+=(\"auto-resume-unsupported\")\n NEXT_STEPS+=(\"upgrade the cluster AMI / Slurm package to enable --auto-resume\")\nfi\n\nread -r -d '' G_FAILS \u003c\u003c'REMOTE_G' || true\nsacct -X -n --starttime=now-6hours \\\n -o JobID,State,ExitCode,NodeList \\\n --state=NODE_FAIL,FAILED 2>/dev/null \\\n | awk 'NF>=4 && $4!~/None/ {print $1\"|\"$2\"|\"$4}' | head -50\nREMOTE_G\nRECENT_FAILS=$(ssm_run \"$SSM_HEAD\" \"$G_FAILS\" 2>/dev/null || true)\n\nMISSED_AR=()\nNOW_EPOCH=$(date +%s)\nwhile IFS='|' read -r jobid state nodelist; do\n [[ -z \"$jobid\" ]] && continue\n # Only single-node failures — multi-node lists need a real range expander.\n [[ \"$nodelist\" == *,* || \"$nodelist\" == *\\[* ]] && continue\n # Validate before passing to remote.\n if ! [[ \"$nodelist\" =~ ^[a-zA-Z0-9._-]{1,253}$ ]]; then\n continue\n fi\n # A successful HyperPod replace clears the node's Reason field once the new instance\n # registers, so grepping for \"Action:Replace\" is unreliable. 
Detect a recent replace\n # by comparing scontrol show node's BootTime to wall-clock: a fresh BootTime within\n # the last 6h that's later than the failed-job's End time strongly suggests the node\n # was replaced (or rebooted) after the job died.\n BOOT_LINE=$(ssm_run \"$SSM_HEAD\" 'scontrol show node \"$NODE\" 2>/dev/null | tr \" \" \"\\n\" | grep \"^BootTime=\"' \\\n \"NODE=$nodelist\" | head -1 | tr -d '\\r\\n' || true)\n BOOT_STR=\"${BOOT_LINE#BootTime=}\"\n [[ -z \"$BOOT_STR\" || \"$BOOT_STR\" == \"Unknown\" ]] && continue\n BOOT_EPOCH=$(date -d \"$BOOT_STR\" +%s 2>/dev/null || echo 0)\n [[ \"$BOOT_EPOCH\" =~ ^[0-9]+$ && \"$BOOT_EPOCH\" -gt 0 ]] || continue\n AGE=$((NOW_EPOCH - BOOT_EPOCH))\n if [[ $AGE -ge 0 && $AGE -le 21600 ]]; then # 6h window\n MISSED_AR+=(\"$jobid|$state|$nodelist|$BOOT_STR\")\n fi\ndone \u003c\u003c\u003c \"$RECENT_FAILS\"\n\nif [[ ${#MISSED_AR[@]} -gt 0 ]]; then\n bad \"${#MISSED_AR[@]} recent job(s) failed on a node that was rebooted/replaced shortly after — possible missed auto-resume:\"\n for entry in \"${MISSED_AR[@]}\"; do\n IFS='|' read -r jobid state nodelist boot \u003c\u003c\u003c \"$entry\"\n info \" job $jobid ($state) on $nodelist (node BootTime=$boot)\"\n done\n info \"(heuristic: node BootTime is within the last 6h, suggesting a replace or reboot)\"\n hint \"verify the launch command used srun --auto-resume=1 (NOT just sbatch):\"\n info \" sacct -j \u003cJOBID> -o JobID,JobName,Submit,Start,End,State,ExitCode,NodeList -X\"\n info \" scontrol show job \u003cJOBID> # only if still in the controller's recent history\"\n info \"see: references/slurm-details.md#hyperpod-auto-resume\"\n ISSUES+=(\"missed-auto-resume\")\n NEXT_STEPS+=(\"verify --auto-resume=1 is on the srun line, not just sbatch\")\nelif [[ -n \"$RECENT_FAILS\" ]]; then\n ok \"recent failed jobs do not match the missed-auto-resume pattern\"\nelse\n ok \"no recent NODE_FAIL / FAILED jobs in the last 6h\"\nfi\n\n# --- Findings → documentation links 
------------------------------------------\n# This skill is diagnostic-only. It never prints a remediation command. For each\n# finding, point the user at the authoritative doc and let them act.\nsection \"Where to read next\"\n\nif [[ ${#REBOOT_NODES[@]} -gt 0 ]]; then\n hint \"Section B — nodes flagged 'unexpectedly rebooted':\"\n for n in \"${REBOOT_NODES[@]}\"; do\n info \" $n\"\n done\n info \" HyperPod Slurm troubleshooting:\"\n info \" https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md\"\n info \" diagnostic context: references/slurm-details.md#-b-unexpected-reboot--diagnostic-context\"\nfi\n\nif [[ ${#DOWN_NODES[@]} -gt 0 ]]; then\n hint \"Section A — nodes DOWN/DRAIN:\"\n for entry in \"${DOWN_NODES[@]}\"; do\n n=\"${entry%%|*}\"; r=\"${entry#*|}\"\n info \" $n (reason: $r)\"\n done\n info \" HyperPod Slurm troubleshooting:\"\n info \" https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md\"\n info \" if the node flaps after a manual recovery → route to hyperpod-node-debugger\"\nfi\n\nCTRL_RESTART_REASON=\"\"\nISSUES_STR=\" ${ISSUES[*]-} \"\n[[ \"$ISSUES_STR\" == *\" controller-hung \"* ]] && CTRL_RESTART_REASON=\"scontrol ping failed\"\n[[ \"$ISSUES_STR\" == *\" stuck-completing \"* ]] && CTRL_RESTART_REASON=\"${CTRL_RESTART_REASON:+$CTRL_RESTART_REASON, }jobs stuck COMPLETING\"\n[[ \"$ISSUES_STR\" == *\" stuck-pending-with-idle-nodes \"* ]] && CTRL_RESTART_REASON=\"${CTRL_RESTART_REASON:+$CTRL_RESTART_REASON, }jobs PENDING with idle nodes\"\n\nif [[ -n \"$CTRL_RESTART_REASON\" ]]; then\n hint \"Section C — controller-state issue ($CTRL_RESTART_REASON):\"\n info \" Slurm slurmctld(8) — for what is preserved across a controller restart:\"\n info \" https://slurm.schedmd.com/slurmctld.html\"\n info \" HyperPod Slurm troubleshooting:\"\n info \" 
https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md\"\n if [[ ${#FAIL_NODES[@]} -gt 0 ]]; then\n warn \"HyperPod recovery is in progress on:\"\n for entry in \"${FAIL_NODES[@]}\"; do\n n=\"${entry%%|*}\"\n info \" $n\"\n done\n info \"AWS docs warn against changing node state or restarting slurmctld during a\"\n info \"replacement; wait for it to complete, then re-run this script.\"\n fi\n info \" diagnostic context: references/slurm-details.md#-c-controller-state--diagnostic-context\"\nfi\n\nif [[ \"$ISSUES_STR\" == *\" missed-auto-resume \"* ]]; then\n hint \"Section E — recent job failed on a node that was later replaced:\"\n info \" the most common cause is --auto-resume on sbatch instead of srun.\"\n info \" Use SageMaker HyperPod auto-resume:\"\n info \" https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html\"\n info \" diagnostic context: references/slurm-details.md#hyperpod-auto-resume\"\nfi\n\n# --- Summary ------------------------------------------------------------------\nsection \"Summary\"\nprintf ' Issues detected: %d\\n' \"${#ISSUES[@]-0}\"\nif [[ ${#ISSUES[@]-0} -eq 0 ]]; then\n ok \"cluster Slurm state is healthy\"\nelse\n echo \"\"\n echo \" Findings:\"\n for i in \"${ISSUES[@]}\"; do\n info \"- $i\"\n done\nfi\n\nif [[ ${#NEXT_STEPS[@]-0} -gt 0 ]]; then\n echo \"\"\n echo \" Where to read next:\"\n for s in \"${NEXT_STEPS[@]}\"; do\n info \"- $s\"\n done\nfi\n\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":37873,"content_sha256":"4d491556ceab75d4da1d5b78b7e5cc499ff33d095307d188cc19f9de96bfd27e"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"HyperPod Slurm Debugger","type":"text"}]},{"type":"paragraph","content":[{"text":"Diagnostic-only. Identify and classify Slurm scheduler and node-daemon issues on HyperPod Slurm clusters. 
Do not run, recommend, or print any state-mutating command. For remediation, link to the official AWS or Slurm documentation.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When to invoke","type":"text"}]},{"type":"paragraph","content":[{"text":"Invoke when the user reports any of the symptoms in the ","type":"text"},{"text":"decision table","type":"text","marks":[{"type":"link","attrs":{"href":"#decision-table","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"When NOT to invoke","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cluster has ","type":"text"},{"text":"Orchestrator.Eks","type":"text","marks":[{"type":"code_inline"}]},{"text":" — invoke ","type":"text"},{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"hyperpod-nccl","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Single-node hardware fault with healthy Slurm scheduler — invoke ","type":"text"},{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"NCCL training-hang investigation — invoke ","type":"text"},{"text":"hyperpod-nccl","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node unreachable via SSM — invoke ","type":"text"},{"text":"hyperpod-ssm","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Constraints","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Read-only. 
Do not run, recommend, or print state-mutating commands.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For any remediation, link to AWS or Slurm docs. The user authorizes and executes.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"IaC-managed cluster (Terraform / CloudFormation / CDK): warn that direct mutation drifts the live state from the IaC plan.","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Canonical recovery URLs: ","type":"text"},{"text":"references/slurm-details.md → Authoritative recovery documentation","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Prerequisites","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"AWS CLI v2, authenticated for the target account and region with permissions:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"sagemaker:DescribeCluster","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"sagemaker:ListClusterNodes","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ssm:StartSession","type":"text","marks":[{"type":"code_inline"}]},{"text":" on the HyperPod-created SSM document","type":"text"}]}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Session Manager plugin","type":"text","marks":[{"type":"link","attrs":{"href":"https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html","title":null}}]},{"text":" installed 
locally.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"jq","type":"text","marks":[{"type":"code_inline"}]},{"text":" ≥ 1.6.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"unbuffer","type":"text","marks":[{"type":"code_inline"}]},{"text":" (from the ","type":"text"},{"text":"expect","type":"text","marks":[{"type":"code_inline"}]},{"text":" package). Required — without it ","type":"text"},{"text":"aws ssm start-session","type":"text","marks":[{"type":"code_inline"}]},{"text":" returns empty stdout intermittently with ","type":"text"},{"text":"Cannot perform start session: EOF","type":"text","marks":[{"type":"code_inline"}]},{"text":" and every check silently misreports. Install: ","type":"text"},{"text":"expect","type":"text","marks":[{"type":"code_inline"}]},{"text":" package on Amazon Linux / RHEL / Debian / Ubuntu / macOS. Script exits at prerequisite check if missing.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Procedure","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 1 — Collect inputs","type":"text"}]},{"type":"paragraph","content":[{"text":"Ask the user for:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"HyperPod cluster name (not Slurm partition name).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"AWS region.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Optional: a specific Slurm node name.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 2 — Confirm orchestrator","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"aws sagemaker describe-cluster --cluster-name \u003cNAME/ARN> --region \u003cREGION> \\\n --query 
'Orchestrator' --output json","type":"text"}]},{"type":"paragraph","content":[{"text":"If ","type":"text"},{"text":"Orchestrator.Eks","type":"text","marks":[{"type":"code_inline"}]},{"text":" is present, stop. Route per ","type":"text"},{"text":"When NOT to invoke","type":"text","marks":[{"type":"link","attrs":{"href":"#when-not-to-invoke","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 3 — Run the diagnostic script","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"bash scripts/slurm-diagnose.sh --cluster \u003cNAME> --region \u003cREGION>\n# Scope to a node:\nbash scripts/slurm-diagnose.sh --cluster \u003cNAME> --region \u003cREGION> --node \u003cSLURM_NODE>","type":"text"}]},{"type":"paragraph","content":[{"text":"Relay the script output to the user verbatim.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Step 4 — Map findings → docs","type":"text"}]},{"type":"paragraph","content":[{"text":"For each finding, look up the section in the ","type":"text"},{"text":"decision table","type":"text","marks":[{"type":"link","attrs":{"href":"#decision-table","title":null}}]},{"text":" and link the user to the corresponding AWS / Slurm doc. 
Do not type out remediation commands.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Decision table","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Symptom (","type":"text"},{"text":"sinfo -o \"%N %T %30E\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" or script finding)","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Section","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Node state = ","type":"text"},{"text":"down","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"down*","type":"text","marks":[{"type":"code_inline"}]},{"text":", reason other than below","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"A: Node Down","type":"text","marks":[{"type":"link","attrs":{"href":"#a-node-down","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Node state = ","type":"text"},{"text":"down*","type":"text","marks":[{"type":"code_inline"}]},{"text":", Reason = ","type":"text"},{"text":"Node unexpectedly rebooted","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"B: Unexpected 
Reboot","type":"text","marks":[{"type":"link","attrs":{"href":"#b-unexpected-reboot","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Jobs ","type":"text"},{"text":"PENDING","type":"text","marks":[{"type":"code_inline"}]},{"text":" with ","type":"text"},{"text":"REASON=Resources","type":"text","marks":[{"type":"code_inline"}]},{"text":" while nodes are idle","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Controller State","type":"text","marks":[{"type":"link","attrs":{"href":"#c-controller-state","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Jobs stuck ","type":"text"},{"text":"COMPLETING","type":"text","marks":[{"type":"code_inline"}]},{"text":" after node replacement","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Controller State","type":"text","marks":[{"type":"link","attrs":{"href":"#c-controller-state","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"scontrol ping","type":"text","marks":[{"type":"code_inline"}]},{"text":" returns ","type":"text"},{"text":"DOWN","type":"text","marks":[{"type":"code_inline"}]},{"text":" for the controller","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Controller 
State","type":"text","marks":[{"type":"link","attrs":{"href":"#c-controller-state","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"GRES (GPU) counts incorrect or not released","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Controller State","type":"text","marks":[{"type":"link","attrs":{"href":"#c-controller-state","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"state=fail","type":"text","marks":[{"type":"code_inline"}]},{"text":" issued but no recovery occurred","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"D: Action Reason Mismatch","type":"text","marks":[{"type":"link","attrs":{"href":"#d-action-reason-mismatch","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Accounting errors or RPC errors mentioning ","type":"text"},{"text":"dbd","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Controller State","type":"text","marks":[{"type":"link","attrs":{"href":"#c-controller-state","title":null}}]},{"text":" (slurmdbd)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"slurm.conf","type":"text","marks":[{"type":"code_inline"}]},{"text":" edited; new partitions or nodes not 
visible","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Controller State","type":"text","marks":[{"type":"link","attrs":{"href":"#c-controller-state","title":null}}]},{"text":" (config)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Job exited on a hardware failure but did not restart","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"E: Auto-resume","type":"text","marks":[{"type":"link","attrs":{"href":"#e-auto-resume","title":null}}]}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Defaults","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Behavior","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Default","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Override","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Mode","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"read-only — always; no remediation flag 
exists","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"n/a","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Region","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"$AWS_DEFAULT_REGION","type":"text","marks":[{"type":"code_inline"}]},{"text":", falling back to ","type":"text"},{"text":"us-east-1","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"--region \u003cR>","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Scope","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"all nodes in ","type":"text"},{"text":"down","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"drain","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"fail","type":"text","marks":[{"type":"code_inline"}]},{"text":" / \"unexpectedly rebooted\"","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"--node 
\u003cSLURM_NODE_NAME>","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Output","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"colorized terminal","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"--no-color","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SSM target format","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"sagemaker-cluster:\u003cclusterId>_\u003cinstanceGroupName>-\u003cinstanceId>","type":"text","marks":[{"type":"code_inline"}]},{"text":" (derived)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"n/a","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Controller discovery","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"--controller-group","type":"text","marks":[{"type":"code_inline"}]},{"text":" (if set) → ","type":"text"},{"text":"SlurmConfig.NodeType=Controller","type":"text","marks":[{"type":"code_inline"}]},{"text":" → 
","type":"text"},{"text":"provisioning_parameters.json","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"--controller-group \u003cN>","type":"text","marks":[{"type":"code_inline"}]}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Error handling","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Failure","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skill behavior","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Required user action","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"describe-cluster","type":"text","marks":[{"type":"code_inline"}]},{"text":" fails","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Print AWS error; exit 1","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Fix credentials/region; verify cluster name","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Cluster has ","type":"text"},{"text":"Orchestrator.Eks","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Exit 
1 with pointer to EKS-side skills","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Use ","type":"text"},{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"hyperpod-nccl","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"session-manager-plugin","type":"text","marks":[{"type":"code_inline"}]},{"text":" missing / SSM unreachable","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"sinfo","type":"text","marks":[{"type":"code_inline"}]},{"text":" returns empty; exit 1","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Install plugin; verify node ","type":"text"},{"text":"InService","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Disk ≥ 95 % full on a ","type":"text"},{"text":"down","type":"text","marks":[{"type":"code_inline"}]},{"text":" node","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Report finding ","type":"text"},{"text":"disk-full-\u003cnode>","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Refer to AWS troubleshooting 
docs","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Missing ","type":"text"},{"text":"jq","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"aws","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Exit 1 at prerequisite check","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Install per ","type":"text"},{"text":"Prerequisites","type":"text","marks":[{"type":"link","attrs":{"href":"#prerequisites","title":null}}]}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"A: Node Down","type":"text"}]},{"type":"paragraph","content":[{"text":"Node is ","type":"text"},{"text":"down","type":"text","marks":[{"type":"code_inline"}]},{"text":" because ","type":"text"},{"text":"slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":" stopped responding. 
Causes: ","type":"text"},{"text":"slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":" crash, disk full, OOM, network partition, hardware fault.","type":"text"}]},{"type":"paragraph","content":[{"text":"Script checks: ","type":"text"},{"text":"systemctl is-active slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"srun -w \u003cNODE> hostname","type":"text","marks":[{"type":"code_inline"}]},{"text":" (RPC layer), disk, memory.","type":"text"}]},{"type":"paragraph","content":[{"text":"Link: ","type":"text"},{"text":"https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md","title":null}}]}]},{"type":"paragraph","content":[{"text":"If node returns to ","type":"text"},{"text":"down","type":"text","marks":[{"type":"code_inline"}]},{"text":" after a manual resume → escalate to ","type":"text"},{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"Context: ","type":"text"},{"text":"references/slurm-details.md § A","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#-a-node-down--diagnostic-context","title":null}}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"B: Unexpected Reboot","type":"text"}]},{"type":"paragraph","content":[{"text":"Node is ","type":"text"},{"text":"down*","type":"text","marks":[{"type":"code_inline"}]},{"text":" with Reason ","type":"text"},{"text":"\"Node unexpectedly rebooted\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" because ","type":"text"},{"text":"slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":" re-registered after an out-of-band 
reboot. Upstream Slurm behavior, not HyperPod. Node is typically healthy.","type":"text"}]},{"type":"paragraph","content":[{"text":"Links:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md","title":null}}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"https://slurm.schedmd.com/scontrol.html","type":"text","marks":[{"type":"link","attrs":{"href":"https://slurm.schedmd.com/scontrol.html","title":null}}]},{"text":" (","type":"text"},{"text":"state=resume","type":"text","marks":[{"type":"code_inline"}]},{"text":" semantics)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"If node reboots again within minutes → escalate to ","type":"text"},{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"paragraph","content":[{"text":"Context: ","type":"text"},{"text":"references/slurm-details.md § B","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#-b-unexpected-reboot--diagnostic-context","title":null}}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"C: Controller State","type":"text"}]},{"type":"paragraph","content":[{"text":"slurmctld","type":"text","marks":[{"type":"code_inline"}]},{"text":" in-memory state can desync from the on-disk state. A controller restart reloads from ","type":"text"},{"text":"StateSaveLocation","type":"text","marks":[{"type":"code_inline"}]},{"text":" and clears bad caches. 
User decides and executes.","type":"text"}]},{"type":"paragraph","content":[{"text":"Restart may help:","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Symptom","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Why","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"PENDING","type":"text","marks":[{"type":"code_inline"}]},{"text":" with ","type":"text"},{"text":"REASON=Resources","type":"text","marks":[{"type":"code_inline"}]},{"text":", idle nodes","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Re-evaluates the queue","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Jobs stuck ","type":"text"},{"text":"COMPLETING","type":"text","marks":[{"type":"code_inline"}]},{"text":" after node replacement","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Controller held a reference to the old node","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"GRES (GPU, EFA) not released after a job ends","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Resource accounting 
de-synced","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Nodes stuck ","type":"text"},{"text":"Unknown","type":"text","marks":[{"type":"code_inline"}]},{"text":" after reboot, ","type":"text"},{"text":"slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":" is up","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Re-registration was not processed","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"scontrol ping","type":"text","marks":[{"type":"code_inline"}]},{"text":" times out","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Controller event loop is hung","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Lost connection to ","type":"text"},{"text":"slurmdbd","type":"text","marks":[{"type":"code_inline"}]},{"text":" / RPC errors","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"DBD connection wedged","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Do NOT restart when:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"HyperPod replacement (","type":"text"},{"text":"Action:Replace","type":"text","marks":[{"type":"code_inline"}]},{"text":") in progress on any node — concurrent changes fail the replacement.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Only one compute 
node is bad — restart ","type":"text"},{"text":"slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":" on that node.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"sinfo","type":"text","marks":[{"type":"code_inline"}]},{"text":" and ","type":"text"},{"text":"squeue","type":"text","marks":[{"type":"code_inline"}]},{"text":" are responsive — problem is elsewhere.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"journalctl -u slurmctld","type":"text","marks":[{"type":"code_inline"}]},{"text":" not reviewed yet — panic / OOM will reproduce.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"slurm.conf","type":"text","marks":[{"type":"code_inline"}]},{"text":" was just edited — try ","type":"text"},{"text":"scontrol reconfigure","type":"text","marks":[{"type":"code_inline"}]},{"text":" first.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Folded triggers","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"slurmdbd disconnected","type":"text","marks":[{"type":"strong"}]},{"text":" — ","type":"text"},{"text":"sacct","type":"text","marks":[{"type":"code_inline"}]},{"text":" fails, accounting fields show ","type":"text"},{"text":"Unknown","type":"text","marks":[{"type":"code_inline"}]},{"text":", controller log spams ","type":"text"},{"text":"Unable to contact slurmdbd","type":"text","marks":[{"type":"code_inline"}]},{"text":". Restore ","type":"text"},{"text":"slurmdbd","type":"text","marks":[{"type":"code_inline"}]},{"text":" before considering controller restart. 
","type":"text"},{"text":"https://slurm.schedmd.com/accounting.html","type":"text","marks":[{"type":"link","attrs":{"href":"https://slurm.schedmd.com/accounting.html","title":null}}]},{"text":" · ","type":"text"},{"text":"details","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#slurmdbd-connectivity","title":null}}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Stale config","type":"text","marks":[{"type":"strong"}]},{"text":" — ","type":"text"},{"text":"slurm.conf","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"topology.conf","type":"text","marks":[{"type":"code_inline"}]},{"text":" mtime > slurmctld start. ","type":"text"},{"text":"scontrol reconfigure","type":"text","marks":[{"type":"code_inline"}]},{"text":" first; restart is fallback. ","type":"text"},{"text":"https://slurm.schedmd.com/scontrol.html","type":"text","marks":[{"type":"link","attrs":{"href":"https://slurm.schedmd.com/scontrol.html","title":null}}]},{"text":" · ","type":"text"},{"text":"details","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#scontrol-reconfigure-vs-restart","title":null}}]},{"text":".","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Restart procedure / what's 
preserved:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"https://slurm.schedmd.com/slurmctld.html","type":"text","marks":[{"type":"link","attrs":{"href":"https://slurm.schedmd.com/slurmctld.html","title":null}}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md","type":"text","marks":[{"type":"link","attrs":{"href":"https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/troubleshooting-doc-20250917/troubleshoot/index.md","title":null}}]}]}]}]},{"type":"paragraph","content":[{"text":"Context: ","type":"text"},{"text":"references/slurm-details.md § C","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#-c-controller-state--diagnostic-context","title":null}}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"D: Action Reason Mismatch","type":"text"}]},{"type":"paragraph","content":[{"text":"scontrol update state=fail reason=...","type":"text","marks":[{"type":"code_inline"}]},{"text":" was issued with a ","type":"text"},{"text":"reason","type":"text","marks":[{"type":"code_inline"}]},{"text":" that does not match ","type":"text"},{"text":"Action:Reboot","type":"text","marks":[{"type":"code_inline"}]},{"text":" or ","type":"text"},{"text":"Action:Replace","type":"text","marks":[{"type":"code_inline"}]},{"text":" exactly. HyperPod silently ignores anything else. 
Script detects near-misses on nodes in ","type":"text"},{"text":"fail","type":"text","marks":[{"type":"code_inline"}]},{"text":" state.","type":"text"}]},{"type":"paragraph","content":[{"text":"Required strings (case-sensitive, no whitespace, no punctuation):","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Action:Reboot","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Action:Replace","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"paragraph","content":[{"text":"Link: ","type":"text"},{"text":"https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html","type":"text","marks":[{"type":"link","attrs":{"href":"https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html","title":null}}]}]},{"type":"paragraph","content":[{"text":"Context: ","type":"text"},{"text":"references/slurm-details.md § Action reason-string validation","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#action-reason-string-validation","title":null}}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"E: Auto-resume","type":"text"}]},{"type":"paragraph","content":[{"text":"--auto-resume=1","type":"text","marks":[{"type":"code_inline"}]},{"text":" is an ","type":"text"},{"text":"srun","type":"text","marks":[{"type":"code_inline"}]},{"text":" step option. 
It re-runs the step after HMA (the Health Monitoring Agent) flags a node and Automatic node recovery replaces it.","type":"text"}]},{"type":"paragraph","content":[{"text":"Why it didn't restart the job:","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Flag on ","type":"text"},{"text":"sbatch","type":"text","marks":[{"type":"code_inline"}]},{"text":" not ","type":"text"},{"text":"srun","type":"text","marks":[{"type":"code_inline"}]},{"text":" — per-step; ","type":"text"},{"text":"sbatch","type":"text","marks":[{"type":"code_inline"}]},{"text":" directives are silently ignored.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"HMA did not flag the node — failure was application/transient, not hardware. Step exits as a normal Slurm failure.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cluster ","type":"text"},{"text":"NodeRecovery","type":"text","marks":[{"type":"code_inline"}]},{"text":" is ","type":"text"},{"text":"None","type":"text","marks":[{"type":"code_inline"}]},{"text":" — faulty nodes are labeled but not replaced.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"No checkpointing — step restarts from process zero each iteration.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"AMI predates HMA support (released 2025-09-11) — needs AMI / cluster-software update.","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"Link: ","type":"text"},{"text":"https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html","type":"text","marks":[{"type":"link","attrs":{"href":"https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-auto-resume.html","title":null}}]}]},{"type":"paragraph","content":[{"text":"Context: 
","type":"text"},{"text":"references/slurm-details.md § HyperPod auto-resume","type":"text","marks":[{"type":"link","attrs":{"href":"references/slurm-details.md#hyperpod-auto-resume","title":null}}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Escalation","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Condition","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Next skill","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Node returns to ","type":"text"},{"text":"down","type":"text","marks":[{"type":"code_inline"}]},{"text":" shortly after a manual resume","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":" (hardware)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"slurmd","type":"text","marks":[{"type":"code_inline"}]},{"text":" logs contain CUDA / NVIDIA / XID errors","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":" § G","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Disk full or 
","type":"text"},{"text":"/dev/shm","type":"text","marks":[{"type":"code_inline"}]},{"text":" exhausted","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-node-debugger","type":"text","marks":[{"type":"code_inline"}]},{"text":" § I","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Node unreachable via SSM","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-ssm","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Controller restart does not clear ","type":"text"},{"text":"COMPLETING","type":"text","marks":[{"type":"code_inline"}]},{"text":" after 2 attempts","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-issue-report","type":"text","marks":[{"type":"code_inline"}]},{"text":" + AWS 
Support","type":"text"}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"hyperpod-slurm-debugger","author":"@skillopedia","source":{"stars":765,"repo_name":"agent-plugins","origin_url":"https://github.com/awslabs/agent-plugins/blob/HEAD/plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/SKILL.md","repo_owner":"awslabs","body_sha256":"e33af3d3553d3961a08ac4c628d842c16ce22b6bcc5ebdebef0ff12454f8956b","cluster_key":"3ea5bf68d0e58f2d26aef0b4aeeea3ce98337fba383af8f202140f3153ba8dcf","clean_bundle":{"format":"clean-skill-bundle-v1","source":"awslabs/agent-plugins/plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/SKILL.md","attachments":[{"id":"ccfd8d3d-f82f-5d95-a8c4-74c2944d5770","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/ccfd8d3d-f82f-5d95-a8c4-74c2944d5770/attachment.md","path":"references/slurm-details.md","size":13600,"sha256":"1f9d6121a9630e3e0e25c36b7c5ae83c7dd11cf683fa3e2491bb1b016d0727a4","contentType":"text/markdown; charset=utf-8"},{"id":"3672c378-20b0-52a5-bdbd-1969185ba40f","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/3672c378-20b0-52a5-bdbd-1969185ba40f/attachment.sh","path":"scripts/slurm-diagnose.sh","size":37873,"sha256":"4d491556ceab75d4da1d5b78b7e5cc499ff33d095307d188cc19f9de96bfd27e","contentType":"application/x-sh; charset=utf-8"}],"bundle_sha256":"f17ab06cd9b012799b20eecb02d9159beb4ca213a9d71d467524e4eca2df873e","attachment_count":2,"text_attachments":2,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"plugins/sagemaker-ai/skills/hyperpod-slurm-debugger/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"web-development","category_label":"Web"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"web-development","metadata":{"version":"0.0.1"},"import_tag":"clean-skills-v1","description":"Diagnostic-only skill for Slurm scheduler and node-daemon 
issues on Amazon SageMaker HyperPod Slurm clusters. Scope mirrors the HyperPod troubleshooting guide. Invoke when the user reports a Slurm node stuck in down/drain, \"Node unexpectedly rebooted\" after auto-repair, slurmd not running, jobs stuck PENDING with REASON=Resources while sinfo shows idle nodes, jobs stuck COMPLETING after node replacement, GRES/GPU counts wrong, scontrol ping failing, slurmctld unresponsive, an Action:Reboot/Replace request that did not trigger HyperPod auto-recovery, or auto-resume not restarting a job. Also triggers on \"drain before reboot\", \"diagnose a Slurm node\", \"investigate stuck jobs.\""}},"renderedAt":1782986985669}
