hyperpod-cluster-debugger

HyperPod Cluster Debugger: operating policy. Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state; present each one as a "Suggested command (run this yourself)" block and wait for the customer to run it. Escalate destructive actions in this order: investigate → reboot → replace (replace destroys the root and secondary volumes and is not supported on Slurm controller nodes). Before suggesting any state-changing CLI command, ask whether the resource is IaC-managed: HyperPod clusters, security groups, EKS access entries, and IAM are usually provisioned via CloudFormation, CDK, or Terraform. If yes, the fix belongs in IaC — running t…
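
As an illustration of the read-only rule above, here is a minimal bash sketch of how a state-changing command could be printed for the customer rather than executed. The helper names `is_read_only` and `suggest_command` are hypothetical and are not part of the diagnostic script; the verb-prefix heuristic is an assumption and would need review before real use. Nothing in the sketch calls AWS.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the operating policy; not taken from the
# diagnostic script itself.

# Assumption: CLI verbs starting with describe-, list-, or get- are read-only.
# Everything else is treated as potentially state-changing.
is_read_only() {
  case "$1" in
    describe-*|list-*|get-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a command for the customer to run. The command is never executed here.
# $1 is a short reason; the remaining arguments are the command and its args.
suggest_command() {
  local reason="$1"; shift
  printf 'Suggested command (run this yourself)\n'
  printf '  Why: %s\n' "$reason"
  printf '  $'
  printf ' %q' "$@"   # %q shell-quotes each argument for safe copy-paste
  printf '\n'
}

# Demo: a read-only verb passes the check; a state-changing one is only printed.
for verb in describe-cluster update-cluster-software; do
  if is_read_only "$verb"; then
    echo "read-only: $verb"
  else
    suggest_command "changes cluster state; confirm it is not IaC-managed first" \
      aws sagemaker "$verb" --cluster-name demo
  fi
done
```

The sketch only prints text, so the decision to run a state-changing command stays with the customer, after the IaC question has been asked.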

\\t' read -r sid az; do\n [[ -z \"$sid\" ]] && continue\n if echo \"$AZ_OFFERINGS\" | tr '\\t' '\\n' | grep -qx \"$az\"; then\n pass \"Subnet $sid (AZ=$az)\" \"$VALIDATE_INSTANCE_TYPE is available\"\n MATCHED=$((MATCHED+1))\n else\n fail \"Subnet $sid (AZ=$az)\" \"$VALIDATE_INSTANCE_TYPE NOT offered here\"\n add_issue \"Subnet $sid AZ=$az does not offer $VALIDATE_INSTANCE_TYPE → references/capacity-planning.md\" \"P0\"\n fi\n done \u003c \u003c(echo \"$SUB_AZ_JSON\" | python3 -c \"\nimport sys, json\nfor s in json.load(sys.stdin):\n print(f\\\"{s.get('SubnetId','')}\\t{s.get('AZ','')}\\\")\n\" 2>/dev/null)\n\n if [[ $MATCHED -eq 0 ]]; then\n warn \"No provided subnet is in an AZ that offers $VALIDATE_INSTANCE_TYPE — cluster creation will fail with Insufficient capacity / No subnets in the capacity AZ\"\n fi\n fi\n fi\n\n if [[ -n \"$VALIDATE_S3_URI\" ]]; then\n header \"V6. S3 Lifecycle Scripts\"\n if [[ ! \"$VALIDATE_S3_URI\" =~ ^s3:// ]]; then\n fail \"S3 URI\" \"must start with s3:// (got '$VALIDATE_S3_URI')\"\n add_issue \"S3 URI is not a valid s3:// URI → references/lifecycle-scripts.md\" \"P0\"\n else\n S3_URI_NORM=\"${VALIDATE_S3_URI%/}/\"\n info \"S3 URI: $S3_URI_NORM\"\n\n S3_LIST=$(aws_check \"s3-ls-$S3_URI_NORM\" \\\n aws s3 ls \"$S3_URI_NORM\" --region \"$REGION\") || S3_LIST=\"\"\n\n if [[ -z \"$S3_LIST\" ]]; then\n fail \"S3 access\" \"cannot list $S3_URI_NORM — bucket missing, permissions denied, or empty prefix\"\n add_issue \"S3 URI not accessible or empty: $S3_URI_NORM → references/lifecycle-scripts.md\" \"P0\"\n else\n pass \"S3 access\" \"prefix is listable\"\n\n if echo \"$S3_LIST\" | grep -q \"on_create.sh\"; then\n pass \"on_create.sh\" \"entry script present\"\n\n TMPFILE=$(mktemp)\n if aws s3 cp \"${S3_URI_NORM}on_create.sh\" \"$TMPFILE\" \\\n --region \"$REGION\" --only-show-errors 2>/dev/null; then\n if file \"$TMPFILE\" | grep -q \"CRLF\"; then\n fail \"on_create.sh\" \"has Windows CRLF line endings — will fail on Linux\"\n add_issue 
\"on_create.sh has CRLF line endings → references/lifecycle-scripts.md\" \"P0\"\n else\n pass \"on_create.sh\" \"Unix line endings\"\n fi\n if head -1 \"$TMPFILE\" | grep -q \"^#!\"; then\n pass \"on_create.sh\" \"shebang present\"\n else\n warn \"on_create.sh\" \"missing shebang (#!/bin/bash)\"\n add_issue \"on_create.sh missing shebang → references/lifecycle-scripts.md\" \"P1\"\n fi\n else\n warn \"on_create.sh\" \"could not download for inspection\"\n fi\n rm -f \"$TMPFILE\"\n else\n fail \"on_create.sh\" \"entry script NOT FOUND at $S3_URI_NORM — cluster creation will fail\"\n add_issue \"Missing on_create.sh at $S3_URI_NORM → references/lifecycle-scripts.md\" \"P0\"\n fi\n\n if echo \"$S3_LIST\" | grep -q \"lifecycle_script.py\"; then\n pass \"Orchestrator script\" \"lifecycle_script.py present (Slurm)\"\n elif echo \"$S3_LIST\" | grep -q \"on_create_main.sh\"; then\n pass \"Orchestrator script\" \"on_create_main.sh present (EKS)\"\n else\n warn \"Orchestrator script\" \"neither lifecycle_script.py (Slurm) nor on_create_main.sh (EKS) found at $S3_URI_NORM\"\n add_issue \"Missing orchestrator-specific lifecycle script at $S3_URI_NORM → references/lifecycle-scripts.md\" \"P1\"\n fi\n fi\n fi\n fi\n\n echo \"\"\n echo -e \"${BOLD}========================================${NC}\"\n echo -e \"${BOLD} VALIDATION SUMMARY ${NC}\"\n echo -e \"${BOLD}========================================${NC}\"\n echo \"\"\n echo -e \" Results: ${RED}${CRITICAL_FAILURES} critical${NC} | ${YELLOW}${WARNINGS} warnings${NC}\"\n echo -e \" Mode: READ-ONLY (no changes made; each [FAIL] points to a references section)\"\n echo \"\"\n if [[ ${#ISSUES_FOUND[@]} -gt 0 ]]; then\n echo -e \"${BOLD} Issues:${NC}\"\n for priority in P0 P1 P2; do\n for issue in \"${ISSUES_FOUND[@]}\"; do\n if [[ \"$issue\" == \"${priority}|\"* ]]; then\n desc=\"${issue#*|}\"\n case \"$priority\" in\n P0) echo -e \" ${RED}[${priority}]${NC} $desc\" ;;\n P1) echo -e \" ${YELLOW}[${priority}]${NC} $desc\" ;;\n P2) echo 
-e \" [${priority}] $desc\" ;;\n esac\n fi\n done\n done\n echo \"\"\n fi\n if [[ $CRITICAL_FAILURES -eq 0 ]]; then\n echo -e \" ${GREEN}${BOLD}Pre-flight validation passed. Safe to create cluster.${NC}\"\n else\n echo -e \" ${RED}${BOLD}Fix P0 issues above before creating the cluster.${NC}\"\n fi\n echo \"\"\n exit \"$([[ $CRITICAL_FAILURES -eq 0 ]] && echo 0 || echo 1)\"\nfi\n\nsection \"HyperPod Cluster Diagnostics (read-only)\"\necho -e \"Cluster: ${BOLD}${CLUSTER}${NC}\"\necho -e \"Region: ${BOLD}${REGION}${NC}\"\necho -e \"Time: $(date -u +\"%Y-%m-%dT%H:%M:%SZ\")\"\necho -e \"${CYAN} No cluster state will be modified. Each issue line below includes a${NC}\"\necho -e \"${CYAN} pointer to references/cluster-diagnostics-detail.md for remediation.${NC}\"\n\nheader \"1. Cluster Identity & Status\"\n\nCLUSTER_JSON=$(aws sagemaker describe-cluster \\\n --cluster-name \"$CLUSTER\" \\\n --region \"$REGION\" \\\n --cli-read-timeout 30 \\\n --output json 2>&1) || {\n echo -e \"${RED}ERROR: Could not describe cluster '$CLUSTER' in region '$REGION'${NC}\"\n echo \"$CLUSTER_JSON\" | head -3\n echo \"\"\n if echo \"$CLUSTER_JSON\" | grep -qiE \"ResourceNotFound|Cluster with name .* not found\"; then\n echo \"Available clusters in $REGION:\"\n aws sagemaker list-clusters --region \"$REGION\" \\\n --query 'ClusterSummaries[*].{Name:ClusterName,Status:ClusterStatus}' \\\n --output table 2>/dev/null || echo \" (unable to list clusters — check IAM)\"\n else\n echo \"Verify:\"\n echo \" 1. Cluster name is correct (use: aws sagemaker list-clusters --region $REGION)\"\n echo \" 2. Region is correct\"\n echo \" 3. IAM permissions include sagemaker:DescribeCluster\"\n fi\n exit 1\n}\n\nCLUSTER_ARN=$(echo \"$CLUSTER_JSON\" | python3 -c \"import sys,json; print(json.load(sys.stdin).get('ClusterArn',''))\" 2>/dev/null)\nCLUSTER_ID=$(echo \"$CLUSTER_ARN\" | awk -F'/' '{print $NF}')\nif [[ -z \"$CLUSTER_ID\" ]]; then\n echo \"ERROR: Could not extract cluster ID from ARN '$CLUSTER_ARN'. 
Verify the cluster name/ARN.\"\n exit 1\nfi\nCLUSTER_STATUS=$(echo \"$CLUSTER_JSON\" | python3 -c \"import sys,json; print(json.load(sys.stdin).get('ClusterStatus','unknown'))\" 2>/dev/null)\nORCHESTRATOR=$(echo \"$CLUSTER_JSON\" | python3 -c \"import sys,json; d=json.load(sys.stdin); o=d.get('Orchestrator',{}); print('EKS' if 'Eks' in o else 'Slurm')\" 2>/dev/null)\nNODE_RECOVERY=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\n# Prefer cluster-level NodeRecovery (the API's canonical location); fall back to\n# per-InstanceGroup only when top-level is absent. Reading only per-group yields\n# 'Unknown' on every cluster because the field is null at group level when set\n# cluster-wide.\ntop=d.get('NodeRecovery')\nif top:\n print(top)\nelse:\n groups=d.get('InstanceGroups',[])\n recoveries={g.get('NodeRecovery') for g in groups if g.get('NodeRecovery')}\n print(','.join(sorted(recoveries)) if recoveries else 'Unknown')\n\" 2>/dev/null || echo \"Unknown\")\n\ninfo \"ARN: $CLUSTER_ARN\"\ninfo \"Cluster ID: $CLUSTER_ID\"\ninfo \"Status: $CLUSTER_STATUS\"\ninfo \"Orchestrator: $ORCHESTRATOR\"\ninfo \"NodeRecovery: $NODE_RECOVERY\"\n\n# Flag auto-recovery disabled regardless of orchestrator.\nif [[ \"$NODE_RECOVERY\" == *\"None\"* && \"$NODE_RECOVERY\" == *\"Automatic\"* ]]; then\n warn \"NodeRecovery\" \"mixed settings — some instance groups have recovery disabled\"\n add_issue \"NodeRecovery disabled on some instance groups → references/cluster-diagnostics-detail.md § G (Node Replacement)\" \"P2\"\nelif [[ \"$NODE_RECOVERY\" == *\"None\"* ]]; then\n warn \"NodeRecovery\" \"disabled on all instance groups — auto-replacement won't trigger\"\n add_issue \"NodeRecovery disabled → references/cluster-diagnostics-detail.md § G (Node Replacement)\" \"P2\"\nfi\n\nCREATION_TIME=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\nct=d.get('CreationTime','')\nprint(ct if ct else '')\n\" 2>/dev/null || echo 
\"\")\n\nLAST_MODIFIED_TIME=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\nlm=d.get('LastModifiedTime','')\nprint(lm if lm else '')\n\" 2>/dev/null || echo \"\")\n\nSTUCK_THRESHOLD_SECONDS=3600\n\nis_stuck() {\n local creation_time=\"$1\"\n if [[ -z \"$creation_time\" ]]; then echo \"false\"; return; fi\n CREATION_TS=\"$creation_time\" THRESHOLD=\"$STUCK_THRESHOLD_SECONDS\" python3 -c \"\nimport os\nfrom datetime import datetime, timezone\nct = os.environ['CREATION_TS']\nthreshold = int(os.environ['THRESHOLD'])\ntry:\n ct=ct.replace('+00:00','Z').rstrip('Z')\n if '.' in ct: ct=ct[:ct.index('.')+7]\n created=datetime.fromisoformat(ct).replace(tzinfo=timezone.utc)\n elapsed=(datetime.now(timezone.utc)-created).total_seconds()\n print('true' if elapsed > threshold else 'false')\nexcept (ValueError, TypeError):\n # Unparseable timestamp — assume not stuck rather than abort the whole run.\n print('false')\n\" 2>/dev/null || echo \"false\"\n}\n\ncase \"$CLUSTER_STATUS\" in\n InService) pass \"Cluster status\" \"InService\" ;;\n Creating)\n STUCK=$(is_stuck \"$CREATION_TIME\")\n if [[ \"$STUCK\" == \"true\" ]]; then\n fail \"Cluster status\" \"Creating for over 1 hour — likely stuck\"\n add_issue \"Cluster stuck in Creating > 1hr → references/cluster-diagnostics-detail.md § E (Cluster Provisioning), § H (CloudFormation)\" \"P0\"\n else\n warn \"Cluster status\" \"Creating — cluster is still being provisioned\"\n add_issue \"Cluster still creating → references/cluster-diagnostics-detail.md § E (Cluster Provisioning)\" \"P1\"\n fi ;;\n Updating)\n STUCK=$(is_stuck \"${LAST_MODIFIED_TIME:-$CREATION_TIME}\")\n if [[ \"$STUCK\" == \"true\" ]]; then\n fail \"Cluster status\" \"Updating — check if operation is stuck\"\n add_issue \"Cluster may be stuck Updating → references/cluster-diagnostics-detail.md § E (Cluster Provisioning), § H (CloudFormation)\" \"P1\"\n else\n warn \"Cluster status\" \"Updating — cluster operation in progress\"\n fi 
;;\n Failed) fail \"Cluster status\" \"Failed — check events and CloudFormation\"; add_issue \"Cluster FAILED → references/cluster-diagnostics-detail.md § E (Cluster Provisioning), § H (CloudFormation)\" \"P0\" ;;\n Deleting)\n STUCK=$(is_stuck \"${LAST_MODIFIED_TIME:-$CREATION_TIME}\")\n if [[ \"$STUCK\" == \"true\" ]]; then\n warn \"Cluster status\" \"Deleting for extended time — may be blocked by VPC ENI dependencies\"\n add_issue \"Cluster stuck Deleting → references/cluster-diagnostics-detail.md § E (Cluster Provisioning)\" \"P1\"\n else\n warn \"Cluster status\" \"Deleting\"\n fi ;;\n RollingBack) warn \"Cluster status\" \"RollingBack — update is being rolled back\"; add_issue \"Cluster RollingBack → references/cluster-diagnostics-detail.md § J (AMI & Cluster Updates)\" \"P1\" ;;\n *RollbackFailed*|*MaintenanceFailed*)\n fail \"Cluster status\" \"$CLUSTER_STATUS — cluster is stuck in a non-recoverable state\"\n add_issue \"Cluster stuck in $CLUSTER_STATUS → references/cluster-diagnostics-detail.md § J (AMI & Cluster Updates)\" \"P0\" ;;\n *) warn \"Cluster status\" \"$CLUSTER_STATUS\" ;;\nesac\n\nEKS_NAME=\"\"\nif [[ \"$ORCHESTRATOR\" == \"EKS\" ]]; then\n EKS_NAME=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\narn=d.get('Orchestrator',{}).get('Eks',{}).get('ClusterArn','')\nprint(arn.split('/')[-1] if arn else '')\n\" 2>/dev/null || echo \"\")\n if [[ -n \"$EKS_NAME\" ]]; then\n info \"EKS Cluster: $EKS_NAME\"\n fi\nfi\n\nheader \"2. 
Instance Groups & Node Health\"\n\necho \"$CLUSTER_JSON\" | python3 -c \"\nimport sys, json\nd = json.load(sys.stdin)\ngroups = d.get('InstanceGroups', [])\nif not groups:\n print(' No instance groups found')\nelse:\n for g in groups:\n name = g.get('InstanceGroupName', '?')\n itype = g.get('InstanceType', '?')\n target = g.get('TargetCount', 0)\n current = g.get('CurrentCount', 0)\n status = g.get('Status', g.get('InstanceGroupStatus', '?'))\n threads = g.get('ThreadsPerCore', '?')\n # TargetStateCount is the count the service is working toward when a\n # resize is in flight; print when it differs from TargetCount.\n tstate = g.get('TargetStateCount', None)\n # Note: NodeRecovery is a cluster-level field in the DescribeCluster\n # response, not per-group; shown on the cluster header line above.\n print(f' {name}: type={itype} target={target} current={current} status={status} threads/core={threads}')\n if tstate is not None and tstate != target:\n print(f' TargetStateCount={tstate} (resize in progress)')\n if current \u003c target:\n print(f' Current count ({current}) \u003c target ({target}) — instances may still be provisioning or failed')\n\" 2>/dev/null\n\n# Check node-level details. 
Paginate — default page is small and large clusters\n# silently truncate, which would break dangling-node reconciliation below.\nfetch_all_cluster_nodes_cd() {\n local merged='[]' token='' page_json combined i=0\n local max_pages=200 # 200 × 100 = 20 000 nodes, supports 7k+ clusters\n while (( i \u003c max_pages )); do\n if [[ -n \"$token\" ]]; then\n page_json=$(aws sagemaker list-cluster-nodes \\\n --cluster-name \"$CLUSTER\" --region \"$REGION\" \\\n --max-results 100 --next-token \"$token\" \\\n --cli-read-timeout 30 --output json 2>&1) || break\n else\n page_json=$(aws sagemaker list-cluster-nodes \\\n --cluster-name \"$CLUSTER\" --region \"$REGION\" \\\n --max-results 100 \\\n --cli-read-timeout 30 --output json 2>&1) || break\n fi\n if echo \"$page_json\" | grep -qiE \"AccessDenied|not authorized|UnauthorizedAccess\"; then\n echo \"__AUTH_DENIED__\"\n return 1\n fi\n # Merge via stdin (NUL-delimited blobs) instead of argv — argv is capped at\n # ARG_MAX (~128KB on Linux), which fails at ~500 nodes of accumulated JSON.\n # Large clusters (7k+) need this path to avoid silent truncation.\n combined=$(printf '%s\\0%s' \"$merged\" \"$page_json\" | python3 -c \"\nimport sys, json\nblob = sys.stdin.buffer.read()\ntry:\n a, b = blob.split(b'\\0', 1)\n merged = json.loads(a)\n page = json.loads(b)\nexcept (json.JSONDecodeError, ValueError):\n sys.exit(2)\nmerged.extend(page.get('ClusterNodeSummaries', []))\nprint(json.dumps(merged))\nprint(page.get('NextToken', ''))\n\" 2>/dev/null) || break\n merged=$(printf '%s\\n' \"$combined\" | sed -n '1p')\n token=$(printf '%s\\n' \"$combined\" | sed -n '2p')\n i=$((i+1))\n [[ -z \"$token\" ]] && break\n done\n if (( i == max_pages )) && [[ -n \"$token\" ]]; then\n # Surface truncation via a marker file — this function runs inside $(...)\n # (command substitution subshell), so add_issue would be lost. 
The parent\n # shell checks for the marker after the call returns.\n echo \"WARN: list-cluster-nodes truncated at ${max_pages} pages (~$((max_pages*100)) nodes). Diagnostic sample is incomplete for very large clusters.\" >&2\n : > \"${_NODE_TRUNC_MARKER:-/dev/null}\" 2>/dev/null || true\n fi\n printf '%s' \"$merged\" | python3 -c \"\nimport sys, json\ntry:\n print(json.dumps({'ClusterNodeSummaries': json.loads(sys.stdin.read())}))\nexcept json.JSONDecodeError:\n print('{\\\"ClusterNodeSummaries\\\":[]}')\n\" 2>/dev/null || echo '{\"ClusterNodeSummaries\":[]}'\n}\n\n_NODE_TRUNC_MARKER=$(mktemp 2>/dev/null) && _CD_TEMP_FILES+=(\"$_NODE_TRUNC_MARKER\") || _NODE_TRUNC_MARKER=\"\"\nexport _NODE_TRUNC_MARKER\nrm -f \"$_NODE_TRUNC_MARKER\" 2>/dev/null || true\n\nNODE_LIST=$(fetch_all_cluster_nodes_cd)\nif [[ \"$NODE_LIST\" == \"__AUTH_DENIED__\" ]]; then\n warn \"list-cluster-nodes\" \"IAM permission denied — add sagemaker:ListClusterNodes to your role\"\n add_issue \"Missing IAM permission for sagemaker:ListClusterNodes → references/cluster-diagnostics-detail.md § D (EKS Access / kubectl)\" \"P1\"\n NODE_LIST='{\"ClusterNodeSummaries\":[]}'\nfi\n\n# Parent-shell follow-up for the truncation marker set inside the subshell.\nif [[ -n \"$_NODE_TRUNC_MARKER\" && -e \"$_NODE_TRUNC_MARKER\" ]]; then\n add_issue \"Node list truncated at 200 pages (~20000 nodes); diagnostic sample incomplete → references/cluster-diagnostics-detail.md § E (Cluster Provisioning)\" \"P2\"\nfi\n\nTOTAL_NODES=$(echo \"$NODE_LIST\" | python3 -c \"import sys,json; print(len(json.load(sys.stdin).get('ClusterNodeSummaries',[])))\" 2>/dev/null || echo 0)\ninfo \"Total nodes reported: $TOTAL_NODES\"\n\nUNHEALTHY_NODES=$(echo \"$NODE_LIST\" | python3 -c \"\nimport sys, json\nnodes = json.load(sys.stdin).get('ClusterNodeSummaries', [])\nunhealthy = [n for n in nodes if n.get('InstanceStatus', {}).get('Status', '') not in ('Running', 'Pending')]\nif unhealthy:\n for n in unhealthy:\n nid = n.get('InstanceId', 
'?')\n group = n.get('InstanceGroupName', '?')\n status = n.get('InstanceStatus', {}).get('Status', '?')\n msg = n.get('InstanceStatus', {}).get('Message', '')\n print(f' {nid} ({group}): {status} {msg}')\n print(f'UNHEALTHY_COUNT={len(unhealthy)}')\nelse:\n print('UNHEALTHY_COUNT=0')\n\" 2>/dev/null || echo \"UNHEALTHY_COUNT=0\")\n\nUNHEALTHY_COUNT=$(echo \"$UNHEALTHY_NODES\" | grep \"^UNHEALTHY_COUNT=\" | cut -d= -f2)\n[[ -z \"$UNHEALTHY_COUNT\" ]] && UNHEALTHY_COUNT=0\necho \"$UNHEALTHY_NODES\" | grep -v \"^UNHEALTHY_COUNT=\" || true\n\nif [[ \"$UNHEALTHY_COUNT\" -gt 0 ]]; then\n warn \"Node health\" \"$UNHEALTHY_COUNT unhealthy node(s)\"\n add_issue \"$UNHEALTHY_COUNT unhealthy node(s) → references/cluster-diagnostics-detail.md § G (Node Replacement); delegate to hyperpod-node-debugger\" \"P1\"\n\n echo \"$NODE_LIST\" | python3 -c \"\nimport sys, json\nfrom collections import defaultdict\nnodes = json.load(sys.stdin).get('ClusterNodeSummaries', [])\ngroups = defaultdict(lambda: {'total': 0, 'unhealthy': 0})\nfor n in nodes:\n g = n.get('InstanceGroupName', 'unknown')\n groups[g]['total'] += 1\n st = n.get('InstanceStatus', {}).get('Status', '')\n if st not in ('Running', 'Pending', ''):\n groups[g]['unhealthy'] += 1\nfor g, c in groups.items():\n if c['unhealthy'] > 0:\n pct = int(c['unhealthy'] / c['total'] * 100) if c['total'] > 0 else 0\n print(f' [WARN] Group {g}: {c[\\\"unhealthy\\\"]}/{c[\\\"total\\\"]} unhealthy ({pct}%)')\n\" 2>/dev/null\n\nelif [[ \"$TOTAL_NODES\" -eq 0 && \"$CLUSTER_STATUS\" == \"InService\" ]]; then\n warn \"Node health\" \"Cluster InService but 0 nodes reported\"\n add_issue \"Cluster InService but no nodes → references/cluster-diagnostics-detail.md § E (Cluster Provisioning)\" \"P1\"\nelse\n pass \"Node health\" \"$TOTAL_NODES node(s), $UNHEALTHY_COUNT unhealthy\"\nfi\n\nheader \"3. Cluster Events (Recent)\"\n\n# Paginate up to 5 pages (500 events) so the event scan covers incident windows\n# longer than the default page. 
Long-lived clusters with rolling replacements\n# regularly generate >100 events.\nfetch_cluster_events_cd() {\n local merged='[]' token='' page_json combined i=0 denied=0\n while (( i \u003c 5 )); do\n if [[ -n \"$token\" ]]; then\n page_json=$(aws sagemaker list-cluster-events \\\n --cluster-name \"$CLUSTER\" --region \"$REGION\" \\\n --max-results 100 --next-token \"$token\" \\\n --cli-read-timeout 30 --output json 2>&1) || break\n else\n page_json=$(aws sagemaker list-cluster-events \\\n --cluster-name \"$CLUSTER\" --region \"$REGION\" \\\n --max-results 100 \\\n --cli-read-timeout 30 --output json 2>&1) || break\n fi\n if echo \"$page_json\" | grep -qi \"AccessDenied\\|not authorized\"; then\n denied=1\n break\n fi\n combined=$(python3 -c \"\nimport sys, json\ntry:\n prev = json.loads(sys.argv[1])\n page = json.loads(sys.argv[2])\nexcept json.JSONDecodeError:\n # Malformed page response — stop paginating; caller falls through on break.\n sys.exit(2)\nprev.extend(page.get('ClusterEventSummaries', []))\nprint(json.dumps(prev))\nprint(page.get('NextToken',''))\n\" \"$merged\" \"$page_json\" 2>/dev/null) || break\n\n merged=$(printf '%s\\n' \"$combined\" | sed -n '1p')\n\n token=$(printf '%s\\n' \"$combined\" | sed -n '2p')\n i=$((i+1))\n [[ -z \"$token\" ]] && break\n done\n if (( denied )); then\n echo \"__AUTH_DENIED__\"\n return 1\n fi\n python3 -c \"import sys, json; print(json.dumps({'ClusterEventSummaries': json.loads(sys.argv[1])}))\" \"$merged\" \\\n 2>/dev/null || echo '{\"ClusterEventSummaries\":[]}'\n}\n\nEVENTS_JSON=$(fetch_cluster_events_cd)\nif [[ \"$EVENTS_JSON\" == \"__AUTH_DENIED__\" ]]; then\n warn \"list-cluster-events\" \"IAM permission denied — add sagemaker:ListClusterEvents to your role\"\n EVENTS_JSON='{\"ClusterEventSummaries\":[]}'\nfi\n\nEVENT_COUNT=$(echo \"$EVENTS_JSON\" | python3 -c \"import sys,json; print(len(json.load(sys.stdin).get('ClusterEventSummaries',[])))\" 2>/dev/null || echo 0)\n\nif [[ \"$EVENT_COUNT\" -eq 0 ]]; then\n 
info \"No cluster events found\"\n if [[ \"$ORCHESTRATOR\" == \"Slurm\" ]]; then\n info \"(Cluster events may not be available for HyperPod Slurm clusters)\"\n fi\nelse\n echo \"$EVENTS_JSON\" | python3 -c \"\nimport sys, json\nevents = json.load(sys.stdin).get('ClusterEventSummaries', [])\n\n# Issue pattern mapping\nISSUE_PATTERNS = {\n 'EFA health checks': 'EFA health check failure → references/cluster-diagnostics-detail.md § A',\n 'Insufficient capacity': 'Capacity error → references/cluster-diagnostics-detail.md § B',\n 'No subnets in the capacity': 'AZ/subnet mismatch → references/cluster-diagnostics-detail.md § B',\n 'Lifecycle scripts did not run': 'Lifecycle script failure → references/cluster-diagnostics-detail.md § C',\n 'Lifecycle scripts execution timed out': 'Lifecycle script timeout → references/cluster-diagnostics-detail.md § C',\n 'network misconfiguration': 'Network misconfiguration → references/cluster-diagnostics-detail.md § A + § B',\n 'hardware failure': 'Hardware failure → delegate to node-debugger',\n 'Failed to provision': 'Provisioning failure → references/cluster-diagnostics-detail.md § B or § E',\n 'replace': 'Node replacement activity → references/cluster-diagnostics-detail.md § G',\n 'reboot': 'Node reboot activity → references/cluster-diagnostics-detail.md § G',\n}\n\nfor e in events[:20]:\n ts = str(e.get('EventTime', '?'))[:19]\n etype = e.get('EventType', '?')\n msg = e.get('Message', '?')[:120]\n print(f' [{ts}] {etype}: {msg}')\n\n msg_lower = (e.get('Message','') or '').lower()\n for pattern, hint in ISSUE_PATTERNS.items():\n if pattern.lower() in msg_lower:\n print(f' [ISSUE] {hint}')\n break\n\" 2>/dev/null\nfi\n\nheader \"4. 
VPC & Security Group Configuration\"\n\nSUBNET_IDS=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\nprint(' '.join(d.get('VpcConfig',{}).get('Subnets',[])))\n\" 2>/dev/null || echo \"\")\n\nSG_IDS=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\nprint(' '.join(d.get('VpcConfig',{}).get('SecurityGroupIds',[])))\n\" 2>/dev/null || echo \"\")\n\nif [[ -z \"$SUBNET_IDS\" ]]; then\n warn \"VpcConfig\" \"No VpcConfig found in cluster\"\nelse\n info \"Subnets: $SUBNET_IDS\"\n info \"Security Groups: $SG_IDS\"\n\n IFS=' ' read -ra _subnet_ids_arr \u003c\u003c\u003c \"$SUBNET_IDS\"\n SUBNET_JSON=$(aws ec2 describe-subnets \\\n --subnet-ids \"${_subnet_ids_arr[@]}\" \\\n --region \"$REGION\" \\\n --cli-read-timeout 30 \\\n --output json 2>&1) || {\n SUB_ERR=\"$SUBNET_JSON\"\n if echo \"$SUB_ERR\" | grep -qi \"AccessDenied\\|UnauthorizedOperation\\|not authorized\"; then\n warn \"describe-subnets\" \"IAM permission denied — add ec2:DescribeSubnets to your role\"\n fi\n SUBNET_JSON='{\"Subnets\":[]}'\n }\n\n _SUBNET_CHECK=$(echo \"$SUBNET_JSON\" | python3 -c \"\nimport sys, json\nsubnets = json.load(sys.stdin).get('Subnets', [])\nvpcs = set()\nfor s in subnets:\n sid = s.get('SubnetId', '?')\n vpc = s.get('VpcId', '?')\n az = s.get('AvailabilityZone', '?')\n free = s.get('AvailableIpAddressCount', 0)\n flag = ' LOW IPs' if free \u003c 10 else ''\n print(f' {sid}: VPC={vpc} AZ={az} FreeIPs={free}{flag}')\n vpcs.add(vpc)\nif len(vpcs) > 1:\n print('MULTI_VPC=true')\n print('VPC_LIST=' + ','.join(vpcs))\nelse:\n print('MULTI_VPC=false')\n v = vpcs.pop() if vpcs else '?'\n print('VPC_ID=' + v)\n\" 2>/dev/null || echo \"\")\n\n while IFS= read -r line; do\n if [[ \"$line\" == \"MULTI_VPC=true\" ]]; then\n fail \"Subnet VPC alignment\" \"Subnets are in DIFFERENT VPCs — all must be in the same VPC\"\n add_issue \"Subnets in different VPCs → references/cluster-diagnostics-detail.md § B (Capacity & AZ)\" \"P0\"\n fi\n 
if [[ \"$line\" != MULTI_VPC=* && \"$line\" != VPC_ID=* && \"$line\" != VPC_LIST=* ]]; then\n echo \"$line\"\n fi\n done \u003c\u003c\u003c \"$_SUBNET_CHECK\"\n\n # SG self-referencing rules are an EFA requirement.\n # shellcheck disable=SC2086 # intentional word-split on space-separated SG IDs\n for SG in $SG_IDS; do\n SG_RESULT=$(aws ec2 describe-security-groups \\\n --group-ids \"$SG\" \\\n --region \"$REGION\" \\\n --cli-read-timeout 30 \\\n --output json 2>&1)\n if echo \"$SG_RESULT\" | grep -qiE \"AccessDenied|UnauthorizedOperation\"; then\n warn \"describe-security-groups\" \"IAM permission denied for $SG — SG check skipped\"\n continue\n fi\n SG_JSON=\"${SG_RESULT}\"\n [[ -z \"$SG_JSON\" || \"$SG_JSON\" == *\"error\"* ]] && SG_JSON='{\"SecurityGroups\":[]}'\n\n _SG_CHECK=$(echo \"$SG_JSON\" | check_sg_self_ref \"$SG\")\n\n while IFS= read -r line; do\n [[ -z \"$line\" ]] && continue\n level=$(echo \"$line\" | cut -d: -f1)\n msg=$(echo \"$line\" | cut -d: -f2-)\n case \"$level\" in\n PASS) pass \"$msg\" ;;\n FAIL) fail \"$msg\"\n if echo \"$msg\" | grep -q \"Inbound self-ref MISSING\"; then\n add_issue \"Security group $SG inbound self-ref MISSING → references/cluster-diagnostics-detail.md § A (EFA Health Checks)\" \"P0\"\n elif echo \"$msg\" | grep -q \"Outbound self-ref MISSING\"; then\n add_issue \"Security group $SG outbound self-ref MISSING → references/cluster-diagnostics-detail.md § A (EFA Health Checks)\" \"P0\"\n elif echo \"$msg\" | grep -q \"Outbound 0.0.0.0/0 missing\"; then\n add_issue \"Security group $SG outbound 0.0.0.0/0 MISSING → references/cluster-diagnostics-detail.md § A (EFA Health Checks)\" \"P0\"\n else\n add_issue \"Security group $SG rule missing → references/cluster-diagnostics-detail.md § A (EFA Health Checks)\" \"P0\"\n fi\n ;;\n WARN) warn \"$msg\" ;;\n SKIP) info \"$msg\" ;;\n esac\n done \u003c\u003c\u003c \"$_SG_CHECK\"\n done\nfi\n\nheader \"4b. 
Instance Quotas\"\n\nINSTANCE_TYPES=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys,json\nd=json.load(sys.stdin)\ntypes=set(g.get('InstanceType','') for g in d.get('InstanceGroups',[]))\nprint(' '.join(t for t in types if t))\n\" 2>/dev/null || echo \"\")\n\nif [[ -n \"$INSTANCE_TYPES\" ]]; then\n # One paginated list-service-quotas call, cached across all instance types.\n # The API is account/region rate-limited and throttles if called per-type.\n QUOTA_ALL=\"\"\n QUOTA_ERR=\"\"\n _next=\"\"\n for _pg in 1 2 3 4 5; do\n if [[ -n \"$_next\" ]]; then\n _raw=$(aws service-quotas list-service-quotas \\\n --service-code sagemaker --region \"$REGION\" \\\n --cli-read-timeout 15 --starting-token \"$_next\" \\\n --output json 2>&1 || true)\n else\n _raw=$(aws service-quotas list-service-quotas \\\n --service-code sagemaker --region \"$REGION\" \\\n --cli-read-timeout 15 \\\n --output json 2>&1 || true)\n fi\n # Order matters: test for specific errors first, then fall through to\n # generic \"not JSON\" check, so throttled responses don't get misclassified.\n if echo \"$_raw\" | grep -qiE \"AccessDenied|UnauthorizedOperation\"; then\n QUOTA_ERR=\"denied\"; break\n elif echo \"$_raw\" | grep -qiE \"TooManyRequestsException|ThrottlingException|RequestLimitExceeded|exceeded the rate\"; then\n QUOTA_ERR=\"throttled\"; break\n elif ! 
echo \"$_raw\" | head -c 1 | grep -q '{'; then\n QUOTA_ERR=\"api-error\"; break\n fi\n _pg_quotas=$(echo \"$_raw\" | python3 -c \"import sys,json; d=json.load(sys.stdin); print(json.dumps(d.get('Quotas',[])))\" 2>/dev/null || echo \"[]\")\n if [[ \"$_pg_quotas\" != \"[]\" ]]; then\n if [[ -z \"$QUOTA_ALL\" ]]; then\n QUOTA_ALL=\"$_pg_quotas\"\n else\n QUOTA_ALL=$(python3 -c \"import sys,json; a=json.loads(sys.argv[1]); b=json.loads(sys.argv[2]); print(json.dumps(a+b))\" \"$QUOTA_ALL\" \"$_pg_quotas\")\n fi\n fi\n _next=$(echo \"$_raw\" | python3 -c \"import sys,json; print(json.load(sys.stdin).get('NextToken','') or '')\" 2>/dev/null || echo \"\")\n [[ -z \"$_next\" ]] && break\n done\n\n case \"$QUOTA_ERR\" in\n denied) warn \"list-service-quotas\" \"IAM permission denied — quota check skipped\" ;;\n throttled) warn \"list-service-quotas\" \"Throttled — quota check skipped (retry later)\" ;;\n api-error) warn \"list-service-quotas\" \"API call failed — quota check skipped\" ;;\n esac\n\n if [[ -n \"$QUOTA_ALL\" && -z \"$QUOTA_ERR\" ]]; then\n for ITYPE in $INSTANCE_TYPES; do\n QUOTA_VAL=$(python3 -c \"\nimport sys, json\nquotas = json.loads(sys.argv[1])\nitype = sys.argv[2]\n# Match quotas that reference the instance type AND HyperPod\nmatches = [q for q in quotas if itype in q.get('QuotaName','') and 'HyperPod' in q.get('QuotaName','')]\nif matches:\n q = matches[0]\n print(f\\\"{q.get('QuotaName','?')}: {int(q.get('Value',0))}\\\")\nelse:\n print('NOT_FOUND')\n\" \"$QUOTA_ALL\" \"$ITYPE\" 2>/dev/null || echo \"NOT_FOUND\")\n if [[ \"$QUOTA_VAL\" == \"NOT_FOUND\" ]]; then\n info \"Quota for $ITYPE: not found in the SageMaker quota list (check Service Quotas console)\"\n else\n info \"Quota: $QUOTA_VAL\"\n fi\n done\n fi\nelse\n info \"No instance types found in cluster config\"\nfi\n\nif [[ \"$ORCHESTRATOR\" == \"EKS\" && -n \"$EKS_NAME\" ]]; then\n header \"5. 
EKS Configuration\"\n\n EKS_AUTH=$(aws eks describe-cluster \\\n --name \"$EKS_NAME\" \\\n --region \"$REGION\" \\\n --query 'cluster.accessConfig.authenticationMode' \\\n --output text 2>/dev/null || echo \"unknown\")\n\n if [[ \"$EKS_AUTH\" == \"CONFIG_MAP\" ]]; then\n warn \"EKS auth mode\" \"CONFIG_MAP-only — access entries require API or API_AND_CONFIG_MAP\"\n add_issue \"EKS auth mode is CONFIG_MAP — access entries unavailable until switched (see EKS access-entries docs) → references/cluster-diagnostics-detail.md § D (EKS Access / kubectl)\" \"P2\"\n elif [[ \"$EKS_AUTH\" == \"API\" || \"$EKS_AUTH\" == \"API_AND_CONFIG_MAP\" ]]; then\n pass \"EKS auth mode\" \"$EKS_AUTH\"\n else\n warn \"EKS auth mode\" \"Could not determine ($EKS_AUTH)\"\n fi\n\n # Check access entries for current identity. AWS CLI paginates JSON output by\n # token, so paginate explicitly to handle accounts with many principals.\n info \"Current IAM identity: $CALLER_ARN\"\n\n fetch_all_access_entries() {\n local merged='[]' token='' page_json combined i=0\n while (( i \u003c 20 )); do\n if [[ -n \"$token\" ]]; then\n page_json=$(aws eks list-access-entries --cluster-name \"$EKS_NAME\" --region \"$REGION\" \\\n --next-token \"$token\" --output json 2>/dev/null) || break\n else\n page_json=$(aws eks list-access-entries --cluster-name \"$EKS_NAME\" --region \"$REGION\" \\\n --output json 2>/dev/null) || break\n fi\n combined=$(python3 -c \"\nimport sys, json\nprev = json.loads(sys.argv[1])\npage = json.loads(sys.argv[2])\nprev.extend(page.get('accessEntries', []))\nprint(json.dumps(prev))\nprint(page.get('nextToken',''))\n\" \"$merged\" \"$page_json\" 2>/dev/null) || break\n\n merged=$(printf '%s\\n' \"$combined\" | sed -n '1p')\n\n token=$(printf '%s\\n' \"$combined\" | sed -n '2p')\n i=$((i+1))\n [[ -z \"$token\" ]] && break\n done\n echo \"$merged\"\n }\n ACCESS_ENTRIES=$(fetch_all_access_entries)\n [[ -z \"$ACCESS_ENTRIES\" ]] && ACCESS_ENTRIES='[]'\n\n ENTRY_COUNT=$(echo 
\"$ACCESS_ENTRIES\" | python3 -c \"import sys,json; print(len(json.load(sys.stdin)))\" 2>/dev/null || echo 0)\n info \"Access entries: $ENTRY_COUNT configured\"\n\n # Strip session name for role-based ARNs\n CALLER_BASE=$(echo \"$CALLER_ARN\" | python3 -c \"\nimport sys\narn = sys.stdin.read().strip()\n# Convert assumed-role ARN to role ARN for matching\n# arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION -> arn:aws:iam::ACCOUNT:role/ROLE\nif ':assumed-role/' in arn:\n parts = arn.split(':')\n role_path = parts[-1].replace('assumed-role/', 'role/')\n role_path = '/'.join(role_path.split('/')[:2]) # remove session name\n parts[-1] = role_path\n parts[2] = 'iam'\n parts[3] = '' # IAM ARNs have no region\n print(':'.join(parts))\nelse:\n print(arn)\n\" 2>/dev/null || echo \"$CALLER_ARN\")\n\n HAS_ACCESS=$(echo \"$ACCESS_ENTRIES\" | CALLER_BASE_ENV=\"$CALLER_BASE\" python3 -c \"\nimport sys, json, os\nentries = json.load(sys.stdin)\ncaller = os.environ['CALLER_BASE_ENV']\nfound = any(caller in str(e) for e in entries)\nprint('true' if found else 'false')\n\" 2>/dev/null || echo \"false\")\n\n if [[ \"$HAS_ACCESS\" == \"true\" ]]; then\n pass \"EKS access entry\" \"current identity has an access entry\"\n else\n warn \"EKS access entry\" \"current identity ($CALLER_BASE) may not have an access entry — kubectl may fail\"\n add_issue \"Current IAM identity may lack EKS access → references/cluster-diagnostics-detail.md § D (EKS Access / kubectl)\" \"P1\"\n fi\n\n if command -v kubectl &>/dev/null; then\n KUBECTL_TEST=$(kubectl cluster-info 2>&1 || true)\n if echo \"$KUBECTL_TEST\" | grep -q \"Kubernetes control plane\\|running at\"; then\n pass \"kubectl connectivity\" \"can reach EKS API server\"\n\n if kubectl get namespace aws-hyperpod &>/dev/null 2>&1; then\n pass \"aws-hyperpod namespace\" \"exists\"\n else\n warn \"aws-hyperpod namespace\" \"missing → references/cluster-diagnostics-detail.md § D (EKS Access / kubectl)\"\n fi\n\n # Node count. 
Note: `wc -l` never fails; avoid `|| echo 0` which would produce \"0\\n0\".\n K8S_NODE_COUNT=$(kubectl get nodes --no-headers 2>/dev/null | wc -l | tr -d ' ')\n K8S_NODE_COUNT=${K8S_NODE_COUNT:-0}\n info \"Kubernetes nodes visible: $K8S_NODE_COUNT\"\n\n if [[ \"$K8S_NODE_COUNT\" -eq 0 && \"$TOTAL_NODES\" -gt 0 ]]; then\n warn \"K8s nodes\" \"0 K8s nodes but $TOTAL_NODES HyperPod nodes — nodes may not have registered with EKS\"\n add_issue \"Nodes not visible in kubectl → references/cluster-diagnostics-detail.md § E (Cluster Provisioning)\" \"P1\"\n fi\n\n HEALTH_LABELS=$(kubectl get nodes -o custom-columns='NODE:.metadata.name,HEALTH:.metadata.labels.sagemaker\\.amazonaws\\.com/node-health-status' --no-headers 2>/dev/null || true)\n if [[ -n \"$HEALTH_LABELS\" ]]; then\n UNHEALTHY_K8S=$(echo \"$HEALTH_LABELS\" | grep -v \"\u003cnone>\" | grep -viE \"Schedulable$\" || true)\n if [[ -n \"$UNHEALTHY_K8S\" ]]; then\n warn \"EKS node health labels\" \"non-schedulable nodes detected:\"\n echo \"$UNHEALTHY_K8S\" | while IFS= read -r line; do info \" $line\"; done\n add_issue \"EKS nodes with health issues → delegate to hyperpod-node-debugger skill; references/cluster-diagnostics-detail.md § G (Node Replacement)\" \"P1\"\n else\n pass \"EKS node health labels\" \"all nodes schedulable\"\n fi\n fi\n\n # Dangling node detection — nodes visible in EKS but not in HyperPod list\n # (or vice versa). 
Happens after failed scale-up, rollback, or orphaned\n # kubelet registrations.\n if [[ \"$K8S_NODE_COUNT\" -gt 0 && \"$TOTAL_NODES\" -gt 0 ]]; then\n HP_INSTANCES=$(echo \"$NODE_LIST\" | python3 -c \"\nimport sys,json\nnodes=json.load(sys.stdin).get('ClusterNodeSummaries',[])\nfor n in nodes:\n iid=n.get('InstanceId','')\n if iid: print(iid)\n\" 2>/dev/null | sort -u)\n EKS_INSTANCES=$(kubectl get nodes -l sagemaker.amazonaws.com/compute-type=hyperpod \\\n -o jsonpath='{range .items[*]}{.spec.providerID}{\"\\n\"}{end}' 2>/dev/null \\\n | awk -F/ '{print $NF}' | grep -E '^i-' | sort -u || true)\n if [[ -n \"$HP_INSTANCES\" && -n \"$EKS_INSTANCES\" ]]; then\n DANGLING=$(comm -13 \u003c(echo \"$HP_INSTANCES\") \u003c(echo \"$EKS_INSTANCES\"))\n ORPHANED=$(comm -23 \u003c(echo \"$HP_INSTANCES\") \u003c(echo \"$EKS_INSTANCES\"))\n if [[ -n \"$DANGLING\" ]]; then\n warn \"Dangling nodes\" \"visible in EKS but not in HyperPod ($(echo \"$DANGLING\" | wc -l))\"\n echo \"$DANGLING\" | head -5 | while IFS= read -r iid; do info \" EKS-only: $iid\"; done\n add_issue \"Dangling EKS nodes (present in kubectl, absent from list-cluster-nodes) → references/cluster-diagnostics-detail.md § K (Dangling Nodes & Cleanup)\" \"P1\"\n fi\n if [[ -n \"$ORPHANED\" ]]; then\n warn \"Orphaned HyperPod nodes\" \"visible in HyperPod but not in EKS ($(echo \"$ORPHANED\" | wc -l))\"\n echo \"$ORPHANED\" | head -5 | while IFS= read -r iid; do info \" HyperPod-only: $iid\"; done\n add_issue \"HyperPod nodes not registered in EKS → references/cluster-diagnostics-detail.md § E (Cluster Provisioning); delegate to hyperpod-node-debugger\" \"P1\"\n fi\n [[ -z \"$DANGLING\" && -z \"$ORPHANED\" ]] && pass \"Node reconciliation\" \"EKS and HyperPod views match\"\n fi\n fi\n\n # EKS add-on health — VPC CNI, CoreDNS, kube-proxy failures break pod networking.\n # Add-on count is small in practice (\u003c10) so a single page of 100 is always sufficient.\n if [[ -n \"$EKS_NAME\" ]]; then\n ADDON_JSON=$(aws eks 
list-addons --cluster-name \"$EKS_NAME\" --region \"$REGION\" \\\n --max-results 100 --output json 2>/dev/null || echo '{\"addons\":[]}')\n ADDON_NAMES=$(echo \"$ADDON_JSON\" | python3 -c \"\nimport sys,json\nprint('\\n'.join(json.load(sys.stdin).get('addons',[])))\n\" 2>/dev/null)\n DEGRADED_ADDONS=\"\"\n while IFS= read -r addon; do\n [[ -z \"$addon\" ]] && continue\n A_STATUS=$(aws eks describe-addon --cluster-name \"$EKS_NAME\" --addon-name \"$addon\" \\\n --region \"$REGION\" --query 'addon.status' --output text 2>/dev/null || echo \"UNKNOWN\")\n if [[ \"$A_STATUS\" != \"ACTIVE\" && \"$A_STATUS\" != \"UPDATING\" ]]; then\n DEGRADED_ADDONS+=\"$addon($A_STATUS) \"\n fi\n done \u003c\u003c\u003c \"$ADDON_NAMES\"\n if [[ -n \"$DEGRADED_ADDONS\" ]]; then\n warn \"EKS add-ons\" \"not ACTIVE: $DEGRADED_ADDONS\"\n add_issue \"EKS add-on(s) degraded: $DEGRADED_ADDONS → references/cluster-diagnostics-detail.md § D (EKS Access / kubectl)\" \"P1\"\n else\n [[ -n \"$ADDON_NAMES\" ]] && pass \"EKS add-ons\" \"$(echo \"$ADDON_NAMES\" | wc -l) add-on(s) ACTIVE\"\n fi\n fi\n\n # aws-auth ConfigMap legacy check — deprecated but still load-bearing if cluster auth mode\n # is API_AND_CONFIG_MAP or CONFIG_MAP. 
Misconfigured entries here can shadow access entries.\n if [[ -n \"$EKS_NAME\" ]]; then\n AUTH_MODE=$(aws eks describe-cluster --name \"$EKS_NAME\" --region \"$REGION\" \\\n --query 'cluster.accessConfig.authenticationMode' --output text 2>/dev/null || echo \"\")\n if [[ \"$AUTH_MODE\" == \"CONFIG_MAP\" || \"$AUTH_MODE\" == \"API_AND_CONFIG_MAP\" ]]; then\n if kubectl -n kube-system get configmap aws-auth >/dev/null 2>&1; then\n AUTH_ENTRIES=$(kubectl -n kube-system get configmap aws-auth -o jsonpath='{.data.mapRoles}' 2>/dev/null | grep -c \"^\" || true)\n AUTH_ENTRIES=${AUTH_ENTRIES:-0}\n info \"aws-auth ConfigMap: $AUTH_ENTRIES mapRoles entries (auth mode: $AUTH_MODE)\"\n if [[ \"$AUTH_MODE\" == \"API_AND_CONFIG_MAP\" ]]; then\n warn \"aws-auth ConfigMap\" \"both ConfigMap and access entries in use — ConfigMap entries can shadow access entries; recommend migrating to API-only mode\"\n fi\n fi\n fi\n fi\n else\n warn \"kubectl connectivity\" \"cannot reach EKS API — check kubeconfig and access entries\"\n add_issue \"kubectl cannot reach EKS → references/cluster-diagnostics-detail.md § D (EKS Access / kubectl)\" \"P1\"\n fi\n else\n info \"kubectl not installed — skipping Kubernetes checks\"\n fi\nelse\n header \"5. Slurm Checks\"\n info \"Orchestrator: Slurm\"\n\n # Warn/issue emitted in section 1; this branch is the PASS-only confirmation.\n if [[ \"$NODE_RECOVERY\" == *\"Automatic\"* ]] && [[ \"$NODE_RECOVERY\" != *\"None\"* ]]; then\n pass \"NodeRecovery\" \"enabled on all instance groups\"\n fi\n\n if command -v session-manager-plugin &>/dev/null && [[ -n \"$CLUSTER_ID\" ]]; then\n header \"5b. 
Slurm Controller Health (via SSM)\"\n HEAD_NODE_ID=$(echo \"$NODE_LIST\" | python3 -c \"\nimport sys,json\nnodes=json.load(sys.stdin).get('ClusterNodeSummaries',[])\nfor n in nodes:\n g=n.get('InstanceGroupName','').lower()\n if any(x in g for x in ['controller','head','master','login']):\n print(n.get('InstanceId',''))\n break\nelse:\n if nodes:\n print(nodes[0].get('InstanceId',''))\n\" 2>/dev/null || echo \"\")\n\n if [[ -n \"$HEAD_NODE_ID\" ]]; then\n HEAD_GROUP=$(echo \"$NODE_LIST\" | HEAD_NODE_ID_ENV=\"$HEAD_NODE_ID\" python3 -c \"\nimport sys,json,os\ntarget_id = os.environ['HEAD_NODE_ID_ENV']\nnodes=json.load(sys.stdin).get('ClusterNodeSummaries',[])\nfor n in nodes:\n if n.get('InstanceId','') == target_id:\n print(n.get('InstanceGroupName',''))\n break\n\" 2>/dev/null || echo \"\")\n if [[ -z \"$HEAD_GROUP\" ]]; then\n warn \"Controller node\" \"could not resolve instance-group name — SSM check skipped\"\n HEAD_NODE_ID=\"\"\n fi\n fi\n if [[ -n \"$HEAD_NODE_ID\" ]]; then\n SSM_TARGET=\"sagemaker-cluster:${CLUSTER_ID}_${HEAD_GROUP}-${HEAD_NODE_ID}\"\n info \"Controller node: $HEAD_NODE_ID ($HEAD_GROUP)\"\n info \"SSM target: $SSM_TARGET\"\n\n _slurm_nonce=$(date +%s%N 2>/dev/null || echo \"$RANDOM\")\n # Validate nonce is numeric to prevent injection in remote command;\n # fall back to the shell PID, which is always numeric.\n if [[ ! \"$_slurm_nonce\" =~ ^[0-9]+$ ]]; then\n _slurm_nonce=\"$$\"\n fi\n SLURM_SH=$(cat \u003c\u003cEOF\nscontrol show config >/dev/null 2>&1\nif [ \\$? 
-eq 0 ]; then echo SLURM_OK_${_slurm_nonce}; else echo SLURM_DOWN_${_slurm_nonce}; fi\necho NODES_START_${_slurm_nonce}\nsinfo -o '%N %T %30E' --noheader 2>/dev/null | head -20\necho NODES_END_${_slurm_nonce}\necho JOBS_START_${_slurm_nonce}\nsqueue -o '%i %j %T %R' --noheader 2>/dev/null | grep -iE 'COMPLETING|CONFIGURING|PENDING' | head -10 || true\necho JOBS_END_${_slurm_nonce}\necho MUNGE_${_slurm_nonce}\nsystemctl is-active munge 2>/dev/null || echo munge_inactive\necho END_${_slurm_nonce}\nEOF\n)\n STDOUT=$(ssm_run_on_node \"$HEAD_NODE_ID\" \"$HEAD_GROUP\" \"$SLURM_SH\" || echo \"\")\n\n if [[ -n \"$STDOUT\" ]]; then\n if echo \"$STDOUT\" | grep -q \"SLURM_OK_${_slurm_nonce}\"; then\n pass \"slurmctld\" \"responsive\"\n elif echo \"$STDOUT\" | grep -q \"SLURM_DOWN_${_slurm_nonce}\"; then\n fail \"slurmctld\" \"not responding — all Slurm operations blocked\"\n add_issue \"slurmctld down on controller → references/cluster-operations.md § 8 Slurm — controller operations\" \"P0\"\n fi\n\n SLURM_DOWN_NODES=$(echo \"$STDOUT\" | sed -n \"/^NODES_START_${_slurm_nonce}\\$/,/^NODES_END_${_slurm_nonce}\\$/p\" | grep -v \"^NODES_\" | grep -iE \"down|drain|fail\" || true)\n if [[ -n \"$SLURM_DOWN_NODES\" ]]; then\n warn \"Slurm nodes with issues:\"\n echo \"$SLURM_DOWN_NODES\" | while IFS= read -r line; do info \" $line\"; done\n S_DOWN_COUNT=$(echo \"$SLURM_DOWN_NODES\" | grep -c . 
; :)\n S_DOWN_COUNT=${S_DOWN_COUNT:-0}\n add_issue \"$S_DOWN_COUNT Slurm node(s) down/drained → references/cluster-diagnostics-detail.md § G (Node Replacement); delegate to hyperpod-node-debugger\" \"P1\"\n else\n pass \"Slurm nodes\" \"all idle/alloc/mixed\"\n fi\n\n STUCK_JOBS=$(echo \"$STDOUT\" | sed -n \"/^JOBS_START_${_slurm_nonce}\\$/,/^JOBS_END_${_slurm_nonce}\\$/p\" | grep -v \"^JOBS_\" || true)\n if [[ -n \"$STUCK_JOBS\" ]]; then\n warn \"Stuck Slurm jobs detected:\"\n echo \"$STUCK_JOBS\" | while IFS= read -r line; do info \" $line\"; done\n add_issue \"Stuck Slurm jobs → references/cluster-operations.md § 8 Slurm — controller operations\" \"P1\"\n fi\n\n if echo \"$STDOUT\" | sed -n \"/^MUNGE_${_slurm_nonce}\\$/,/^END_${_slurm_nonce}\\$/p\" | grep -q \"munge_inactive\"; then\n fail \"munge\" \"authentication service not running — Slurm auth will fail\"\n add_issue \"munge service inactive on controller → references/cluster-operations.md § 8 Slurm — controller operations\" \"P0\"\n fi\n else\n info \"Could not get output from SSM on controller — check ssm:StartSession permission, session-manager-plugin, or node reachability\"\n fi\n else\n info \"Could not identify controller node from node list\"\n fi\n else\n info \"SSM plugin not available — Slurm checks require SSM access to controller\"\n info \"Install SSM plugin to enable Slurm health checks\"\n fi\nfi\n\nheader \"6. 
SSM Readiness\"\n\nif command -v session-manager-plugin &>/dev/null; then\n if SSM_VERSION=$(session-manager-plugin --version 2>/dev/null); then\n pass \"SSM plugin installed\" \"version: $SSM_VERSION\"\n else\n warn \"SSM plugin\" \"installed but --version failed — plugin may be corrupt\"\n add_issue \"SSM plugin installed but broken → references/cluster-diagnostics-detail.md § F (SSM Connectivity)\" \"P1\"\n fi\nelse\n warn \"SSM plugin\" \"not installed — required for node access (install session-manager-plugin)\"\n add_issue \"SSM plugin not installed → references/cluster-diagnostics-detail.md § F (SSM Connectivity)\" \"P2\"\nfi\n\nif [[ -n \"$CLUSTER_ID\" && \"$TOTAL_NODES\" -gt 0 ]]; then\n FIRST_NODE=$(echo \"$NODE_LIST\" | python3 -c \"\nimport sys, json\nnodes = json.load(sys.stdin).get('ClusterNodeSummaries', [])\nif nodes:\n n = nodes[0]\n nid = n.get('InstanceId', '?')\n group = n.get('InstanceGroupName', '?')\n print(f'{group}-{nid}')\n\" 2>/dev/null || echo \"\")\n\n if [[ -n \"$FIRST_NODE\" ]]; then\n info \"SSM target format: sagemaker-cluster:${CLUSTER_ID}_${FIRST_NODE}\"\n info \"To connect: aws ssm start-session --target sagemaker-cluster:${CLUSTER_ID}_${FIRST_NODE} --region $REGION\"\n fi\nfi\n\nif [[ -n \"$SUBNET_IDS\" ]]; then\n header \"6b. 
VPC Endpoints\"\n\n FIRST_SUBNET=$(echo \"$SUBNET_IDS\" | awk '{print $1}')\n VPC_FOR_ENDPOINTS=$(aws ec2 describe-subnets \\\n --subnet-ids \"$FIRST_SUBNET\" \\\n --region \"$REGION\" \\\n --cli-read-timeout 15 \\\n --query 'Subnets[0].VpcId' \\\n --output text 2>/dev/null || echo \"\")\n\n if [[ -n \"$VPC_FOR_ENDPOINTS\" && \"$VPC_FOR_ENDPOINTS\" != \"None\" ]]; then\n EP_RESULT=$(aws ec2 describe-vpc-endpoints \\\n --filters \"Name=vpc-id,Values=$VPC_FOR_ENDPOINTS\" \\\n --region \"$REGION\" \\\n --cli-read-timeout 15 \\\n --query \"VpcEndpoints[?State==\\`available\\`].ServiceName\" \\\n --output text 2>&1)\n if echo \"$EP_RESULT\" | grep -qiE \"AccessDenied|UnauthorizedOperation\"; then\n warn \"describe-vpc-endpoints\" \"IAM permission denied — VPC endpoint check skipped\"\n EP_RESULT=\"\"\n fi\n ENDPOINTS=\"${EP_RESULT}\"\n\n # s3 → Lifecycle scripts (S3 bucket download path)\n # ssm/ssmmessages/ec2messages → SSM connectivity (§ F)\n for SVC in s3 ssm ssmmessages ec2messages; do\n if echo \"$ENDPOINTS\" | grep -qE \"(^|[.])${SVC}($|[[:space:]])\"; then\n pass \"VPC endpoint: $SVC\"\n else\n warn \"VPC endpoint: $SVC\" \"not found — required only if the cluster subnet has no NAT/IGW path out\"\n case \"$SVC\" in\n s3) add_issue \"VPC endpoint not found for s3 → references/cluster-diagnostics-detail.md § C (Lifecycle Scripts)\" \"P2\" ;;\n ssm|ssmmessages|ec2messages)\n add_issue \"VPC endpoint not found for $SVC → references/cluster-diagnostics-detail.md § F (SSM Connectivity)\" \"P2\" ;;\n esac\n fi\n done\n else\n info \"Could not determine VPC ID for endpoint check\"\n fi\nfi\n\nheader \"7. 
CloudWatch Logs\"\n\nif [[ -n \"$CLUSTER_ID\" ]]; then\n # CW log groups follow /aws/sagemaker/Clusters/\u003cCLUSTER_NAME>/\u003cCLUSTER_ID>,\n # where \u003cCLUSTER_NAME> is the human-readable name (not the ARN short-id).\n CLUSTER_NAME_FOR_LOGS=$(echo \"$CLUSTER_JSON\" | python3 -c \"\nimport sys, json\ntry:\n d = json.load(sys.stdin)\n n = d.get('ClusterName', '')\n print(n if n else '')\nexcept Exception:\n print('')\n\" 2>/dev/null)\n # Fall back to the value the caller supplied, unless it looks like an ARN.\n if [[ -z \"$CLUSTER_NAME_FOR_LOGS\" ]]; then\n if [[ \"$CLUSTER\" == arn:aws:* ]]; then\n CLUSTER_NAME_FOR_LOGS=\"$CLUSTER_ID\" # best-effort; will probe the prefix below\n else\n CLUSTER_NAME_FOR_LOGS=\"$CLUSTER\"\n fi\n fi\n\n LOG_GROUP=\"/aws/sagemaker/Clusters/${CLUSTER_NAME_FOR_LOGS}/${CLUSTER_ID}\"\n\n LOG_RESULT=$(aws logs describe-log-groups \\\n --log-group-name-prefix \"$LOG_GROUP\" \\\n --region \"$REGION\" \\\n --query 'logGroups[0].logGroupName' \\\n --output text 2>&1)\n if echo \"$LOG_RESULT\" | grep -qiE \"AccessDenied|UnauthorizedOperation\"; then\n warn \"describe-log-groups\" \"IAM permission denied — CloudWatch log check skipped\"\n LOG_RESULT=\"None\"\n fi\n LOG_EXISTS=\"${LOG_RESULT:-None}\"\n\n if [[ \"$LOG_EXISTS\" != \"None\" && -n \"$LOG_EXISTS\" ]]; then\n pass \"CloudWatch log group\" \"$LOG_GROUP\"\n\n # Use the server-side prefix filter; clusters with hundreds of nodes have\n # hundreds of streams and the default first-page result truncates.\n count_log_streams_by_prefix() {\n local prefix=\"$1\"\n local merged='[]' token='' page_json combined i=0\n while (( i \u003c 20 )); do\n if [[ -n \"$token\" ]]; then\n page_json=$(aws logs describe-log-streams \\\n --log-group-name \"$LOG_GROUP\" --region \"$REGION\" \\\n --log-stream-name-prefix \"$prefix\" --limit 50 --next-token \"$token\" \\\n --output json 2>/dev/null) || break\n else\n page_json=$(aws logs describe-log-streams \\\n --log-group-name \"$LOG_GROUP\" --region 
\"$REGION\" \\\n --log-stream-name-prefix \"$prefix\" --limit 50 \\\n --output json 2>/dev/null) || break\n fi\n combined=$(python3 -c \"\nimport sys, json\nprev = json.loads(sys.argv[1])\npage = json.loads(sys.argv[2])\nprev.extend(s.get('logStreamName','') for s in page.get('logStreams', []))\nprint(json.dumps(prev))\nprint(page.get('nextToken',''))\n\" \"$merged\" \"$page_json\" 2>/dev/null) || break\n\n merged=$(printf '%s\\n' \"$combined\" | sed -n '1p')\n\n token=$(printf '%s\\n' \"$combined\" | sed -n '2p')\n i=$((i+1))\n [[ -z \"$token\" ]] && break\n done\n echo \"$merged\" | python3 -c \"import sys,json; print(len(json.load(sys.stdin)))\" 2>/dev/null || echo 0\n }\n\n LC_COUNT=$(count_log_streams_by_prefix \"LifecycleConfig\")\n HM_COUNT=$(count_log_streams_by_prefix \"SagemakerHealthMonitoringAgent\")\n\n info \"Lifecycle log streams: $LC_COUNT\"\n info \"Health monitoring log streams: $HM_COUNT\"\n\n if [[ \"$LC_COUNT\" -eq 0 && \"$CLUSTER_STATUS\" != \"Creating\" ]]; then\n warn \"Lifecycle logs\" \"no lifecycle log streams found — scripts may not have run\"\n fi\n else\n warn \"CloudWatch log group\" \"not found: $LOG_GROUP\"\n info \"Logs may not be available if cluster creation failed early\"\n info \"Check IAM execution role has CloudWatch Logs write permissions\"\n add_issue \"CloudWatch log group not found → references/cluster-diagnostics-detail.md § C (Lifecycle Scripts)\" \"P2\"\n fi\nfi\n\necho \"\"\necho -e \"${BOLD}========================================${NC}\"\necho -e \"${BOLD} DIAGNOSTIC SUMMARY ${NC}\"\necho -e \"${BOLD}========================================${NC}\"\necho \"\"\n\necho -e \" Cluster: ${BOLD}${CLUSTER}${NC} (${ORCHESTRATOR})\"\necho -e \" Status: ${CLUSTER_STATUS}\"\necho -e \" Results: ${RED}${CRITICAL_FAILURES} critical${NC} | ${YELLOW}${WARNINGS} warnings${NC}\"\necho -e \" Mode: READ-ONLY (no changes made; each [FAIL] points to a references section)\"\necho \"\"\n\nif [[ ${#ISSUES_FOUND[@]} -gt 0 ]]; then\n echo -e 
\"${BOLD} Issues Found (prioritized):${NC}\"\n for priority in P0 P1 P2; do\n has_priority=false\n for issue in \"${ISSUES_FOUND[@]}\"; do\n if [[ \"$issue\" == \"${priority}|\"* ]]; then\n if ! \"$has_priority\"; then\n case \"$priority\" in\n P0) echo -e \" ${RED}${BOLD}[$priority — Fix Immediately]${NC}\" ;;\n P1) echo -e \" ${YELLOW}${BOLD}[$priority — Fix Soon]${NC}\" ;;\n P2) echo -e \" ${BOLD}[$priority — Informational]${NC}\" ;;\n esac\n has_priority=true\n fi\n echo -e \" → ${issue#*|}\"\n fi\n done\n done\n echo \"\"\nfi\n\nif [[ $CRITICAL_FAILURES -eq 0 && $WARNINGS -eq 0 ]]; then\n echo -e \" ${GREEN}${BOLD}All cluster-level checks passed.${NC}\"\n echo \" If issues persist, try:\"\n echo \" - hyperpod-node-debugger skill for per-node issues\"\n echo \" - hyperpod-nccl skill for NCCL/training issues\"\nelif [[ $CRITICAL_FAILURES -eq 0 ]]; then\n echo -e \" ${YELLOW}${BOLD}No critical issues, but $WARNINGS warning(s) found.${NC}\"\n echo \" Review [WARN] items above.\"\nelse\n echo -e \" ${RED}${BOLD}$CRITICAL_FAILURES critical issue(s) found.${NC}\"\n echo \" Fix [FAIL] items above. See SKILL.md for detailed resolution steps.\"\nfi\necho \"\"\n\nexit \"$([[ $CRITICAL_FAILURES -eq 0 ]] && echo 0 || echo 1)\"\n","content_type":"application/x-sh; charset=utf-8","language":"bash","size":70693,"content_sha256":"3a1cdd824abd39aa621f309b49afa5dc197fa906cc6e7948832c785f85b3ea5a"}],"content_json":{"type":"doc","content":[{"type":"heading","attrs":{"level":1},"content":[{"text":"HyperPod Cluster Debugger","type":"text"}]},{"type":"paragraph","content":[{"text":"Operating policy.","type":"text","marks":[{"type":"strong"}]},{"text":" Run read-only diagnostics yourself. Never run a command that changes cluster, node, or workload state — present each one as a ","type":"text"},{"text":"Suggested command (run this yourself)","type":"text","marks":[{"type":"strong"}]},{"text":" block and wait for the customer to run it. 
Destructive order: ","type":"text"},{"text":"investigate → reboot → replace","type":"text","marks":[{"type":"strong"}]},{"text":" (replace destroys root + secondary volumes; not supported on Slurm controller nodes).","type":"text"}]},{"type":"paragraph","content":[{"text":"Before any state-changing CLI: ask if it's IaC-managed.","type":"text","marks":[{"type":"strong"}]},{"text":" HyperPod clusters, SGs, EKS access entries, and IAM are usually provisioned via CloudFormation / CDK / Terraform. If yes, the fix belongs in IaC — running the CLI will drift and the next deploy reverts it. Use the CLI only when IaC is unavailable (locked out, predates IaC, mid-review).","type":"text"}]},{"type":"paragraph","content":[{"text":"scripts/diagnose-cluster.sh","type":"text","marks":[{"type":"code_inline"}]},{"text":" is read-only: it collects state via AWS APIs (and SSM for Slurm controller health) and prints each issue as ","type":"text"},{"text":"[FAIL] ... → references/\u003cfile>.md § \u003csection>","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Reference","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Open when","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"cluster-diagnostics-detail.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Per-finding remediation runbook (§ 
A–L)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"cluster-operations.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-operations.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Operational deep-dives (EFA SG, EKS access, SSM, Slurm, filesystem)","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"cloudformation-errors.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/cloudformation-errors.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"§ H needs the full per-resource CFN error catalog","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"capacity-planning.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/capacity-planning.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"§ B or ","type":"text"},{"text":"--validate","type":"text","marks":[{"type":"code_inline"}]},{"text":" flags capacity / subnet 
sizing","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"lifecycle-scripts.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/lifecycle-scripts.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"§ C points at a specific lifecycle failure","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"iam-permissions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/iam-permissions.md","title":null}}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Full IAM policy for the diagnostic","type":"text"}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Workflow","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Collect HyperPod cluster name (not EKS name), region, exact error string.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Run ","type":"text"},{"text":"scripts/diagnose-cluster.sh","type":"text","marks":[{"type":"code_inline"}]},{"text":" (or ","type":"text"},{"text":"--validate","type":"text","marks":[{"type":"code_inline"}]},{"text":" for pre-create).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"For every ","type":"text"},{"text":"[FAIL]","type":"text","marks":[{"type":"code_inline"}]},{"text":" line, ","type":"text"},{"text":"Read","type":"text","marks":[{"type":"code_inline"}]},{"text":" the referenced 
section.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Present finding, root cause, and the Suggested-command block verbatim. Wait for customer approval.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Re-run the diagnostic to confirm.","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Step 1: Run diagnostics","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# Diagnose an existing cluster:\nbash scripts/diagnose-cluster.sh --cluster \u003cCLUSTER_NAME_OR_ARN> --region \u003cREGION>\n\n# Pre-flight (no cluster needed) — validates SGs, subnets, IAM, VPC endpoints,\n# optionally S3 lifecycle scripts and per-AZ capacity:\nbash scripts/diagnose-cluster.sh --validate --region \u003cREGION> \\\n --sg-ids \u003csg-1,sg-2> --subnet-ids \u003csub-1,sub-2> [--iam-role \u003crole-arn>] \\\n [--s3-uri s3://\u003cBUCKET>/path/] [--instance-type ml.p5.48xlarge]","type":"text"}]},{"type":"paragraph","content":[{"text":"Pass ","type":"text"},{"text":"--instance-type","type":"text","marks":[{"type":"code_inline"}]},{"text":" when the target instance type is known — enables the per-AZ capacity check (warns if none of the provided subnets are in an AZ that offers that type, which causes insufficient-capacity failures at creation time).","type":"text"}]},{"type":"paragraph","content":[{"text":"Tags: ","type":"text"},{"text":"[PASS]","type":"text","marks":[{"type":"code_inline"}]},{"text":" · ","type":"text"},{"text":"[FAIL]","type":"text","marks":[{"type":"code_inline"}]},{"text":" (counted, has ","type":"text"},{"text":"→ references/...","type":"text","marks":[{"type":"code_inline"}]},{"text":" pointer) · ","type":"text"},{"text":"[WARN]","type":"text","marks":[{"type":"code_inline"}]},{"text":" · 
","type":"text"},{"text":"[INFO]","type":"text","marks":[{"type":"code_inline"}]},{"text":". Priorities: ","type":"text"},{"text":"P0","type":"text","marks":[{"type":"strong"}]},{"text":" blocks operation · ","type":"text"},{"text":"P1","type":"text","marks":[{"type":"strong"}]},{"text":" degraded · ","type":"text"},{"text":"P2","type":"text","marks":[{"type":"strong"}]},{"text":" informational.","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Step 2: Match signal → section","type":"text"}]},{"type":"paragraph","content":[{"text":"Error messages / events:","type":"text","marks":[{"type":"strong"}]}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Signal","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Section","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"EFA health checks did not run successfully\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" (public-doc verbatim signal)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"A: EFA Health Checks","type":"text","marks":[{"type":"link","attrs":{"href":"#a-efa-health-checks","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Insufficient-capacity or AZ-mismatch failure at 
creation","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"B: Capacity & AZ","type":"text","marks":[{"type":"link","attrs":{"href":"#b-capacity--az","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Lifecycle-script failure or timeout during provisioning","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"C: Lifecycle Scripts","type":"text","marks":[{"type":"link","attrs":{"href":"#c-lifecycle-scripts","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"kubectl auth error (server asks for credentials / no API group list)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"D: EKS Access","type":"text","marks":[{"type":"link","attrs":{"href":"#d-eks-access--kubectl","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"InService","type":"text","marks":[{"type":"code_inline"}]},{"text":" but not all instances visible","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"E: Cluster 
Provisioning","type":"text","marks":[{"type":"link","attrs":{"href":"#e-cluster-provisioning","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Target is not connected\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" / SSM errors","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"F: SSM Connectivity","type":"text","marks":[{"type":"link","attrs":{"href":"#f-ssm-connectivity","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Node replacement not happening / ","type":"text"},{"text":"batch-replace","type":"text","marks":[{"type":"code_inline"}]},{"text":" not working","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"G: Node Replacement","type":"text","marks":[{"type":"link","attrs":{"href":"#g-node-replacement","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Embedded stack failed\"","type":"text","marks":[{"type":"code_inline"}]},{"text":" / any CloudFormation error","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"H: CloudFormation 
Errors","type":"text","marks":[{"type":"link","attrs":{"href":"#h-cloudformation-errors","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"UpdateClusterSoftware","type":"text","marks":[{"type":"code_inline"}]},{"text":" failed or cluster in post-maintenance rollback state","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"J: AMI & Cluster Updates","type":"text","marks":[{"type":"link","attrs":{"href":"#j-ami--cluster-updates","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Dangling / orphaned nodes in EKS vs ","type":"text"},{"text":"list-cluster-nodes","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"K: Dangling Nodes & Cleanup","type":"text","marks":[{"type":"link","attrs":{"href":"#k-dangling-nodes--cleanup","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Cluster Autoscaler breaks after HyperPod attached","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"L: Autoscaler Compatibility","type":"text","marks":[{"type":"link","attrs":{"href":"#l-autoscaler-compatibility","title":null}},{"type":"strong"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Slow I/O, FSx throughput 
saturated","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"cluster-operations.md § 9","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-operations.md","title":null}}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Slurm node name → instance ID lookup","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"I: Utilities","type":"text","marks":[{"type":"link","attrs":{"href":"#i-utilities","title":null}},{"type":"strong"}]}]}]}]}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"A: EFA Health Checks","type":"text"}]},{"type":"paragraph","content":[{"text":"SG missing self-reference. Add inbound + outbound self-ref to every SG on the cluster, plus least-privilege egress for the AWS APIs the node needs (HTTPS 443 to S3 / ECR / SageMaker / SSM / STS / CloudWatch Logs — via VPC-endpoint prefix-lists when possible). Full procedure: ","type":"text"},{"text":"cluster-diagnostics-detail.md § A","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#a-efa-health-checks","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"B: Capacity & AZ","type":"text"}]},{"type":"paragraph","content":[{"text":"Instance type unavailable in the requested AZ. Verify with ","type":"text"},{"text":"describe-instance-type-offerings","type":"text","marks":[{"type":"code_inline"}]},{"text":", then change AZ, use Flexible Training Plans, or request ODCR. 
Full: ","type":"text"},{"text":"§ B","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#b-capacity--az","title":null}}]},{"text":" · strategy: ","type":"text"},{"text":"capacity-planning.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/capacity-planning.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"C: Lifecycle Scripts","type":"text"}]},{"type":"paragraph","content":[{"text":"Script failed or timed out during provisioning. Read CloudWatch under ","type":"text"},{"text":"/aws/sagemaker/Clusters/\u003cname>/\u003cid>","type":"text","marks":[{"type":"code_inline"}]},{"text":" — common causes: missing S3 VPC endpoint, IAM gap, CRLF line endings, instance-group name mismatch. Full: ","type":"text"},{"text":"§ C","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#c-lifecycle-scripts","title":null}}]},{"text":" · layout: ","type":"text"},{"text":"lifecycle-scripts.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/lifecycle-scripts.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"D: EKS Access / kubectl","type":"text"}]},{"type":"paragraph","content":[{"text":"IAM identity not in EKS access entries. Verify with ","type":"text"},{"text":"sts get-caller-identity","type":"text","marks":[{"type":"code_inline"}]},{"text":", create an access entry with admin policy, update kubeconfig. 
Full: ","type":"text"},{"text":"§ D","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#d-eks-access--kubectl","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"E: Cluster Provisioning","type":"text"}]},{"type":"paragraph","content":[{"text":"InService","type":"text","marks":[{"type":"code_inline"}]},{"text":" without all instances is expected under Continuous Provisioning — failures surface as events, not cluster errors. For stuck ","type":"text"},{"text":"Creating","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"Updating","type":"text","marks":[{"type":"code_inline"}]},{"text":"/","type":"text"},{"text":"Deleting","type":"text","marks":[{"type":"code_inline"}]},{"text":": check CFN nested stacks (§ H), IAM, capacity, events; if stuck ","type":"text"},{"text":"Deleting","type":"text","marks":[{"type":"code_inline"}]},{"text":" check VPC ENI dependencies. Full: ","type":"text"},{"text":"§ E","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#e-cluster-provisioning","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"F: SSM Connectivity","type":"text"}]},{"type":"paragraph","content":[{"text":"Target is not connected","type":"text","marks":[{"type":"code_inline"}]},{"text":": use ","type":"text"},{"text":"sagemaker-cluster:\u003cCLUSTER_ID>_\u003cGROUP>-\u003cINSTANCE_ID>","type":"text","marks":[{"type":"code_inline"}]},{"text":" format (not raw EC2 ID), install session-manager-plugin, confirm node ","type":"text"},{"text":"Running","type":"text","marks":[{"type":"code_inline"}]},{"text":". Check IAM + VPC endpoints on timeouts. 
Full: ","type":"text"},{"text":"§ F","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#f-ssm-connectivity","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"G: Node Replacement","type":"text"}]},{"type":"paragraph","content":[{"text":"Auto-repair: confirm ","type":"text"},{"text":"NodeRecovery=Automatic","type":"text","marks":[{"type":"code_inline"}]},{"text":", check Health Monitoring Agent (HMA) logs + node labels / Slurm reason, confirm capacity. Manual: reboot first, replace only if reboot fails. Replace requires the cluster to have been patched via ","type":"text"},{"text":"UpdateClusterSoftware","type":"text","marks":[{"type":"code_inline"}]},{"text":" at least once and cannot target a Slurm controller node. Full: ","type":"text"},{"text":"§ G","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#g-node-replacement","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"H: CloudFormation Errors","type":"text"}]},{"type":"paragraph","content":[{"text":"Embedded stack failed","type":"text","marks":[{"type":"code_inline"}]},{"text":" hides the real error. Drill into nested stacks via Events tab (filter Failed) until you reach a non-stack resource. CLI: ","type":"text"},{"text":"describe-stack-events --query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'","type":"text","marks":[{"type":"code_inline"}]},{"text":". Also covers SLR creation failures and permission-boundary denials. 
Full: ","type":"text"},{"text":"§ H","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#h-cloudformation-errors","title":null}}]},{"text":" · catalog: ","type":"text"},{"text":"cloudformation-errors.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/cloudformation-errors.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"I: Utilities","type":"text"}]},{"type":"paragraph","content":[{"text":"Map Slurm node names (","type":"text"},{"text":"ip-10-x-y-z","type":"text","marks":[{"type":"code_inline"}]},{"text":") to HyperPod instance IDs via ","type":"text"},{"text":"list-cluster-nodes","type":"text","marks":[{"type":"code_inline"}]},{"text":" or on-node ","type":"text"},{"text":"/opt/ml/config/resource_config.json","type":"text","marks":[{"type":"code_inline"}]},{"text":". Full: ","type":"text"},{"text":"§ I","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#i-utilities","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"J: AMI & Cluster Updates","type":"text"}]},{"type":"paragraph","content":[{"text":"UpdateClusterSoftware","type":"text","marks":[{"type":"code_inline"}]},{"text":" fails and rolls back, or the cluster stays in a post-maintenance rollback state. Common causes: lifecycle script incompatible with new AMI, HMA version too old, insufficient rolling-update capacity. If the cluster has active nodes, collect diagnostics and escalate rather than delete-and-recreate. 
Full: ","type":"text"},{"text":"§ J","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#j-ami--cluster-updates","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"K: Dangling Nodes & Cleanup","type":"text"}]},{"type":"paragraph","content":[{"text":"Nodes in ","type":"text"},{"text":"kubectl get nodes","type":"text","marks":[{"type":"code_inline"}]},{"text":" but not in ","type":"text"},{"text":"list-cluster-nodes","type":"text","marks":[{"type":"code_inline"}]},{"text":" (ghost EKS nodes), or the inverse (HyperPod nodes that never registered kubelet). Script flags both. Full: ","type":"text"},{"text":"§ K","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#k-dangling-nodes--cleanup","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"L: Autoscaler Compatibility","type":"text"}]},{"type":"paragraph","content":[{"text":"Cluster Autoscaler errors on HyperPod provider IDs and breaks autoscaling for all node groups. No officially endorsed workaround — escalate to AWS Support. Karpenter does not conflict with HyperPod nodes by default. 
Full: ","type":"text"},{"text":"§ L","type":"text","marks":[{"type":"link","attrs":{"href":"references/cluster-diagnostics-detail.md#l-autoscaler-compatibility","title":null}}]},{"text":".","type":"text"}]},{"type":"hr","attrs":{"markup":"---"}},{"type":"heading","attrs":{"level":2},"content":[{"text":"Prerequisites","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"aws","type":"text","marks":[{"type":"code_inline"}]},{"text":" CLI v2.13+ authenticated to the cluster's account","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"jq","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"python3","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"bash","type":"text","marks":[{"type":"code_inline"}]},{"text":" 4.2+","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"kubectl","type":"text","marks":[{"type":"code_inline"}]},{"text":" authenticated to the EKS cluster (EKS checks skipped if absent)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"session-manager-plugin","type":"text","marks":[{"type":"code_inline"}]},{"text":" (Slurm controller health checks only)","type":"text"}]}]}]},{"type":"paragraph","content":[{"text":"IAM policy: ","type":"text"},{"text":"references/iam-permissions.md","type":"text","marks":[{"type":"link","attrs":{"href":"references/iam-permissions.md","title":null}}]},{"text":".","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Defaults","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Region","type":"text","marks":[{"type":"strong"}]},{"text":" — required: pass ","type":"text"},{"text":"--region","type":"text","marks":[{"type":"code_inline"}]},{"text":" or set 
","type":"text"},{"text":"$AWS_DEFAULT_REGION","type":"text","marks":[{"type":"code_inline"}]},{"text":".","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Mode","type":"text","marks":[{"type":"strong"}]},{"text":" — ","type":"text"},{"text":"--cluster \u003cNAME>","type":"text","marks":[{"type":"code_inline"}]},{"text":" (diagnose) or ","type":"text"},{"text":"--validate","type":"text","marks":[{"type":"code_inline"}]},{"text":" (pre-create).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Event window","type":"text","marks":[{"type":"strong"}]},{"text":" — up to 500 most recent events (5 × 100, paginated).","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Colors","type":"text","marks":[{"type":"strong"}]},{"text":" — auto-disabled on non-TTY; ","type":"text"},{"text":"--no-color","type":"text","marks":[{"type":"code_inline"}]},{"text":" to force off.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Error handling","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Failure","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Script","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Tell the customer","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"aws sts get-caller-identity","type":"text","marks":[{"type":"code_inline"}]},{"text":" 
fails","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Exit 1","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Fix AWS credentials and rerun.\"","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Cluster not found","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Exit 1 after listing region's clusters","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Confirm HyperPod cluster name (not EKS) and region.\"","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"sagemaker:*","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"ec2:*","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"eks:*","type":"text","marks":[{"type":"code_inline"}]},{"text":" / ","type":"text"},{"text":"logs:*","type":"text","marks":[{"type":"code_inline"}]},{"text":" denied","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Warn, add ","type":"text"},{"text":"Missing IAM permission for \u003cAPI>","type":"text","marks":[{"type":"code_inline"}]},{"text":", continue","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Grant the listed IAM action and 
rerun.\"","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"kubectl","type":"text","marks":[{"type":"code_inline"}]},{"text":" absent or unauthenticated","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skip EKS checks (access entries, add-ons, aws-auth, nodes)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Install/authenticate kubectl.\"","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"session-manager-plugin","type":"text","marks":[{"type":"code_inline"}]},{"text":" absent (Slurm)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skip Slurm controller probe","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Install session-manager-plugin.\"","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"SSM throttled / times out (180s)","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Retry with backoff; warn and continue if still failing","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"Rerun later — script is 
idempotent.\"","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"CloudWatch log group not found","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Skip CloudWatch check","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"\"CloudWatch not configured on this cluster.\"","type":"text"}]}]}]}]},{"type":"paragraph","content":[{"text":"Exit codes: ","type":"text"},{"text":"0","type":"text","marks":[{"type":"code_inline"}]},{"text":" no critical failures · ","type":"text"},{"text":"1","type":"text","marks":[{"type":"code_inline"}]},{"text":" one or more critical failures (cluster not found, fatal prerequisite missing, or any ","type":"text"},{"text":"[FAIL]","type":"text","marks":[{"type":"code_inline"}]},{"text":" in diagnose or ","type":"text"},{"text":"--validate","type":"text","marks":[{"type":"code_inline"}]},{"text":" mode). 
","type":"text"},{"text":"[WARN]","type":"text","marks":[{"type":"code_inline"}]},{"text":" lines do not affect the exit code.","type":"text"}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Skill delegation","type":"text"}]},{"type":"table","attrs":{"layout":null},"content":[{"type":"tr","content":[{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Need","type":"text"}]}]},{"type":"th","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Use","type":"text"}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Shell on nodes","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-ssm","type":"text","marks":[{"type":"code_inline"}]}]}]}]},{"type":"tr","content":[{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"Version comparison across nodes","type":"text"}]}]},{"type":"td","attrs":{"colspan":1,"rowspan":1,"colwidth":null,"alignment":""},"content":[{"type":"paragraph","content":[{"text":"hyperpod-version-checker","type":"text","marks":[{"type":"code_inline"}]}]}]}]}]},{"type":"heading","attrs":{"level":2},"content":[{"text":"Escalate to AWS Support","type":"text"}]},{"type":"paragraph","content":[{"text":"Escalate when:","type":"text"}]},{"type":"ordered_list","attrs":{"order":1,"listStyle":"number"},"content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"EFA health checks fail despite correct SG rules.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Capacity errors persist despite a valid Flexible Training Plan / 
ODCR.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Node replacement fails repeatedly without clear events / log signal.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cluster stuck in a non-terminal state (","type":"text"},{"text":"Creating","type":"text","marks":[{"type":"code_inline"}]},{"text":", ","type":"text"},{"text":"Updating","type":"text","marks":[{"type":"code_inline"}]},{"text":", or a post-maintenance rollback state) for an extended period.","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"CloudFormation root-cause is an internal service error.","type":"text"}]}]}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Before opening the case","type":"text"}]},{"type":"paragraph","content":[{"text":"Run these commands and attach the output. Goal: AWS Support has everything at case open.","type":"text"}]},{"type":"code_block","attrs":{"wrap":false,"language":"bash"},"content":[{"text":"# 1. Cluster identity + status (confirms region, ARN, orchestrator, instance groups)\naws sagemaker describe-cluster --cluster-name \u003cCLUSTER> --region \u003cREGION>\n\n# 2. Full cluster-level diagnostic bundle\nbash scripts/diagnose-cluster.sh --cluster \u003cCLUSTER> --region \u003cREGION> > diag.txt\n\n# 3. 
Per-node log/config bundle to S3 (delegates to hyperpod-issue-report skill)\n# See skills/hyperpod-issue-report/SKILL.md for the exact invocation.","type":"text"}]},{"type":"heading","attrs":{"level":3},"content":[{"text":"Include in the case","type":"text"}]},{"type":"bullet_list","content":[{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Cluster name + ARN (or ","type":"text"},{"text":"ClusterId","type":"text","marks":[{"type":"code_inline"}]},{"text":" suffix) and AWS region","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"ClusterStatus","type":"text","marks":[{"type":"code_inline"}]},{"text":" + ","type":"text"},{"text":"FailureMessage","type":"text","marks":[{"type":"code_inline"}]},{"text":" from ","type":"text"},{"text":"describe-cluster","type":"text","marks":[{"type":"code_inline"}]}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Timestamp window (UTC start / end) of the failure","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Exact error strings observed (copy verbatim from events / logs / console)","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"Affected instance IDs / ","type":"text"},{"text":"NodeLogicalId","type":"text","marks":[{"type":"code_inline"}]},{"text":"s / instance group names","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"diag.txt","type":"text","marks":[{"type":"code_inline"}]},{"text":" from step 2 above","type":"text"}]}]},{"type":"list_item","content":[{"type":"paragraph","content":[{"text":"S3 URI of the ","type":"text"},{"text":"hyperpod-issue-report","type":"text","marks":[{"type":"code_inline"}]},{"text":" bundle from step 
3","type":"text"}]}]}]},{"type":"hr","attrs":{"markup":"---"}}]},"metadata":{"date":"2026-06-05","name":"hyperpod-cluster-debugger","author":"@skillopedia","source":{"stars":765,"repo_name":"agent-plugins","origin_url":"https://github.com/awslabs/agent-plugins/blob/HEAD/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md","repo_owner":"awslabs","body_sha256":"598f43f87c9e432ef2055bc5e6b627f0a003904acc18ddc490c67f307b35df65","cluster_key":"9067a983a3f5389093860c3e346a51de58e9581b5a42e9f60a8debe2b956454c","clean_bundle":{"format":"clean-skill-bundle-v1","source":"awslabs/agent-plugins/plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md","attachments":[{"id":"c54c5766-9a0a-510c-a574-676f00b809a7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/c54c5766-9a0a-510c-a574-676f00b809a7/attachment.md","path":"references/capacity-planning.md","size":4653,"sha256":"ee2de0ba7b3a89992ab4af5d91944adf3bf8e7554186abdc89f96f4ee9fe54b3","contentType":"text/markdown; charset=utf-8"},{"id":"9339ff5e-eabe-52e4-b39b-f83d529a72f7","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9339ff5e-eabe-52e4-b39b-f83d529a72f7/attachment.md","path":"references/cloudformation-errors.md","size":5896,"sha256":"9bb8c60fbe5b11e73774c63d5ffe4199993082818b29c1606226bbbf3ae2aebc","contentType":"text/markdown; charset=utf-8"},{"id":"fd8d1938-c8be-5f7c-8cbf-6722cbbd6477","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/fd8d1938-c8be-5f7c-8cbf-6722cbbd6477/attachment.md","path":"references/cluster-diagnostics-detail.md","size":23900,"sha256":"f6cf04051c3f4fac642d9b1405ecb77fd97c7ac5378263d18b5a1dfc366ed336","contentType":"text/markdown; charset=utf-8"},{"id":"fcaa6db5-e47a-5303-9fa1-c8a969b8f0c9","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/fcaa6db5-e47a-5303-9fa1-c8a969b8f0c9/attachment.md","path":"references/cluster-operations.md","size":11785,"sha256":"c0a635a4256e230dcc7281e53a17fb236ddc33b93a315c4d85bdb94af0ee9a77","contentType":"text/markdown; 
charset=utf-8"},{"id":"33560c1f-375f-5009-8881-0fbd87eb895c","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/33560c1f-375f-5009-8881-0fbd87eb895c/attachment.md","path":"references/iam-permissions.md","size":1200,"sha256":"cbe646b61f2189d7f0c751d4210126045f183ee33415f71f62413c0d73ba65de","contentType":"text/markdown; charset=utf-8"},{"id":"2754f26a-671d-5ff1-8b18-6635a45fe211","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/2754f26a-671d-5ff1-8b18-6635a45fe211/attachment.md","path":"references/lifecycle-scripts.md","size":5225,"sha256":"4c50f65d0f869e32684238bd05e3bdc9f236c9c5afb8b1294fed0a101793ea39","contentType":"text/markdown; charset=utf-8"},{"id":"9e414e7e-bd98-5f72-bac9-d2f16756600d","key":"uploads/10433ee7-ad12-4ae0-b34e-97553e46c6c8/9e414e7e-bd98-5f72-bac9-d2f16756600d/attachment.sh","path":"scripts/diagnose-cluster.sh","size":70693,"sha256":"3a1cdd824abd39aa621f309b49afa5dc197fa906cc6e7948832c785f85b3ea5a","contentType":"application/x-sh; charset=utf-8"}],"bundle_sha256":"1a3eaceba52dcb97ddb2e2e57a179fcaa6634896b3ce4436fdf71da7b9259a5d","attachment_count":7,"text_attachments":7,"attachment_storage":"skillopedia-attachments-v1","binary_attachments":0,"excluded_attachments":[]},"cluster_size":1,"skill_md_path":"plugins/sagemaker-ai/skills/hyperpod-cluster-debugger/SKILL.md","import_metadata":{"date":"2026-06-05","author":"@skillopedia","version":"v1","category":"devops-infrastructure","category_label":"DevOps"},"exact_dupes_collapsed_into_this":0},"version":"v1","category":"devops-infrastructure","metadata":{"version":"0.0.1"},"import_tag":"clean-skills-v1","description":"Diagnose and remediate cluster-wide HyperPod (EKS or Slurm) problems — creation / deployment failures (CloudFormation, EFA health check, lifecycle scripts, capacity), EKS access, node replacement, CloudFormation nested-stack errors, post-maintenance rollback state, dangling nodes, autoscaler conflicts. Includes `--validate` pre-flight. Read-only."}},"renderedAt":1782979469241}
