A System Event Log (SEL) Viewer is a critical diagnostic tool used to read the non-volatile memory on a server’s motherboard, which records hardware-level events and failures managed by the Baseboard Management Controller (BMC). Deciphering these logs is often the fastest way to pinpoint which hardware fault caused a server crash.

🛠️ Common Hardware Faults in SEL
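Memory Errors (ECC): Shows correctable and uncorrectable ECC errors. An uncorrectable error usually indicates a failing DIMM, less often a bad slot or memory controller. Typically causes an instant Blue Screen (BSOD) on Windows or Purple Screen (PSOD) on VMware ESXi.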
Processor Faults (IERR): Registers an Internal Error (IERR) or Machine Check Exception (MCE). Indicates CPU instability, voltage drops, or overheating.
Power Supply Failures (PSU): Logs loss of redundancy or voltage out of range. Results in sudden power-offs without OS warning.
Thermal Events: Reports ambient or component temperatures exceeding critical thresholds. Triggers automatic protective shutdowns.
PCIe/Bus Errors: Logs fatal bus errors on slots. Points to failing RAID controllers, NICs, or GPUs.

🔍 How to Access the SEL Viewer
Out-of-Band (Recommended): Log into the BMC web interface (for example Dell iDRAC, HPE iLO, or another IPMI-based console). Navigate to Maintenance or Logs; the exact menu name varies by vendor. This works even if the OS is completely dead.
In-Band (Live OS): Use command-line tools like ipmitool in Linux or ipmiutil in Windows. Run ipmitool sel elist to dump the event list.
Pre-Boot: Press the vendor-specific hotkey (e.g., F2 or F11) during POST to enter System Setup and view the hardware logs.

🧩 Deciphering the Log Format
Every SEL entry contains key metadata fields that you must decode:
Timestamp: The second at which the BMC logged the event. Match this with OS syslog or Event Viewer logs to find the trigger, allowing for any offset between the BMC clock and the OS clock.
Sensor Name: Identifies the exact component (e.g., CPU 1 Status, DIMM A2, PSU 2 Status).
Event Description: State change detail. Look for keywords like Critical, Uncorrectable error, Lower Critical, or Lower Non-recoverable, and note whether the event was asserted or deasserted.

🚀 Step-by-Step Triage Workflow
Clear the Noise: Export the entire log to CSV or TXT. Filter strictly for “Critical” or “Fatal” severities.
Find the First Domino: Look at the timestamps. Identify the very first critical error before the “Power Unit / Soft Power Control Failure” log.
Isolate the Component: Use the slot label (like DIMM B1) to locate the physical part on the motherboard diagram.
Cross-Reference: Check the OS system logs at that exact timestamp to see whether a driver or software event coincides with the hardware failure.
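The workflow above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: it assumes the exported log is pipe-delimited text in the layout that ipmitool sel elist typically prints (record ID, date, time, sensor, description, direction), and the sample lines are invented for illustration, not taken from a real server. The names SelEntry, parse_sel_line, is_critical, and first_critical are hypothetical helpers, and the keyword list is a starting point that you would likely need to adjust for your vendor's wording.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class SelEntry(NamedTuple):
    record_id: str
    timestamp: datetime
    sensor: str
    description: str
    direction: str


# Keywords drawn from the article; extend for your vendor's wording.
CRITICAL_KEYWORDS = ("critical", "uncorrectable", "non-recoverable", "fatal", "ierr")

# Invented sample in the assumed "id | date | time | sensor | description | direction" layout.
SAMPLE = """\
   1 | 03/14/2024 | 02:10:05 | Temperature Inlet Temp | Upper Non-critical going high | Asserted
   2 | 03/14/2024 | 02:11:40 | Memory DIMM B1 | Uncorrectable ECC | Asserted
   3 | 03/14/2024 | 02:11:42 | Processor CPU 1 Status | IERR | Asserted
   4 | 03/14/2024 | 02:11:45 | Power Unit | Soft-power control failure | Asserted
"""


def parse_sel_line(line: str) -> Optional[SelEntry]:
    """Split one exported line into fields; return None if it does not fit the layout."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 5:
        return None
    try:
        ts = datetime.strptime(f"{parts[1]} {parts[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None  # skip lines whose timestamp does not parse
    direction = parts[5] if len(parts) > 5 else ""
    return SelEntry(parts[0], ts, parts[3], parts[4], direction)


def is_critical(entry: SelEntry) -> bool:
    """Step 1, clear the noise: keep only entries whose description looks critical."""
    # Drop "non-critical" first so warning-level thresholds do not match "critical".
    text = entry.description.lower().replace("non-critical", "")
    return any(keyword in text for keyword in CRITICAL_KEYWORDS)


def first_critical(entries: List[SelEntry]) -> Optional[SelEntry]:
    """Step 2, find the first domino: earliest critical entry before the power-control failure."""
    window = []
    for entry in sorted(entries, key=lambda e: e.timestamp):
        if "soft power control failure" in entry.description.lower().replace("-", " "):
            break
        window.append(entry)
    critical = [e for e in window if is_critical(e)]
    return critical[0] if critical else None


entries = [e for e in map(parse_sel_line, SAMPLE.splitlines()) if e is not None]
first = first_critical(entries)
if first is None:
    print("no critical event found")
else:
    # Step 3: the sensor name carries the slot label to look up on the board diagram.
    # Step 4: the timestamp is what you would search for in the OS logs.
    print(first.timestamp.isoformat(), first.sensor, first.description)
```

With the invented sample, the entry that comes back is the one for Memory DIMM B1, since the inlet temperature warning is non-critical and the IERR arrives two seconds later. Matching on keywords is deliberately crude; if your BMC export includes a proper severity column, filtering on that column would be more reliable.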
If you are dealing with an active crash, I can help you decode it if you share:
The server model (e.g., Dell PowerEdge R740, HP ProLiant DL360)
The exact text or error code from your SEL Viewer
The operating system running on the machine