How to troubleshoot the AIX "EEH temporary error for adapter"
EEH stands for Enhanced Error Handling, which is an error-recovery mechanism for errors that occur during load and store operations on the PCI bus[^2^]. EEH is supported by newer POWER processor-based servers that have EADS chips in each PCI slot[^2^].
Sometimes an EEH event is triggered by a transient error on the PCI bus, such as link recovery on a Fibre Channel adapter. In this case, AIX tries to recover the slot by resetting it and reinitializing the adapter. I/O to the adapter is paused while this happens, so the recovery can cause performance degradation or application failures[^1^].
To troubleshoot this issue, you can use the following steps:
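Check the errpt command output for EEH errors related to the adapter. The description reads "EEH temporary error for adapter", or "EEH permanent error for adapter" if recovery failed. The error type is temporary or permanent (shown as T or P in the summary listing and as TEMP or PERM in errpt -a), and the error class is H for hardware. Error labels differ between adapter drivers and AIX levels, so search on the description and resource name rather than on a fixed label.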
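Check the detailed error log entry with errpt -a for more information about the EEH event, such as the resource name, location code, and sense data. AIX records EEH events in the system error log (by default /var/adm/ras/errlog); a separate /var/log/eehlog file is not part of a standard AIX installation.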
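Check the adapter status using the lsdev command with the -Cc adapter option. The status should be Available. If the status is Defined, the device is known to the system but not configured, which after an EEH event usually means that AIX failed to recover the slot. Try running cfgmgr to reconfigure the adapter, and reboot the system if that does not return it to Available.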
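Check the adapter's vital product data using the lscfg command with the -vl option and the device name. The output shows details such as the location code, part number, and firmware (ROM) level, which you need when checking for firmware updates or opening a support call.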
If possible, update the adapter firmware and driver to the latest level. This may help to prevent or reduce EEH events in the future.
If the problem persists, contact IBM support for further assistance.
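The first checks above can be sketched as a small shell helper. This is a minimal sketch, not an IBM-provided tool: the function names eeh_entries and defined_adapters are hypothetical, and the code assumes the default errpt summary columns (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) and the usual lsdev layout (name, status, location, description). Both functions read from standard input, so they can be tried on saved output before being used on a live system.

```shell
#!/bin/sh
# Sketch only: helpers that filter errpt and lsdev output read from stdin.
# The column positions are assumptions about the default output layouts.

# Print "resource type timestamp" for every error log line that mentions EEH.
# Assumed columns: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
eeh_entries() {
    awk '/EEH/ { print $5, $3, $2 }'
}

# Print the name of every adapter whose status column reads "Defined".
# Assumed columns: name status location description
defined_adapters() {
    awk '$2 == "Defined" { print $1 }'
}

# On an AIX host the helpers would be used like this:
#   errpt | eeh_entries
#   lsdev -Cc adapter | defined_adapters
```

A type of T in the output of eeh_entries indicates a temporary (recovered) event and P a permanent one. Any adapter name printed by defined_adapters is a candidate for the cfgmgr or reboot step.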
I hope this article helps you understand and resolve the AIX EEH temporary error for adapter. Thank you for reading.
Here is some additional information that may be useful:
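EEH events are logged in the system error log. Once you have investigated them, you can use the errclear command to clear the error log entries; for example, errclear 0 removes all entries, and the -N option limits the deletion to a given resource name. AIX has no separate eehclear command.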
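As a hedged illustration, the sketch below saves the detailed entries for one adapter before clearing them, so the evidence is still available if IBM support asks for it. The function name clear_adapter_errors and the default backup path under /tmp are hypothetical choices; errpt -a -N and errclear -N are the standard options for selecting entries by resource name.

```shell
#!/bin/sh
# Sketch only: back up, then clear, the error log entries for one adapter.
# $1 = resource name (for example fcs0)
# $2 = optional backup file; the /tmp default is an arbitrary assumption
clear_adapter_errors() {
    adapter=$1
    backup=${2:-/tmp/errpt_${adapter}.txt}

    # Save the detailed report first; clear only if the save succeeded.
    errpt -a -N "$adapter" > "$backup" &&
        errclear -N "$adapter" 0
}

# On an AIX host, run as root:
#   clear_adapter_errors fcs0
```

Clearing only after the report has been written avoids losing entries when the backup location is not writable.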
EEH events can be classified into three types: recoverable, non-recoverable, and fatal. Recoverable events are those that can be fixed by resetting and reinitializing the slot. Non-recoverable events are those that require a reboot of the system to restore the slot. Fatal events are those that cause a system crash or hang.
EEH events can be caused by various factors, such as hardware defects, firmware bugs, driver issues, environmental conditions, or external interference. Some common causes of EEH events are: loose cables, faulty connectors, dust or dirt on the PCI bus, power fluctuations, electrostatic discharge, or radiation.
EEH events can be prevented or minimized by following some best practices, such as: keeping the system firmware and adapter firmware up to date, using certified adapters and cables, ensuring proper ventilation and cooling of the system, avoiding physical shocks or vibrations to the system, and using proper grounding and shielding techniques.