diff options
Diffstat (limited to 'Documentation/drivers/edac/edac.txt')
-rw-r--r-- | Documentation/drivers/edac/edac.txt | 152 |
1 files changed, 27 insertions, 125 deletions
diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt index 70d96a6..7b3d969 100644 --- a/Documentation/drivers/edac/edac.txt +++ b/Documentation/drivers/edac/edac.txt @@ -35,15 +35,14 @@ the vendor should tie the parity status bits to 0 if they do not intend to generate parity. Some vendors do not do this, and thus the parity bit can "float" giving false positives. -The PCI Parity EDAC device has the ability to "skip" known flaky -cards during the parity scan. These are set by the parity "blacklist" -interface in the sysfs for PCI Parity. (See the PCI section in the sysfs -section below.) There is also a parity "whitelist" which is used as -an explicit list of devices to scan, while the blacklist is a list -of devices to skip. +[There are patches in the kernel queue which will allow for storage of +quirks of PCI devices reporting false parity positives. The 2.6.18 +kernel should have those patches included. When that becomes available, +then EDAC will be patched to utilize that information to "skip" such +devices.] -EDAC will have future error detectors that will be added or integrated -into EDAC in the following list: +EDAC will have future error detectors that will be integrated with +EDAC or added to it, in the following list: MCE Machine Check Exception MCA Machine Check Architecture @@ -93,22 +92,24 @@ EDAC lives in the /sys/devices/system/edac directory. Within this directory there currently reside 2 'edac' components: mc memory controller(s) system - pci PCI status system + pci PCI control and status system ============================================================================ Memory Controller (mc) Model First a background on the memory controller's model abstracted in EDAC. -Each mc device controls a set of DIMM memory modules. These modules are +Each 'mc' device controls a set of DIMM memory modules. These modules are laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can -be multiple csrows and two channels. +be multiple csrows and multiple channels. Memory controllers allow for several csrows, with 8 csrows being a typical value. Yet, the actual number of csrows depends on the electrical "loading" of a given motherboard, memory controller and DIMM characteristics. Dual channels allows for 128 bit data transfers to the CPU from memory. +Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs +(FB-DIMMs). The following example will assume 2 channels: Channel 0 Channel 1 @@ -234,23 +235,15 @@ Polling period control file: The time period, in milliseconds, for polling for error information. Too small a value wastes resources. Too large a value might delay necessary handling of errors and might loose valuable information for - locating the error. 1000 milliseconds (once each second) is about - right for most uses. + locating the error. 1000 milliseconds (once each second) is the current + default. Systems which require all the bandwidth they can get, may + increase this. LOAD TIME: module/kernel parameter: poll_msec=[0|1] RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec -Module Version read-only attribute file: - - 'mc_version' - - The EDAC CORE module's version and compile date are shown here to - indicate what EDAC is running. - - - ============================================================================ 'mcX' DIRECTORIES @@ -284,35 +277,6 @@ Seconds since last counter reset control file: -DIMM capability attribute file: - - 'edac_capability' - - The EDAC (Error Detection and Correction) capabilities/modes of - the memory controller hardware. - - -DIMM Current Capability attribute file: - - 'edac_current_capability' - - The EDAC capabilities available with the hardware - configuration. This may not be the same as "EDAC capability" - if the correct memory is not used. If a memory controller is - capable of EDAC, but DIMMs without check bits are in use, then - Parity, SECDED, S4ECD4ED capabilities will not be available - even though the memory controller might be capable of those - modes with the proper memory loaded. - - -Memory Type supported on this controller attribute file: - - 'supported_mem_type' - - This attribute file displays the memory type, usually - buffered and unbuffered DIMMs. - - Memory Controller name attribute file: 'mc_name' @@ -321,16 +285,6 @@ Memory Controller name attribute file: that is being utilized. -Memory Controller Module name attribute file: - - 'module_name' - - This attribute file displays the memory controller module name, - version and date built. The name of the memory controller - hardware - some drivers work with multiple controllers and - this field shows which hardware is present. - - Total memory managed by this memory controller attribute file: 'size_mb' @@ -432,6 +386,9 @@ Memory Type attribute file: This attribute file will display what type of memory is currently on this csrow. Normally, either buffered or unbuffered memory. + Examples: + Registered-DDR + Unbuffered-DDR EDAC Mode of operation attribute file: @@ -446,8 +403,13 @@ Device type attribute file: 'dev_type' - This attribute file will display what type of DIMM device is - being utilized. Example: x4 + This attribute file will display what type of DRAM device is + being utilized on this DIMM. + Examples: + x1 + x2 + x4 + x8 Channel 0 CE Count attribute file: @@ -522,10 +484,10 @@ SYSTEM LOGGING If logging for UEs and CEs are enabled then system logs will have error notices indicating errors that have been detected: -MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, +EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac -MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, +EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac @@ -610,64 +572,4 @@ Parity Count: -PCI Device Whitelist: - - 'pci_parity_whitelist' - - This control file allows for an explicit list of PCI devices to be - scanned for parity errors. Only devices found on this list will - be examined. The list is a line of hexadecimal VENDOR and DEVICE - ID tuples: - - 1022:7450,1434:16a6 - - One or more can be inserted, separated by a comma. - - To write the above list doing the following as one command line: - - echo "1022:7450,1434:16a6" - > /sys/devices/system/edac/pci/pci_parity_whitelist - - - - To display what the whitelist is, simply 'cat' the same file. - - -PCI Device Blacklist: - - 'pci_parity_blacklist' - - This control file allows for a list of PCI devices to be - skipped for scanning. - The list is a line of hexadecimal VENDOR and DEVICE ID tuples: - - 1022:7450,1434:16a6 - - One or more can be inserted, separated by a comma. - - To write the above list doing the following as one command line: - - echo "1022:7450,1434:16a6" - > /sys/devices/system/edac/pci/pci_parity_blacklist - - - To display what the whitelist currently contains, - simply 'cat' the same file. - ======================================================================= - -PCI Vendor and Devices IDs can be obtained with the lspci command. Using -the -n option lspci will display the vendor and device IDs. The system -administrator will have to determine which devices should be scanned or -skipped. - - - -The two lists (white and black) are prioritized. blacklist is the lower -priority and will NOT be utilized when a whitelist has been set. -Turn OFF a whitelist by an empty echo command: - - echo > /sys/devices/system/edac/pci/pci_parity_whitelist - -and any previous blacklist will be utilized. - |