Configuring kernel dump (kdump) on SUSE Linux Enterprise Server (SLES)


Prerequisites

Kdump stores kernel core dumps under /var. The partition that /var is on must have enough available disk space for the vmcore file, which will be approximately the size of the system's physical memory. By default, the system will attempt to keep 5 vmcore files.
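
The space requirement above can be checked before a crash happens. The following is a minimal sketch, not part of kdump itself: has_room_for_vmcore is a hypothetical helper name, and the script assumes that /proc/meminfo reports MemTotal in KiB and that df supports the POSIX -P and -k options.

```shell
#!/bin/sh
# Sketch: compare free space on the filesystem holding /var with physical memory.
# has_room_for_vmcore is a hypothetical helper, not a kdump tool.

# $1 = physical memory in KiB, $2 = available space in KiB,
# $3 = number of vmcore files to allow for (defaults to 1).
has_room_for_vmcore() {
    [ "$2" -ge $(( $1 * ${3:-1} )) ]
}

# Physical memory in KiB as reported by the kernel.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
# Available space in KiB on the filesystem that holds /var.
avail_kb=$(df -Pk /var 2>/dev/null | awk 'NR==2 {print $4}')

if [ -z "$mem_kb" ] || [ -z "$avail_kb" ]; then
    echo "could not determine memory size or free space"
elif has_room_for_vmcore "$mem_kb" "$avail_kb" 5; then
    echo "/var has room for 5 vmcore files"
elif has_room_for_vmcore "$mem_kb" "$avail_kb" 1; then
    echo "/var has room for at least one vmcore file"
else
    echo "/var is too small for a full-size vmcore file"
fi
```

The factor of 5 mirrors the default number of vmcore files mentioned above. Treating each vmcore as exactly the size of physical memory is an approximation.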

Check the taint status of the kernel (recommended)

Whenever possible, kernel crashes should be reproduced using untainted kernels.

Set up magic SysRq (recommended)

A kernel core dump is triggered automatically only by a kernel oops or panic. For other kernel problems, if the system still responds to keyboard input to some degree and the feature has been enabled, a kernel core dump can be triggered manually through a "magic SysRq" keyboard combination. Typically, this means holding down three keys simultaneously: the left Alt key, the Print Screen / SysRq key, and a letter key indicating the command ('s' for sync, 'c' for core dump).

For general documentation of the "magic SysRq" feature, please refer to the Documentation/sysrq.txt file in the Linux kernel source.

To enable the magic SysRq feature permanently, edit /etc/sysconfig/sysctl and change the ENABLE_SYSRQ line to ENABLE_SYSRQ="yes". This change becomes active after a reboot. To enable the feature for the running kernel, run as root:
echo 1 > /proc/sys/kernel/sysrq
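
To confirm whether magic SysRq is active on the running kernel, the same file can be read back. This is a minimal sketch: describe_sysrq is a hypothetical helper name, and the interpretation of values other than 0 and 1 as a bitmask follows the kernel's sysrq documentation and may not apply to every kernel version.

```shell
#!/bin/sh
# Sketch: report the current magic SysRq setting of the running kernel.
# describe_sysrq is a hypothetical helper, not a system command.

describe_sysrq() {
    case $1 in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        # Assumption: other values are a bitmask of allowed functions.
        *) echo "bitmask $1: only selected functions enabled" ;;
    esac
}

# Read the live setting if the file is available.
if [ -r /proc/sys/kernel/sysrq ]; then
    describe_sysrq "$(cat /proc/sys/kernel/sysrq)"
fi
```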


Configure the system for capturing kernel core dumps (SLES 10)
  1. Install the packages kernel-kdump, kdump, and kexec-tools.

    The kernel-kdump package contains a "crash" or "capture" kernel that is started when the primary kernel has crashed and which provides an environment in which the primary kernel's state can be captured. The version of the kernel-kdump package needs to be identical to that of the kernel whose state needs to be captured.

    The kexec-tools package contains the tools that make it possible to start the capture kernel from the primary kernel. 
  2. Reserve memory for the capture kernel by passing appropriate parameters to the primary kernel.
    For the x86 and x86_64 architectures, choose the value from the table below based on the amount of physical memory.

    Memory          crashkernel=
    0 - 12 GB       64M@16M
    13 - 48 GB      128M@16M
    49 - 128 GB     256M@16M
    129 - 256 GB    512M@16M


    For the PPC64 architecture, use crashkernel=128M@32M.
    Note: For Xen installations, this parameter needs to be passed on the GRUB line for the Xen hypervisor, not on the module line for the Dom0 kernel.

    To set the parameter, start YaST and, under System, select Boot Loader. On the Section Management tab, select the default section and select Edit. Add the setting to the field labeled Optional Kernel Command Line Parameter, then select OK and Finish to save the settings.
  3. Activate the kdump system service.

    Run
    chkconfig kdump on

    or in YaST: under System, select System Services (Runlevel), select kdump, then select Enable and Finish.

  4. Reboot the system for the settings to take effect.
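
The memory reservation step can be sketched in shell. The example below is illustrative only: crashkernel_for_x86 and has_crashkernel are hypothetical helper names, the first simply encodes the x86/x86_64 values listed above, and the second checks a kernel command line for a crashkernel= parameter. On Xen the parameter is passed to the hypervisor, so the check against /proc/cmdline may not apply there.

```shell
#!/bin/sh
# Sketch: pick a crashkernel= value and verify it after the reboot.
# Both function names are hypothetical helpers, not kdump tools.

# $1 = physical memory in whole GB; prints the value from the x86/x86_64 table.
crashkernel_for_x86() {
    if [ "$1" -le 12 ]; then echo "64M@16M"
    elif [ "$1" -le 48 ]; then echo "128M@16M"
    elif [ "$1" -le 128 ]; then echo "256M@16M"
    elif [ "$1" -le 256 ]; then echo "512M@16M"
    else
        # The table does not cover systems with more than 256 GB.
        echo "not covered by the table"
        return 1
    fi
}

# $1 = a kernel command line; succeeds if it contains crashkernel=...
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# After the reboot, inspect the command line of the running kernel.
if [ -r /proc/cmdline ] && has_crashkernel "$(cat /proc/cmdline)"; then
    echo "crashkernel= is present on the kernel command line"
else
    echo "no crashkernel= parameter found on the kernel command line"
fi
```

For example, crashkernel_for_x86 32 prints 128M@16M, which is the value to add in the boot loader configuration for a machine with 32 GB of memory.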

Test local kernel core dump capture

To test the local kernel core dump capture, follow these steps.
If magic SysRq has been configured:
  1. Magic-SysRq-S to sync (flush out pending writes)
  2. Magic-SysRq-C to trigger the kernel core dump
Alternatively, without magic SysRq:
  1. Open a shell or terminal
  2. Run sync
  3. Run echo c > /proc/sysrq-trigger
Note that the 'c' must be lowercase. The system will not be responsive while the capture is being prepared and written, because the capture kernel provides only a limited, non-interactive environment.

Once the system becomes responsive again, verify that a capture file was created as /var/log/dump/<date-time>/vmcore.
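
The verification can also be done from a shell. This is a minimal sketch, assuming the dump directory layout described above (one subdirectory per dump, each containing a file named vmcore). list_vmcores is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list captured vmcore files below the dump directory.
# list_vmcores is a hypothetical helper, not a kdump tool.

# $1 = dump directory (defaults to /var/log/dump).
list_vmcores() {
    dump_dir=${1:-/var/log/dump}
    for f in "$dump_dir"/*/vmcore; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# Prints one path per captured dump, or nothing if none exist.
list_vmcores /var/log/dump
```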

Linux kernel taint mechanism

What is the kernel taint mechanism?
  • The Linux kernel maintains a "taint state" which is included in kernel error messages.
  • The taint state indicates whether something has happened to the running kernel that affects whether a kernel error or hang can be troubleshot effectively by analyzing the kernel source code.
As an example, the taint state is set when a machine check exception (MCE) has been raised, indicating that a hardware-related problem has occurred. Once the taint state of a running kernel has been set, it cannot be unset other than by reloading the kernel, that is, by shutting down and then restarting the system.


Taint flags

The taint status of the kernel not only indicates whether or not the kernel has been tainted but also indicates what type of event caused the kernel to be marked as tainted. This information is encoded through single-character flags in the string following "Tainted:" in a kernel error message.
  • P: A module with a Proprietary license has been loaded, i.e. a module that is not licensed under the GNU General Public License (GPL) or a compatible license. This may indicate that source code for this module is not available to the Linux kernel developers or to Novell's developers.
  • G: The opposite of 'P': the kernel has been tainted (for a reason indicated by a different flag), but all modules loaded into it were licensed under the GPL or a license compatible with the GPL.
  • F: A module was loaded using the Force option "-f" of insmod or modprobe, which caused a sanity check of the versioning information from the module (if present) to be skipped.
  • R: A module which was in use or was not designed to be removed has been forcefully Removed from the running kernel using the force option "-f" of rmmod.
  • S: The Linux kernel is running with Symmetric MultiProcessor support (SMP), but the CPUs in the system are not designed or certified for SMP use.
  • M: A Machine Check Exception (MCE) has been raised while the kernel was running. MCEs are triggered by the hardware to indicate a hardware-related problem, for example the CPU's temperature exceeding a threshold or a memory bank signaling an uncorrectable error.
  • B: A process has been found in a Bad page state, indicating a corruption of the virtual memory subsystem, possibly caused by malfunctioning RAM or cache memory.
The taint flags above are implemented in the standard Linux kernel. When any of them is set, the information provided in kernel error messages is not necessarily to be trusted.

In SUSE kernels, additional taint flags are implemented.
  • U: An Unsupported module has been loaded, i.e. a module which is not supported by Novell and which is not known to be supported by a third party. For example, the module is a driver that is not yet mature enough to be supportable or is a driver for an obsolete type of hardware which can no longer be tested adequately.
  • X: A module that is not supported by Novell but that is supported eXternally by a third party has been loaded into the kernel.
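
As an illustration of how the flag string can be read, the sketch below maps each letter found after "Tainted:" to a short description taken from the lists above. explain_taint_flags is a hypothetical helper name, and the sample input "P  M" is made up for the example rather than copied from a real kernel message.

```shell
#!/bin/sh
# Sketch: explain the single-character flags that follow "Tainted:".
# explain_taint_flags is a hypothetical helper, not a kernel tool.

# $1 = the flag string, for example "P  M".
explain_taint_flags() {
    # Keep only the uppercase letters and drop any padding.
    letters=$(printf '%s' "$1" | tr -cd 'A-Z')
    while [ -n "$letters" ]; do
        rest=${letters#?}          # everything after the first character
        flag=${letters%"$rest"}    # the first character itself
        case $flag in
            P) echo "P: proprietary module loaded" ;;
            G) echo "G: all loaded modules are GPL-compatible" ;;
            F) echo "F: module loaded with the force option" ;;
            R) echo "R: module forcefully removed" ;;
            S) echo "S: SMP kernel on CPUs not designed or certified for SMP" ;;
            M) echo "M: machine check exception raised" ;;
            B) echo "B: process found in a bad page state" ;;
            U) echo "U: unsupported module loaded (SUSE)" ;;
            X) echo "X: externally supported module loaded (SUSE)" ;;
            *) echo "$flag: not covered by this sketch" ;;
        esac
        letters=$rest
    done
}

# Hypothetical flag string used only for demonstration.
explain_taint_flags "P  M"
```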

Determining the taint status of a running kernel

The taint status of a running kernel can be determined by running

cat /proc/sys/kernel/tainted

When the output is 0, the kernel is not tainted. When the output is non-zero, the kernel is tainted. The value is the sum (bitwise OR) of the numeric values of all taint flags that apply. A list of the flags currently in use can be found in the kernel documentation:
cat /usr/src/linux/Documentation/sysctl/kernel.txt
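
The numeric value can be decoded with shell arithmetic. The following is a minimal sketch under stated assumptions: decode_taint is a hypothetical helper name, and the bit values used (1 for P, 2 for F, 4 for S, 8 for R, 16 for M, 32 for B) are those of the mainline kernel and should be checked against kernel.txt for the installed kernel. Higher bits, including the SUSE-specific flags, are deliberately not decoded here.

```shell
#!/bin/sh
# Sketch: turn the number from /proc/sys/kernel/tainted into flag letters.
# decode_taint is a hypothetical helper, not a kernel tool.
# Assumption: bits 0-5 have the mainline meanings noted below.

# $1 = numeric taint value.
decode_taint() {
    value=$1
    if [ "$value" -eq 0 ]; then
        echo "not tainted"
        return 0
    fi
    # Bit 0 selects between P (proprietary) and G (GPL-compatible only).
    if [ $((value & 1)) -ne 0 ]; then flags="P"; else flags="G"; fi
    if [ $((value & 2)) -ne 0 ]; then flags="${flags}F"; fi   # forced load
    if [ $((value & 4)) -ne 0 ]; then flags="${flags}S"; fi   # unsafe SMP
    if [ $((value & 8)) -ne 0 ]; then flags="${flags}R"; fi   # forced removal
    if [ $((value & 16)) -ne 0 ]; then flags="${flags}M"; fi  # machine check
    if [ $((value & 32)) -ne 0 ]; then flags="${flags}B"; fi  # bad page
    # Anything above bit 5 is reported but not interpreted.
    if [ $((value >> 6)) -ne 0 ]; then flags="${flags}+other"; fi
    echo "$flags"
}

# Decode the live value if the file is available.
if [ -r /proc/sys/kernel/tainted ]; then
    decode_taint "$(cat /proc/sys/kernel/tainted)"
fi
```

For example, a value of 18 (2 + 16) decodes to GFM: no proprietary module, a forced module load, and a machine check exception.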

When the kernel produces an error, a string detailing the taint status will be included.

Source: Novell