Re: Kernel oops+crash on repeated auditd restarts

Thursday, 5 April 2012

please please please keep on list.  Everything you say might help track
it down!

On Thu, 2012-04-05 at 14:03 -0700, Peter Moody wrote:
...
 (please let me know if I should take this off-list)

 One other thing (again, maybe already known), but this seems to be
 exacerbated by SMP. On my machine, I can't reproduce the crash if I
 boot with maxcpus=1.

 Still hunting.

 Cheers,
 peter

 On Tue, Apr 3, 2012 at 9:15 AM, Peter Moody <pmoody(a)google.com> wrote:
 > This may already be known, but the issue seems to be limited to watch
 > rules. With any watch rules, I can reliably crash my machine while
 > freeing a watch rule after only starting/stopping auditd a few times.
 > With no watch rules, I have no issues.
 >
 > Cheers,
 > peter
 >
 > On Wed, Mar 28, 2012 at 11:44 PM, Valentin Avram <aval13(a)gmail.com> wrote:
 >> Yes, i know that patch. It made it into kernel 3.2.2. I tested it
 >> successfully (oops in 3.2.1, no oops in 3.2.9), but this oops i'm seeing is
 >> also in 3.2.9.
 >>
 >> I monitored changelogs since 3.2.1 to 3.2.12 but there were no fixes either
 >> in audit subsystem or in fsnotify. I'll try to reproduce in latest 3.2.13
 >> and repost the oops, but i'm 99% confident it will be the same.
 >>
 >> Sadly nobody except you seems to pay attention to this problem, probably
 >> because it requires special conditions to reproduce (really, who starts and
 >> stops auditd every 5 seconds on a production server?). We only ran into it
 >> because one of our servers would randomly oops and then freeze about once
 >> a month after stopping and then starting auditd every morning (the
 >> stop-start sequence was needed to work around a bug somewhere that would
 >> hang a gzip running on a file outside a watched folder).
 >>
 >> Anyway, as a last note, i have a feeling that the oops is not exactly
 >> random, there is a pattern, just that i haven't figured it out completely
 >> yet.
 >>
 >> Will keep you up to date with the things i find out.
 >>
 >> V.
 >>
 >> On Mar 29, 2012 4:14 AM, "Eric Paris" <eparis(a)redhat.com> wrote:
 >>>
 >>> That patch fixes a BUG() .  The report has a NULL ptr deref and some
 >>> apparent list corruption....  Sadly they aren't the same....
 >>>
 >>> On Wed, 2012-03-28 at 15:42 -0700, Peter Moody wrote:
 >>> > fyi: this patch [1] seems to fix the issue for me. The explanation in
 >>> > the subject would reliably oops my machine.
 >>> >
 >>> > [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit...
 >>> >
 >>> > On Wed, Mar 28, 2012 at 1:51 PM, Peter Moody <pmoody(a)google.com> wrote:
 >>> > > Are you still able to reliably reproduce this oops? I'm trying to
 >>> > > track this down because this bug (or a very similar bug) is causing
 >>> > > some significant headaches here at work, but I haven't had a lot of
 >>> > > luck. I'm using usermode linux, though, so that might be interfering
 >>> > > with things.
 >>> > >
 >>> > > On Mon, Mar 5, 2012 at 12:35 AM, Valentin Avram <aval13(a)gmail.com> wrote:
 >>> > >> Finally i found some time and spare server to retest the oops and
 >>> > >> list_add corruptions i was getting with the 3.x kernels and auditd
 >>> > >> 2.1.3.
 >>> > >>
 >>> > >> I tested now with gentoo's latest stable 3.2.1-gentoo-r2 and
 >>> > >> kernel.org's 3.2.9.
 >>> > >>
 >>> > >> Both get the oops/BUG in the same way and after that, they keep
 >>> > >> pouring list_add corruptions with audit_prune_tre(truncated?) and
 >>> > >> auditctl as comms.
 >>> > >>
 >>> > >> Since this is not about Gentoo's kernel only, i'll post here the oops
 >>> > >> in 3.2.9 and also attach some list_add corruptions.
 >>> > >>
 >>> > >> 3.2.9 BUG:
 >>> > >>
 >>> > >> kernel: [  301.240011] BUG: unable to handle kernel NULL pointer dereference at   (null)
 >>> > >> kernel: [  301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
 >>> > >> kernel: [  301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
 >>> > >> kernel: [  301.240698] Oops: 0000 [#1] SMP
 >>> > >> kernel: [  301.240910]
 >>> > >> kernel: [  301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
 >>> > >> kernel: [  301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
 >>> > >> kernel: [  301.241498] EIP is at __list_del_entry+0x20/0xe0
 >>> > >> kernel: [  301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
 >>> > >> kernel: [  301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
 >>> > >> kernel: [  301.241879]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
 >>> > >> kernel: [  301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
 >>> > >> kernel: [  301.242207] Stack:
 >>> > >> kernel: [  301.242327]  c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
 >>> > >> kernel: [  301.242882]  ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
 >>> > >> kernel: [  301.243438]  f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
 >>> > >> kernel: [  301.243995] Call Trace:
 >>> > >> kernel: [  301.244119]  [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
 >>> > >> kernel: [  301.244248]  [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
 >>> > >> kernel: [  301.244377]  [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
 >>> > >> kernel: [  301.244504]  [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
 >>> > >> kernel: [  301.244631]  [<c1052834>] kthread+0x74/0x80
 >>> > >> kernel: [  301.244756]  [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
 >>> > >> kernel: [  301.244885]  [<c1582ab6>] kernel_thread_helper+0x6/0xd
 >>> > >> kernel: [  301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
 >>> > >> kernel: [  301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
 >>> > >> kernel: [  301.248414] CR2: 0000000000000000
 >>> > >> kernel: [  301.248538] ---[ end trace 15082dbfb353f84c ]---
 >>> > >>
 >>> > >> The kernel was compiled with the following DEBUG support (the bolded
 >>> > >> ones were requested by Gentoo's Dev):
 >>> > >> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
 >>> > >> CONFIG_SLUB_DEBUG=y
 >>> > >> CONFIG_HAVE_DMA_API_DEBUG=y
 >>> > >> CONFIG_X86_DEBUGCTLMSR=y
 >>> > >> CONFIG_PNP_DEBUG_MESSAGES=y
 >>> > >> CONFIG_AIC94XX_DEBUG=y
 >>> > >> CONFIG_USB_DEBUG=y
 >>> > >> CONFIG_DEBUG_KERNEL=y
 >>> > >> CONFIG_SCHED_DEBUG=y
 >>> > >> CONFIG_DEBUG_RT_MUTEXES=y
 >>> > >> CONFIG_DEBUG_PI_LIST=y
 >>> > >> CONFIG_DEBUG_BUGVERBOSE=y
 >>> > >> CONFIG_DEBUG_INFO=y
 >>> > >> CONFIG_DEBUG_MEMORY_INIT=y
 >>> > >> CONFIG_DEBUG_LIST=y
 >>> > >> CONFIG_DEBUG_STACKOVERFLOW=y
 >>> > >> CONFIG_DEBUG_RODATA=y
 >>> > >> CONFIG_DEBUG_RODATA_TEST=y
 >>> > >>
 >>> > >> I attached the kernel config i used for 3.2.9 to generate this oops
 >>> > >> and warnings.
 >>> > >>
 >>> > >> From the list_add warnings that come after, out of 805 warnings i
 >>> > >> processed, after masking with XXXXX the PID and next= values that kept
 >>> > >> changing in every one, i got 26 distinct MD5 sums. I also attached the
 >>> > >> relevant files as an archive to this email.
 >>> > >>
 >>> > >> The Gentoo bug i opened is sleeping, it seems nobody has the time to
 >>> > >> at least test to confirm or not the problems i'm seeing (or everybody's
 >>> > >> thinking that nobody would restart auditd so often, so the bug is
 >>> > >> not that serious).
 >>> > >>
 >>> > >>
 >>> > >> Thank you for your time.
 >>> > >>
 >>> > >> On Wed, Feb 8, 2012 at 6:11 PM, Valentin Avram <aval13(a)gmail.com> wrote:
 >>> > >>
 >>> > >>
 >>> > >> --
 >>> > >> Linux-audit mailing list
 >>> > >> Linux-audit(a)redhat.com
 >>> > >> https://www.redhat.com/mailman/listinfo/linux-audit
 >>> > >
 >>> > >
 >>> > >
 >>> > > --
 >>> > > Peter Moody      Google    1.650.253.7306
 >>> > > Security Engineer  pgp:0xC3410038
 >>> >
 >>> >
 >>> >
 >>>
 >>>
 >>
 >
 >
 >
 > --
 > Peter Moody      Google    1.650.253.7306
 > Security Engineer  pgp:0xC3410038
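
The quoted report groups 805 list_add warnings into 26 types by masking the PID and next= values with XXXXX and taking an MD5 of each warning. The exact script was not posted, so the following is only a minimal sketch of that idea. The regular expressions, the function names, and the sample warning strings are all assumptions made for illustration; real warnings span several lines and may contain other varying fields (timestamps, for instance) that would need masking too.

```python
import hashlib
import re
from collections import Counter

# Assumed field layouts; adjust to match the real log lines.
PID_RE = re.compile(r"(Pid: )\d+")
NEXT_RE = re.compile(r"(next=)[0-9a-fA-F]+")


def mask(warning: str) -> str:
    """Replace the PID and next= values with XXXXX, as described in the thread."""
    masked = PID_RE.sub(r"\g<1>XXXXX", warning)
    return NEXT_RE.sub(r"\g<1>XXXXX", masked)


def fingerprint(warning: str) -> str:
    """MD5 hex digest of the masked warning text."""
    return hashlib.md5(mask(warning).encode("utf-8")).hexdigest()


def count_types(warnings) -> Counter:
    """Map each distinct fingerprint to the number of warnings that share it."""
    return Counter(fingerprint(w) for w in warnings)


if __name__ == "__main__":
    # Hypothetical, shortened sample warnings (not taken from the attachment).
    samples = [
        "Pid: 642, comm: auditctl list_add corruption. (next=f4fae544).",
        "Pid: 917, comm: auditctl list_add corruption. (next=f4c47f58).",
        "Pid: 642, comm: audit_prune_tre list_add corruption. (next=f4fae544).",
    ]
    counts = count_types(samples)
    print(len(counts), "distinct warning types")
```

Warnings that differ only in the masked fields collapse to the same digest, so the number of keys in the counter is the number of distinct warning shapes.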

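The thread describes the trigger as stopping and starting auditd repeatedly (about every 5 seconds) while at least one watch rule is loaded. Nobody posted the reproduction script, so this is a hedged sketch only. It assumes that `service auditd stop` and `service auditd start` control the daemon on the test machine (Gentoo would use its own init script instead), and it re-adds a watch rule with `auditctl -w` after every start in case the init script flushes rules. The watched path, the rule key, and the function names are hypothetical. By default the script only prints the planned commands; on an affected kernel actually running them is expected to oops the machine, so that belongs on a disposable test box, as root.

```python
import subprocess
import time


def build_plan(watch_path, cycles, key="oops-repro"):
    """Return the argv lists for `cycles` stop/start rounds with a watch rule."""
    plan = []
    for _ in range(cycles):
        plan.append(["service", "auditd", "stop"])
        plan.append(["service", "auditd", "start"])
        # Watch writes and attribute changes under watch_path (assumed rule).
        plan.append(["auditctl", "-w", watch_path, "-p", "wa", "-k", key])
    return plan


def run_plan(plan, delay_seconds=5.0, runner=subprocess.run, sleep=time.sleep):
    """Execute each command, pausing after every auditd stop or start."""
    for argv in plan:
        # check=False: a "rule exists" error from auditctl should not abort the loop.
        runner(argv, check=False)
        if argv[:2] == ["service", "auditd"]:
            sleep(delay_seconds)


if __name__ == "__main__":
    # Dry run: show what would be executed without touching the system.
    for argv in build_plan("/tmp/watched", cycles=3):
        print(" ".join(argv))
```

To execute the plan rather than print it, call `run_plan(build_plan(path, cycles))`. Booting the same test with `maxcpus=1` would be one way to check the SMP observation made earlier in the thread.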