One last thing: moving synchronize_srcu(&fsnotify_mark_srcu) out of the
for(;;) loop in fs/notify/mark.c appears to solve the stability issues
for me. I don't know enough about kernel internals to determine whether
this does lots of other bad things to my system.
Cheers,
peter
On Tue, Apr 17, 2012 at 11:24 AM, Peter Moody <pmoody(a)google.com> wrote:
and my config.gz
On Tue, Apr 17, 2012 at 10:56 AM, Peter Moody <pmoody(a)google.com> wrote:
> Here's a trace with debugging turned way up plus a few extra printk's
> added to fs/notify/mark.c. I'm looping through private_destroy_list
> before and after the call to synchronize_srcu.
>
> I can reproduce this reliably with kvm with 2 virtual processors:
> Linux desktop 3.4.0-rc3-oops1+ #1 SMP Tue Apr 17 09:59:44 PDT 2012
> x86_64 GNU/Linux
>
> Cheers,
> peter
>
> On Thu, Apr 5, 2012 at 2:07 PM, Eric Paris <eparis(a)redhat.com> wrote:
>> please please please keep on list. Everything you say might help track
>> it down!
>>
>> On Thu, 2012-04-05 at 14:03 -0700, Peter Moody wrote:
>>> (please let me know if I should take this off-list)
>>>
>>> One other thing (again, maybe already known), but this seems to be
>>> exacerbated by SMP. On my machine, I can't reproduce the crash if I
>>> boot with maxcpus=1.
>>>
>>> Still hunting.
>>>
>>> Cheers,
>>> peter
>>>
>>> On Tue, Apr 3, 2012 at 9:15 AM, Peter Moody <pmoody(a)google.com> wrote:
>>> > This may already be known, but the issue seems to be limited to watch
>>> > rules. With any watch rules, I can reliably crash my machine while
>>> > freeing a watch rule after only starting/stopping auditd a few times.
>>> > With no watch rules, I have no issues.
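The start/stop cycle described above can be scripted. What follows is only a sketch of one possible loop, not the exact procedure used in this thread. The watch path, the key name "repro-watch", the use of `service auditd start|stop`, the cycle count and the 5 second pause are all assumptions, and whether stopping auditd actually frees the watch rule depends on how the init script is configured. Run it only on a disposable VM, since the point is to provoke an oops.

```python
import subprocess
import time


def build_commands(watch_path, key="repro-watch"):
    """Return one start / add-watch / stop cycle as argv lists.

    The key name and the use of `service` are assumptions made for
    illustration; adjust them to the init system on the test machine.
    """
    return [
        ["service", "auditd", "start"],
        ["auditctl", "-w", watch_path, "-p", "wa", "-k", key],
        ["service", "auditd", "stop"],
    ]


def run_cycles(watch_path, cycles=5, pause=5.0,
               runner=subprocess.call, sleep=time.sleep):
    """Run the cycle `cycles` times and return the commands executed.

    `runner` and `sleep` are injectable so the loop can be dry-run
    without touching auditd at all.
    """
    executed = []
    for _ in range(cycles):
        for cmd in build_commands(watch_path):
            runner(cmd)
            executed.append(cmd)
        sleep(pause)
    return executed
```

Passing `runner=print` gives a dry run that only lists the commands.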
>>> >
>>> > Cheers,
>>> > peter
>>> >
>>> > On Wed, Mar 28, 2012 at 11:44 PM, Valentin Avram <aval13(a)gmail.com> wrote:
>>> >> Yes, I know that patch. It made it into kernel 3.2.2. I tested it
>>> >> successfully (oops in 3.2.1, no oops in 3.2.9), but the oops I'm seeing
>>> >> is also in 3.2.9.
>>> >>
>>> >> I monitored the changelogs from 3.2.1 to 3.2.12 but there were no fixes
>>> >> in either the audit subsystem or fsnotify. I'll try to reproduce on the
>>> >> latest 3.2.13 and repost the oops, but I'm 99% confident it will be the same.
>>> >>
>>> >> Sadly nobody except you seems to pay attention to this problem, probably
>>> >> because it requires special conditions to reproduce (really, who starts and
>>> >> stops auditd every 5 seconds on a production server?). We only ran into it
>>> >> because one of our servers would randomly oops and then freeze about once a
>>> >> month after stopping and then starting auditd every morning (the stop-start
>>> >> sequence was needed to work around a bug somewhere that would hang a gzip
>>> >> running on a file outside a watched folder).
>>> >>
>>> >> Anyway, as a last note, I have a feeling that the oops is not exactly
>>> >> random. There is a pattern; I just haven't figured it out completely yet.
>>> >>
>>> >> I will keep you up to date with the things I find out.
>>> >>
>>> >> V.
>>> >>
>>> >> On Mar 29, 2012 4:14 AM, "Eric Paris" <eparis(a)redhat.com> wrote:
>>> >>>
>>> >>> That patch fixes a BUG(). The report has a NULL pointer dereference and
>>> >>> some apparent list corruption... Sadly they aren't the same.
>>> >>>
>>> >>> On Wed, 2012-03-28 at 15:42 -0700, Peter Moody wrote:
>>> >>> > fyi: this patch [1] seems to fix the issue for me. The explanation in
>>> >>> > the subject would reliably oops my machine.
>>> >>> >
>>> >>> > [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit...
>>> >>> >
>>> >>> > On Wed, Mar 28, 2012 at 1:51 PM, Peter Moody <pmoody(a)google.com> wrote:
>>> >>> > > Are you still able to reliably reproduce this oops? I'm trying to
>>> >>> > > track this down because this bug (or a very similar bug) is causing
>>> >>> > > some significant headaches here at work, but I haven't had a lot of
>>> >>> > > luck. I'm using usermode linux, though, so that might be interfering
>>> >>> > > with things.
>>> >>> > >
>>> >>> > > On Mon, Mar 5, 2012 at 12:35 AM, Valentin Avram <aval13(a)gmail.com> wrote:
>>> >>> > >> Finally I found some time and a spare server to retest the oops and
>>> >>> > >> list_add corruptions I was getting with the 3.x kernels and auditd 2.1.3.
>>> >>> > >>
>>> >>> > >> I have now tested with Gentoo's latest stable 3.2.1-gentoo-r2 and
>>> >>> > >> kernel.org's 3.2.9.
>>> >>> > >>
>>> >>> > >> Both get the oops/BUG in the same way, and after that they keep pouring
>>> >>> > >> out list_add corruptions with audit_prune_tre (truncated?) and auditctl
>>> >>> > >> as comms.
>>> >>> > >>
>>> >>> > >> Since this is not about Gentoo's kernel only, I'll post the oops from
>>> >>> > >> 3.2.9 here and also attach some list_add corruptions.
>>> >>> > >>
>>> >>> > >> 3.2.9 BUG:
>>> >>> > >>
>>> >>> > >> kernel: [ 301.240011] BUG: unable to handle kernel NULL pointer dereference at (null)
>>> >>> > >> kernel: [ 301.240305] IP: [<c1238dd0>] __list_del_entry+0x20/0xe0
>>> >>> > >> kernel: [ 301.240481] *pdpt = 0000000000000000 *pde = f000ddc8f000ddc8
>>> >>> > >> kernel: [ 301.240698] Oops: 0000 [#1] SMP
>>> >>> > >> kernel: [ 301.240910]
>>> >>> > >> kernel: [ 301.241030] Pid: 642, comm: fsnotify_mark Not tainted 3.2.9-drbd-version3 #1 Dell Inc. PowerEdge 2950/0CX396
>>> >>> > >> kernel: [ 301.241370] EIP: 0060:[<c1238dd0>] EFLAGS: 00010287 CPU: 6
>>> >>> > >> kernel: [ 301.241498] EIP is at __list_del_entry+0x20/0xe0
>>> >>> > >> kernel: [ 301.241623] EAX: f4fae544 EBX: f47cffa4 ECX: ffffffff EDX: 00000000
>>> >>> > >> kernel: [ 301.241751] ESI: f4fae544 EDI: f4fae508 EBP: f47cff7c ESP: f47cff64
>>> >>> > >> kernel: [ 301.241879] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>>> >>> > >> kernel: [ 301.242005] Process fsnotify_mark (pid: 642, ti=f47ce000 task=f4f47c00 task.ti=f47ce000)
>>> >>> > >> kernel: [ 301.242207] Stack:
>>> >>> > >> kernel: [ 301.242327] c10813c0 f47cffa4 f4f47c00 f4e70888 f47cff7c f47cffa4 f47cffb8 c10f6976
>>> >>> > >> kernel: [ 301.242882] ffffffc3 f4f47c00 f4f47c00 00000000 f4f47c00 c10530c0 f47cff9c f47cff9c
>>> >>> > >> kernel: [ 301.243438] f4fae544 f4fae544 f4c47f58 00000000 c10f68f0 f47cffe4 c1052834 00000000
>>> >>> > >> kernel: [ 301.243995] Call Trace:
>>> >>> > >> kernel: [ 301.244119] [<c10813c0>] ? rcu_check_callbacks+0x110/0x110
>>> >>> > >> kernel: [ 301.244248] [<c10f6976>] fsnotify_mark_destroy+0x86/0x120
>>> >>> > >> kernel: [ 301.244377] [<c10530c0>] ? abort_exclusive_wait+0x80/0x80
>>> >>> > >> kernel: [ 301.244504] [<c10f68f0>] ? fsnotify_put_mark+0x30/0x30
>>> >>> > >> kernel: [ 301.244631] [<c1052834>] kthread+0x74/0x80
>>> >>> > >> kernel: [ 301.244756] [<c10527c0>] ? kthread_flush_work_fn+0x10/0x10
>>> >>> > >> kernel: [ 301.244885] [<c1582ab6>] kernel_thread_helper+0x6/0xd
>>> >>> > >> kernel: [ 301.245011] Code: 55 f4 8b 45 f8 e9 75 ff ff ff 90 55 89 e5 53 83 ec 14 8b 08 8b 50 04 81 f9 00 01 10 00 74 24 81 fa 00 02 20 00 0f 84 8e 00 00 00 <8b> 1a 39 d8 75 62 8b 59 04 39 d8 75 35 89 51 04 89 0a 83 c4 14
>>> >>> > >> kernel: [ 301.248195] EIP: [<c1238dd0>] __list_del_entry+0x20/0xe0 SS:ESP 0068:f47cff64
>>> >>> > >> kernel: [ 301.248414] CR2: 0000000000000000
>>> >>> > >> kernel: [ 301.248538] ---[ end trace 15082dbfb353f84c ]---
>>> >>> > >>
>>> >>> > >> The kernel was compiled with the following DEBUG support (the bolded
>>> >>> > >> ones were requested by Gentoo's dev):
>>> >>> > >> CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
>>> >>> > >> CONFIG_SLUB_DEBUG=y
>>> >>> > >> CONFIG_HAVE_DMA_API_DEBUG=y
>>> >>> > >> CONFIG_X86_DEBUGCTLMSR=y
>>> >>> > >> CONFIG_PNP_DEBUG_MESSAGES=y
>>> >>> > >> CONFIG_AIC94XX_DEBUG=y
>>> >>> > >> CONFIG_USB_DEBUG=y
>>> >>> > >> CONFIG_DEBUG_KERNEL=y
>>> >>> > >> CONFIG_SCHED_DEBUG=y
>>> >>> > >> CONFIG_DEBUG_RT_MUTEXES=y
>>> >>> > >> CONFIG_DEBUG_PI_LIST=y
>>> >>> > >> CONFIG_DEBUG_BUGVERBOSE=y
>>> >>> > >> CONFIG_DEBUG_INFO=y
>>> >>> > >> CONFIG_DEBUG_MEMORY_INIT=y
>>> >>> > >> CONFIG_DEBUG_LIST=y
>>> >>> > >> CONFIG_DEBUG_STACKOVERFLOW=y
>>> >>> > >> CONFIG_DEBUG_RODATA=y
>>> >>> > >> CONFIG_DEBUG_RODATA_TEST=y
>>> >>> > >>
>>> >>> > >> I attached the kernel config I used for 3.2.9 to generate this oops
>>> >>> > >> and the warnings.
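To compare a config against the debug options listed above, a small checker along these lines may help. It is a sketch only: the default list of wanted options is an arbitrary subset of the ones quoted above, and /proc/config.gz is only present when the kernel was built with CONFIG_IKCONFIG_PROC, which is an assumption about the machine under test.

```python
import gzip

# Arbitrary subset of the options quoted in the message, for illustration.
WANTED = ("CONFIG_DEBUG_LIST", "CONFIG_DEBUG_KERNEL", "CONFIG_SLUB_DEBUG")


def parse_config(text):
    """Parse kernel .config text into a {name: value} dict.

    Comment lines such as '# CONFIG_FOO is not set' are skipped, so
    unset options are simply absent from the result.
    """
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def missing_options(text, wanted=WANTED):
    """Return the wanted options that are not set to 'y'."""
    options = parse_config(text)
    return [name for name in wanted if options.get(name) != "y"]


def read_config(path="/proc/config.gz"):
    """Read a gzip-compressed kernel config as text (path is an assumption)."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

For example, `missing_options(read_config())` lists which of the wanted options the running kernel lacks, provided /proc/config.gz exists.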
>>> >>> > >>
>>> >>> > >> From the list_add warnings that come after: out of the 805 warnings I
>>> >>> > >> processed, after masking with XXXXX the PID and next= values that kept
>>> >>> > >> changing in every one, I got 26 distinct MD5 sums. I also attached the
>>> >>> > >> relevant files as an archive to this email.
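The masking-and-hashing step described above could look roughly like the sketch below. It is not the script that produced the 26 MD5 sums. The regular expressions are guesses at the log layout, and stripping the "[ seconds.micros]" timestamp is an extra assumption added here, since otherwise every warning would hash differently.

```python
import hashlib
import re
from collections import Counter

# Assumed patterns; the real log layout may differ.
TIMESTAMP = re.compile(r"\[\s*\d+\.\d+\]\s*")
PID = re.compile(r"\b[Pp]id:\s*\d+")
NEXT = re.compile(r"\bnext=[0-9a-fA-F]+")


def mask_warning(text):
    """Replace the values that change on every warning with XXXXX."""
    text = TIMESTAMP.sub("", text)
    text = PID.sub("Pid: XXXXX", text)
    text = NEXT.sub("next=XXXXX", text)
    return text


def fingerprint(text):
    """MD5 hex digest of the masked warning text."""
    return hashlib.md5(mask_warning(text).encode("utf-8")).hexdigest()


def count_types(warnings):
    """Map each distinct fingerprint to the number of warnings sharing it."""
    return Counter(fingerprint(warning) for warning in warnings)
```

Calling `len(count_types(warnings))` on a list of warning texts gives the number of distinct warning types after masking.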
>>> >>> > >>
>>> >>> > >> The Gentoo bug I opened is sleeping. It seems nobody has the time to at
>>> >>> > >> least test and confirm or refute the problems I'm seeing (or everybody
>>> >>> > >> thinks that nobody would restart auditd so often, so the bug is not that
>>> >>> > >> serious).
>>> >>> > >>
>>> >>> > >>
>>> >>> > >> Thank you for your time.
>>> >>> > >>
>>> >>> > >> On Wed, Feb 8, 2012 at 6:11 PM, Valentin Avram <aval13(a)gmail.com> wrote:
>>> >>> > >>
>>> >>> > >>
>>> >>> > >> --
>>> >>> > >> Linux-audit mailing list
>>> >>> > >> Linux-audit(a)redhat.com
>>> >>> > >> https://www.redhat.com/mailman/listinfo/linux-audit
>>> >>> > >
>>> >>> > >
>>> >>> > >
>>> >>> > > --
>>> >>> > > Peter Moody Google 1.650.253.7306
>>> >>> > > Security Engineer pgp:0xC3410038
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>>
>>> >>>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Peter Moody Google 1.650.253.7306
>>> > Security Engineer pgp:0xC3410038
>>>
>>>
>>>
>>
>>
>
>
>
> --
> Peter Moody Google 1.650.253.7306
> Security Engineer pgp:0xC3410038
--
Peter Moody Google 1.650.253.7306
Security Engineer pgp:0xC3410038