Hello All,
I looked at the source code for auditd and came up with the following
fix for the behavior I described.
You may have a better fix for the issue.
When the output log format is set to "NOLOG" (log_format = NOLOG in
auditd.conf), audit events appear to pile up in the internal message
queue (audit_reply_list) and are never removed after being written to
the audit dispatcher daemon. The result is that the queue grows without
bound.
I have the following potential fix for audit version 1.7.11:
In "auditd.c"
171c171
<             if (rep->reply.type != AUDIT_EOE) {
---
>             if (rep->reply.type != AUDIT_EOE &&
>                     config.log_format != LF_NOLOG) {
I've rebuilt the RPM and tried it on a RHEL 5.3 i386 system with the
2.6.18-128.el5 kernel, and all is well with auditd.
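For reference, here is a tiny self-contained model of the guarded
enqueue. The names mirror the auditd ones (distribute_event,
audit_reply_list, LF_NOLOG), but this is an illustration of the idea,
not the daemon's actual code:

#include <stdio.h>

enum log_format { LF_RAW, LF_NOLOG };
enum { AUDIT_SYSCALL = 1300, AUDIT_EOE = 1320 };

struct audit_reply { int type; };
struct auditd_reply_list { struct audit_reply reply; };
struct daemon_conf { enum log_format log_format; };

static struct daemon_conf config = { LF_NOLOG };
static unsigned long queued;  /* stand-in for audit_reply_list's length */

static void distribute_event(struct auditd_reply_list *rep)
{
    /* The fix: skip the enqueue entirely when NOLOG means nothing
     * will ever drain the list. */
    if (rep->reply.type != AUDIT_EOE && config.log_format != LF_NOLOG)
        queued++;  /* the real code appends rep to audit_reply_list */
}

int main(void)
{
    struct auditd_reply_list ev = { { AUDIT_SYSCALL } };
    for (int i = 0; i < 1000000; i++)
        distribute_event(&ev);
    printf("queued: %lu\n", queued);  /* 0 with the guard, 1000000 without */
    return 0;
}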
This may not be the best fix for the excessive memory glomming I've
seen.
Best regards,
Gary Smith
-----Original Message-----
From: Steve Grubb [mailto:sgrubb@redhat.com]
Sent: Thursday, February 05, 2009 8:33 AM
To: linux-audit@redhat.com
Cc: Lucas C. Villa Real; Smith, Gary R
Subject: Re: Problem with auditd/SnareLinux on RHEL 5.3 - auditd glomming memory
On Wednesday 04 February 2009 09:14:03 pm Lucas C. Villa Real wrote:
> 2009/2/4 Smith, Gary R <gary.smith@pnl.gov>:
> I noticed a very similar behavior when the system was under high
> stress (i.e., many rules and many remote clients generating audit
> events). After much debugging, it was found that the asynchronous
> nature of netlink made it possible for auditd's queue to grow wildly,
> until the kernel started to kill other processes due to OOM (auditd
> asks the kernel not to be killed under OOM conditions, so every
> process but auditd is shot).
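The OOM exemption mentioned there boils down to a single proc write on
kernels of that era; a minimal sketch, not auditd's exact code:

#include <stdio.h>

/* Opt the calling process out of the OOM killer on 2.6.18-era kernels
 * by writing -17 (OOM_DISABLE) to /proc/self/oom_adj. Newer kernels
 * use /proc/self/oom_score_adj with -1000 instead. Sketch only. */
static int avoid_oom_killer(void)
{
    FILE *f = fopen("/proc/self/oom_adj", "w");
    if (f == NULL)
        return -1;
    fprintf(f, "-17\n");
    return fclose(f);
}

int main(void)
{
    return avoid_oom_killer() == 0 ? 0 : 1;
}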
Yes, I think auditd gets blamed for memory that is actually consumed
inside the kernel on its behalf. I have run valgrind against the audit
daemon many times and know of no resource leaks. The only knob you
really have to turn, if the kernel queue is a problem, is to increase
the priority of the audit daemon so it gets more run time.
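If memory serves, that knob is the priority_boost setting in
auditd.conf; mechanically it amounts to lowering the daemon's nice
value at startup, roughly like this sketch (needs root to go below
nice 0):

#include <errno.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Lower the nice value by `boost` so the scheduler gives the daemon
 * more run time. Illustrative only; see auditd.conf(5) for the knob. */
static int boost_priority(int boost)
{
    errno = 0;
    int cur = getpriority(PRIO_PROCESS, 0);
    if (cur == -1 && errno != 0)
        return -1;
    return setpriority(PRIO_PROCESS, 0, cur - boost);
}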
> The reason was that audit's consumer thread -- the one that runs
> auditd-event.c:event_thread_main() -- was consuming events more
> slowly than the rate at which netlink events were sent from the
> kernel to auditd's main thread.
This is because of some requirements in Common Criteria evaluations
about knowing how many events are in flight. The input queue is simply
one event deep. If you have the audit event dispatcher running, it
gets first shot at handling the event. Then the event goes to disk. If
you have synchronous logging, the write blocks for a while, so
changing to buffered I/O might be better for throughput.
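In auditd.conf terms that is the flush setting, if I remember the
names right; the underlying tradeoff looks roughly like this sketch
(the path and batch size are made up):

#include <fcntl.h>
#include <unistd.h>

/* O_SYNC makes every write block until the data hits disk; buffered
 * writes return quickly, and an occasional fsync() bounds how much a
 * crash can lose. Sketch only. */
static int open_log(int synchronous)
{
    int flags = O_WRONLY | O_APPEND | O_CREAT;
    if (synchronous)
        flags |= O_SYNC;  /* every write blocks on disk I/O */
    return open("/var/log/audit/audit.log", flags, 0600);
}

static void write_record(int fd, const char *rec, unsigned len,
                         unsigned *since_flush)
{
    if (write(fd, rec, len) < 0)
        return;
    if (++*since_flush >= 100) {  /* hypothetical batch size */
        fsync(fd);
        *since_flush = 0;
    }
}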
> The solution we found (and which is still being tested) was to
> define a high water mark on how many events to allow in auditd's
> input queue. Given that each netlink message takes about 9 KB, one
> can set the high water mark to e.g. 500,000 to hold at most ~4.5 GB
> of events in RAM. So, when auditd reaches that high water mark, we
> ask the kernel to slow down: all further events sent by the kernel
> have a "need an ack" flag included, so that the caller process (the
> one that generated the system call being audited) is blocked until a
> reply is sent from the daemon.
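For illustration, the userspace half of a scheme like that is just a
depth check on the enqueue path; every name in this sketch is made up:

/* Hypothetical high-water-mark check; none of these names are real. */
#define HIGH_WATER 500000UL  /* 500,000 events * ~9 KB ~= 4.5 GB of RAM */

static unsigned long queue_depth;
static int throttled;

static void request_kernel_acks(void)
{
    /* Hypothetical: ask the kernel to flag further events "need an
     * ack" so audited callers block until the daemon replies. */
}

static void on_event_enqueued(void)
{
    if (++queue_depth >= HIGH_WATER && !throttled) {
        throttled = 1;
        request_kernel_acks();
    }
}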
Originally, I think David Woodhouse patched the kernel so that callers
were put on a wait queue when we hit the end of the internal queue. I
think someone removed it thinking it made the system unresponsive.
> Please let me know if that happens to be the reason for the problems
> you're having. I've been working mostly with audit 1.7.4 and kernel
> 2.6.16.16+patches, so our changes still need to be ported to a recent
> kernel and audit package before they're submitted officially (that's
> likely to happen in March, after my master's thesis deadline -- which
> is driving me crazy).
Note that the event model changed inside the audit daemon around
1.7.5. It's very different from before. In the near future, I am
planning to pull the code from audispd into auditd and use the queue
code from audispd so that the input and output sides of auditd can
really become multi-threaded. I'm thinking this would allow better
dequeuing of kernel events.
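The shape I have in mind is an ordinary condition-variable queue
between an input thread and an output thread; a generic sketch, not
the actual audispd queue code:

#include <pthread.h>
#include <stddef.h>

/* Bounded queue between an input (netlink reader) thread and an
 * output (disk writer) thread, so a slow write no longer stalls the
 * reader. */
#define QUEUE_CAP 1024

static void *ring[QUEUE_CAP];
static size_t head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

void enqueue(void *event)  /* input thread */
{
    pthread_mutex_lock(&lock);
    while (count == QUEUE_CAP)  /* back-pressure on the reader */
        pthread_cond_wait(&not_full, &lock);
    ring[tail] = event;
    tail = (tail + 1) % QUEUE_CAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

void *dequeue(void)  /* output thread */
{
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    void *event = ring[head];
    head = (head + 1) % QUEUE_CAP;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return event;
}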
-Steve