On Wed, Nov 5, 2008 at 4:19 PM, Steve Grubb <sgrubb(a)redhat.com> wrote:
On Wednesday 05 November 2008 11:30:16 Lucas C. Villa Real wrote:
> I'm facing a situation where -ENOBUFS is returned from both
> audit_send() and audit_get_reply(). The system is under high stress,
> with 250k files being created and having creat() and chmod() syscalls
> audited.
Is this what you really wanted to audit? :)
Yes, not a single event can be missed in the system I'm working on,
unfortunately :)
> Looking at the code in lib/netlink.c, I saw that audit_send() doesn't
> handle -ENOBUFS. Would it be possible to replace the condition from
> "while (retval < 0 && errno == EINTR)" to "while (retval < 0 && (errno
> == EINTR || errno == ENOBUFS))" to fix the problem when sending
> packets from userspace to kernel?
Have you tried that? Does it fix the problem or just hang the utility?
So far it didn't hang. However, just in case, I added a maximum number
of retries (currently set to 64). I'm about to launch a new batch to
stress the system once again, and then I'll be able to see if it works
as expected.
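
For reference, the bounded retry described above can be sketched roughly as follows. This is not the actual lib/netlink.c code: send_with_retry(), send_once_fn, MAX_SEND_RETRIES and the fake_send() stub are hypothetical names used only for illustration, and the real audit_send() calls sendto() directly rather than going through a callback.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_SEND_RETRIES 64  /* the cap mentioned above; an arbitrary choice */

/* Hypothetical callback type: performs one send attempt and returns a
 * non-negative value on success, or -1 with errno set, like sendto(2). */
typedef ssize_t (*send_once_fn)(void *ctx, const void *buf, size_t len);

/* Retry while the send fails with EINTR or ENOBUFS, but give up after
 * MAX_SEND_RETRIES retries so a persistent failure cannot hang the caller. */
static ssize_t send_with_retry(send_once_fn send_once, void *ctx,
                               const void *buf, size_t len)
{
    ssize_t retval;
    int retries = 0;

    do {
        retval = send_once(ctx, buf, len);
    } while (retval < 0 && (errno == EINTR || errno == ENOBUFS) &&
             retries++ < MAX_SEND_RETRIES);

    return retval;
}

/* Stub for experimenting without a netlink socket (hypothetical): fails
 * with ENOBUFS until failures_left reaches zero, then reports success. */
struct fake_sender {
    int failures_left;
    int calls;
};

static ssize_t fake_send(void *ctx, const void *buf, size_t len)
{
    struct fake_sender *f = ctx;

    (void)buf;
    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        errno = ENOBUFS;
        return -1;
    }
    return (ssize_t)len;
}
```

With the cap in place, a sender that never recovers is tried 65 times in total (one initial attempt plus 64 retries) before the error is returned to the caller.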
> My understanding of the problem in audit_get_reply() is that the I/O
> buffers are all full and auditd was just not scheduled at the expected
> rate, causing these buffers to overflow. Does that make sense?
If you go over the backlog limit, you get a syslog message about that unless
you have it set to ignore. My guess would be that you have a general network
memory pool depletion and that it is not related to audit specifically.
Yes. I hope that increasing auditd's priority will help to drain that.
I'll let you know if that works.
> If it does, do you have a suggestion about the best way to approach this
> problem, besides changing auditd's priority?
Increase the backlog and increase auditd's priority. I have not played with
running auditd with a different scheduler policy than whatever the default
is. But you may want to see if one of the other scheduler policies treats audit
better. Or maybe you want to tune /proc/sys/kernel/sched_granularity_ns.
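
As a rough sketch of the priority and scheduler-policy suggestions above, the two standard calls involved are setpriority(2) and sched_setscheduler(2). The helper names set_nice() and set_fifo() are hypothetical, the choice of SCHED_FIFO is only one example of an alternative policy, and nothing here is taken from auditd itself.

```c
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Set the nice value of process `pid` (0 means the calling process).
 * Lowering the value (raising priority) normally needs root or
 * CAP_SYS_NICE. Returns 0 on success, -1 with errno set. */
static int set_nice(pid_t pid, int nice_value)
{
    return setpriority(PRIO_PROCESS, (id_t)pid, nice_value);
}

/* Switch process `pid` to the SCHED_FIFO real-time policy at the given
 * priority. Normally needs root or CAP_SYS_NICE. Returns 0 on success,
 * -1 with errno set. */
static int set_fifo(pid_t pid, int rt_priority)
{
    struct sched_param param = { .sched_priority = rt_priority };

    return sched_setscheduler(pid, SCHED_FIFO, &param);
}
```

Either helper would be called with auditd's pid, for example set_nice(pid, -10) or set_fifo(pid, sched_get_priority_min(SCHED_FIFO)). A real-time policy can starve other tasks if the daemon spins, so it is worth testing on a non-production box first. The backlog itself is raised separately with auditctl -b.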
> One interesting thing which I noticed is that 'auditctl -s' doesn't
> report that messages were lost,
They weren't lost by the audit system so it doesn't know they didn't arrive.
Do you think it would make sense to add an extra member to struct
sk_buff (a pointer to a callback function) and then have
skb_queue_tail() signal if it failed to send a message? That would
allow audit to keep track of such losses, as well as any other
subsystem using netlink for communicating with userspace.
> This is happening with an old kernel, 2.6.16.46 + a bunch of patches,
> and audit 1.7.4. I cannot completely upgrade it to a new release, but
> I can certainly backport audit specific bits if you remember having
> fixed something similar since then.
Well, that proc tunable is only available for the CFS scheduler. Not sure what
you have for older kernels.
It's not available on this kernel, but I'll keep looking for other ways to
improve the responsiveness of auditd here.
Thanks!
Lucas