Hello All,
I have some systems that have just been updated from RHEL 5.2 to RHEL 5.3. The version of auditd is 1.7.7 and SnareLinux is 1.5.0.
Some time after the update ran, I noticed that the amount of free memory on the systems had dropped dramatically. Running top, I saw that auditd had sucked up a lot of memory:
top - 09:25:59 up 2 days, 2 min, 1 user, load average: 0.00, 0.03, 0.00
Tasks: 111 total, 1 running, 110 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 4043368k total, 3994488k used, 48880k free, 64656k buffers
Swap: 8385920k total, 352k used, 8385568k free, 449884k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31136 root 12 -3 6568m 3.1g 568 S 0.0 81.0 0:08.84 auditd
3861 root 15 0 26152 16m 364 S 0.0 0.4 0:00.01 restorecond
29474 root 16 0 97.9m 14m 2852 S 0.0 0.4 0:00.16 sshd
29382 root 17 0 124m 13m 1464 S 0.0 0.3 0:00.09 crond
29478 gsmith 15 0 97.9m 12m 976 S 0.0 0.3 0:00.01 sshd
5611 root 15 0 58912 9048 4208 S 0.0 0.2 3:50.38 dsmc
3937 root 15 0 96556 7588 916 S 0.0 0.2 3:54.00 sec.pl
When I first started looking at the problem, I figured the easiest fix was a reboot, since a new version of auditd had been installed along with a new kernel, leaving a mismatch between the running kernel and the auditd daemon. After rebooting, I watched a 'top' display and did not see the auditd daemon percolate up. It turns out that it does percolate to the top of the memory usage; I just didn't watch it long enough.
I wondered if there was a difference between 32-bit and 64-bit systems. As it turns out, there isn't; the growth just seems to be slower on 32-bit systems, probably due to slower CPUs.
Next, I wondered if this is due to the version of Snare running on the systems. Just to be sure, I rebuilt Snare for the 64 bit system, de-installed Snare, installed the newly rebuilt Snare, and started auditd. Again, the auditd daemon started to percolate to the top of the memory usage.
So, does this have anything to do with the number of rules that auditd has to deal with? I started trimming back the rules one at a time, restarting auditd and watching a top display. The number of rules doesn't have any effect until you get to zero rules, at which point auditd stops moving up.
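If you want to repeat this experiment without staring at top, a loop like the following makes the growth easy to see (a minimal sketch of the idea, not the exact script I used; any system with a POSIX ps should do):

```shell
#!/bin/sh
# Sketch: sample a daemon's resident set size periodically so memory
# growth is visible at a glance instead of watching top.

rss_kb() {
    # Print the resident set size, in kB, of the given pid.
    ps -o rss= -p "$1" | tr -d ' '
}

watch_rss() {
    # Print a timestamped RSS sample for pid $1 every $2 seconds,
    # until the process exits.
    while kill -0 "$1" 2>/dev/null; do
        printf '%s %s kB\n' "$(date '+%H:%M:%S')" "$(rss_kb "$1")"
        sleep "$2"
    done
}

# Example: watch_rss "$(pidof auditd)" 10
```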
Next, I ran valgrind on auditd: one run on a RHEL 5.3 system with no rules, and another with the standard set of rules we put on our systems:
criticality=0 event=execve uid=*,root
criticality=1 event=(mount,umount,umount2,settimeofday,swapon,swapoff,reboot,setdomainname,create_module,delete_module,quotactl) uid=*,root
criticality=2 event=execve exe=*passwd*
criticality=3 event=(login_auth,login_start,logout)
criticality=4 event=acct_mgmt
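For reference, the syscall-based objectives above translate very roughly into native audit.rules entries like these (my approximation only; the login/logout and account-management events come from PAM-generated audit messages, not from syscall rules, so they have no rule-file equivalent):

```
# Very rough native equivalents of the syscall-based objectives above.
# Note: umount, create_module and delete_module only exist on some
# architectures, so the exact list varies between 32-bit and 64-bit.
-a exit,always -S execve
-a exit,always -S mount -S umount2 -S settimeofday -S swapon -S swapoff -S reboot -S setdomainname -S quotactl
```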
After 2 hours of running, the former had no malloc/free problems in its leak summary. The latter, on the other hand:
==30021==
==30021== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 1)
==30021== malloc/free: in use at exit: 147,024,501 bytes in 16,262 blocks.
==30021== malloc/free: 27,612 allocs, 11,350 frees, 249,669,818 bytes allocated.
==30021== For counts of detected errors, rerun with: -v
==30021== searching for pointers to 16,262 not-freed blocks.
==30021== checked 10,651,128 bytes.
==30021==
==30021== LEAK SUMMARY:
==30021== definitely lost: 146,976,704 bytes in 16,246 blocks.
==30021== possibly lost: 36,464 bytes in 5 blocks.
==30021== still reachable: 11,333 bytes in 11 blocks.
==30021== suppressed: 0 bytes in 0 blocks.
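One thing worth noting about those numbers (my own back-of-the-envelope arithmetic): the average "definitely lost" block works out to roughly 9 kB, which is suspiciously close to the 8,988-byte receive buffer visible in the strace below, as if about one event-sized buffer were being leaked per event:

```shell
# Average size of a "definitely lost" block from the leak summary above
awk 'BEGIN { printf "%d\n", 146976704 / 16246 }'    # prints 9046
```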
Next, I ran strace on auditd to see if I could spot where all the mallocs were happening. After starting strace on a fresh instance of auditd, it eventually settled into this loop:
clock_gettime(CLOCK_MONOTONIC, {432334, 891462998}) = 0
epoll_wait(0, {{EPOLLIN, {u32=3, u64=3}}}, 64, 59743) = 1
clock_gettime(CLOCK_MONOTONIC, {432334, 891681998}) = 0
recvfrom(3, "&\0\0\0\33\5\0\0\0\0\0\0\0\0\0\0audit(1233595754"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 54
writev(5, [{"\0\0\0\0\20\0\0\0\33\5\0\0&\0\0\0", 16}, {"audit(1233595754.282:836040): c"..., 38}], 2) = 54
And so it goes. Nothing in this tight sequence leaps out at me as the place where it's glomming memory.
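One caveat about that strace session (my note, not something I verified on these boxes): malloc is a libc call, not a syscall, so a userspace leak never shows up in strace directly; it only surfaces as occasional brk or mmap calls when the heap actually grows. A follow-up would be to trace just the memory-related syscalls. The helper below only builds the command line, since actually running it needs root and a live auditd:

```shell
#!/bin/sh
# Build (but don't run) an strace invocation that watches only the
# memory-related syscalls (brk, mmap, munmap, ...) on a given pid.
# The pid 31136 is just the auditd pid from the top output above.
memtrace_cmd() {
    echo "strace -f -e trace=memory -p $1"
}

memtrace_cmd 31136
```

Running `$(memtrace_cmd "$(pidof auditd)")` against the live daemon should show a steady stream of brk/mmap calls if the heap really is growing between those recvfrom/writev pairs.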
I wondered if this odd behavior was exhibited back in RHEL 5.2 and we just never noticed. So, I installed RHEL 5.2 in a virtual machine, installed all the patches, and installed Snare. After several hours of going at it, auditd has not moved up the chart at all.
So, it appears to be an auditd/Snare interaction on RHEL 5.3. I have a RHEL 5.3 system on which I'm experimenting with auditd, rulesets, and prelude (no Snare). That ruleset is much larger by comparison (more than 200 rules), and everybody is well-behaved.
Any thoughts on what may be happening?
Best Regards,
Gary Smith