On 10/22/2014 04:29 PM, Paul Moore wrote:
> Well, like I said, it's probably safer that way as the code will work
> regardless. Time to break bad habits :)
I hear you. But there's working and there's working well.
As long as we don't suffer search-response degradation by changing the
assumed ordering, as I said, I'm OK with going back and reworking the
code. If it makes searching real data unusable, though, it has broken
some operational workflows.
As it is, I already have to move the last 24 hours off to a different
area and make searching go day-by-day only; otherwise the logs are too big.
I chose a 100MB file size a while back as a sweet spot, based on the
observation that a file that size can be parsed in a time acceptable to
security managers.
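(For reference, the rotation itself is just the stock auditd.conf knobs;
roughly the following, though the num_logs value here is illustrative
rather than what I actually run:)

# /etc/audit/auditd.conf (excerpt)
# rotate each log at 100MB instead of suspending or overwriting
max_log_file = 100
max_log_file_action = ROTATE
num_logs = 42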
Just some info on the amount of data I've seen in real-world deployments:
On a normally busy site we generate ~1-4GB of events per day. It used to
be much more, but our team has managed to keep it down; I'd say that
often more time is spent reducing the events than analyzing them. If you
don't do that, they grow to a size you simply cannot parse in
people-time. We've also become more adept at spotting and preventing
event storms.
Here is a current test machine's audit log directory. I still have to go
look at what went wrong (obviously something did), but this kind of
thing can and does happen in the real world.
Note the timestamps: 100MB every minute or two is moving right along.
We push all audit data off to a collector machine so it doesn't dominate
the disks or otherwise hog the business servers' resources. BTW - funny
that I just happened to log in to this machine to check its status.
Whatever was running amok stopped soon after 00:57. This sometimes just
happens; I've seen it go on for hours more in fielded systems. Almost
always something is really wrong, but the system is supposed to maintain
security and stay usable through these times as well.
[root@audit audit]# ls -alt
total 4179016
-rw-------. 1 root root 84835735 Oct 23 02:49 audit.log
drwxr-x---. 2 root root 4096 Oct 23 00:57 .
-r--------. 1 root root 104857789 Oct 23 00:57 audit.log.1
-r--------. 1 root root 104857663 Oct 23 00:56 audit.log.2
-r--------. 1 root root 104857639 Oct 23 00:55 audit.log.3
-r--------. 1 root root 104857608 Oct 23 00:54 audit.log.4
-r--------. 1 root root 104857874 Oct 23 00:52 audit.log.5
-r--------. 1 root root 104857864 Oct 23 00:51 audit.log.6
-r--------. 1 root root 104857670 Oct 23 00:50 audit.log.7
-r--------. 1 root root 104857740 Oct 23 00:49 audit.log.8
-r--------. 1 root root 104857644 Oct 23 00:48 audit.log.9
-r--------. 1 root root 104857871 Oct 23 00:47 audit.log.10
-r--------. 1 root root 104857602 Oct 23 00:46 audit.log.11
-r--------. 1 root root 104857755 Oct 23 00:45 audit.log.12
-r--------. 1 root root 104857907 Oct 23 00:44 audit.log.13
-r--------. 1 root root 104857973 Oct 23 00:43 audit.log.14
-r--------. 1 root root 104857632 Oct 23 00:42 audit.log.15
-r--------. 1 root root 104857843 Oct 23 00:42 audit.log.16
-r--------. 1 root root 104857769 Oct 23 00:41 audit.log.17
-r--------. 1 root root 104857864 Oct 23 00:40 audit.log.18
-r--------. 1 root root 104857862 Oct 23 00:39 audit.log.19
-r--------. 1 root root 104857757 Oct 23 00:38 audit.log.20
-r--------. 1 root root 104857601 Oct 23 00:36 audit.log.21
-r--------. 1 root root 104857663 Oct 23 00:35 audit.log.22
-r--------. 1 root root 104857711 Oct 23 00:34 audit.log.23
-r--------. 1 root root 104857806 Oct 23 00:33 audit.log.24
-r--------. 1 root root 104857722 Oct 23 00:32 audit.log.25
-r--------. 1 root root 104857684 Oct 23 00:31 audit.log.26
-r--------. 1 root root 104857761 Oct 23 00:30 audit.log.27
-r--------. 1 root root 104857975 Oct 23 00:29 audit.log.28
-r--------. 1 root root 104857702 Oct 23 00:28 audit.log.29
-r--------. 1 root root 104857771 Oct 23 00:27 audit.log.30
-r--------. 1 root root 104857990 Oct 23 00:25 audit.log.31
-r--------. 1 root root 104857629 Oct 23 00:24 audit.log.32
-r--------. 1 root root 104857663 Oct 23 00:23 audit.log.33
-r--------. 1 root root 104857852 Oct 23 00:22 audit.log.34
-r--------. 1 root root 104857841 Oct 23 00:21 audit.log.35
-r--------. 1 root root 104857634 Oct 23 00:20 audit.log.36
-r--------. 1 root root 104857690 Oct 23 00:19 audit.log.37
-r--------. 1 root root 104857623 Oct 23 00:18 audit.log.38
-r--------. 1 root root 104857711 Oct 23 00:17 audit.log.39
-r--------. 1 root root 104857867 Oct 23 00:02 audit.log.40
I started an "aureport -i --summary". I realize my aureport and ausearch
don't use the auparse library, but they will soon (if they don't
already) in the newer code.
I just wanted to give you a feel for the data volumes involved in your
considerations. Also, I have designs for some new tools I want to
deploy, which definitely will use libauparse, if it's usable at this scale.
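To give a concrete idea, those tools would be built around the basic
auparse event loop. A minimal sketch against the installed logs - the
bare event count is just a placeholder for whatever per-record work a
real tool would do, and it needs read access to /var/log/audit:

/* count_events.c - walk the system audit logs with libauparse.
 * Build: gcc count_events.c -o count_events -lauparse
 */
#include <stdio.h>
#include <auparse.h>

int main(void)
{
    /* AUSOURCE_LOGS reads the same rotated files aureport does. */
    auparse_state_t *au = auparse_init(AUSOURCE_LOGS, NULL);
    if (au == NULL) {
        perror("auparse_init");
        return 1;
    }

    unsigned long events = 0;
    while (auparse_next_event(au) > 0) {
        events++;
        /* Walk each record of the event; a real tool would pull
         * fields here with auparse_find_field() and
         * auparse_interpret_field(). */
        if (auparse_first_record(au) == 1) {
            do {
                (void)auparse_get_type(au);
            } while (auparse_next_record(au) > 0);
        }
    }

    printf("%lu events parsed\n", events);
    auparse_destroy(au);
    return 0;
}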
At 4GB+ of audit data, this report isn't returning yet. The nice thing
is it isn't leaking memory, or if it is, it isn't gushing (from "top"):
31485 root 20 0 217m 104m 89m R 99.7 0.4 22:29.91 aureport
So I guess, from a real user's perspective, I'm just worried that fixing
an order dependency will have adverse effects on retrieval speed.
My vote would be to promote anything that makes searching/parsing
faster.
Or we need a different paradigm entirely.
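On the ordering point, for what it's worth: the order-independent way to
match a field with auparse is its built-in ausearch_* interface, which
finds the field wherever it sits in the event. A minimal sketch - the
uid=0 rule is just an example:

/* find_uid0.c - order-independent field matching via auparse.
 * Build: gcc find_uid0.c -o find_uid0 -lauparse
 */
#include <stdio.h>
#include <auparse.h>

int main(void)
{
    auparse_state_t *au = auparse_init(AUSOURCE_LOGS, NULL);
    if (au == NULL) {
        perror("auparse_init");
        return 1;
    }

    /* Match any event containing uid=0, wherever the field appears;
     * AUSEARCH_RULE_CLEAR starts a fresh rule set. */
    if (ausearch_add_item(au, "uid", "=", "0", AUSEARCH_RULE_CLEAR)) {
        fprintf(stderr, "ausearch_add_item failed\n");
        auparse_destroy(au);
        return 1;
    }
    ausearch_set_stop(au, AUSEARCH_STOP_EVENT);

    unsigned long hits = 0;
    while (ausearch_next_event(au) > 0) {
        hits++;
        auparse_next_event(au);  /* step past the match */
    }

    printf("%lu matching events\n", hits);
    auparse_destroy(au);
    return 0;
}

If something like that stays fast on multi-GB logs, the field order
inside a record shouldn't matter to callers at all.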
I've been doing some other stuff while writing this. The aureport isn't
done after 37+ minutes:
31485 root 20 0 220m 96m 78m R 100.0 0.4 37:22.38 aureport
I realize that if I were sitting inside a Google-scale infrastructure
this would be different, but I'm not.
So when analysts try to understand what happened during, in this case,
the first hour of the day, the time it takes to return an answer can
make them think the tool is broken.
This is why parse/retrieval speed is so important to me.
The aureport is still cranking:
31485 root 20 0 221m 76m 57m R 100.0 0.3 44:21.15 aureport
Now if a few queries happen at the same time - and they can - we start
seeing some real work on this machine.
It's a reasonable machine - dual quad-core Xeon with 24GB of RAM -
dedicated to storage and retrieval of system audit data (from "cat
/proc/cpuinfo"):
...
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5677 @ 3.47GHz
stepping : 2
cpu MHz : 1600.000
cache size : 12288 KB
So hopefully this gives you a sense of the way audit data can stack up;
to me, every millisecond saved per event is potentially hours earned.
Report is still running btw:
31485 root 20 0 222m 103m 83m R 99.9 0.4 56:18.64 aureport
From what I've read on the list, there are other users who have similar
loads.
I'm not sure if this makes any real difference in your thought process
or not, and you probably knew most of this anyway.
Just wanted to throw out some data points for consideration in case they
matter.
I waited until this morning to send this so I could see how long the
aureport took.
[root@audit audit]# time aureport -i --summary
Summary Report
======================
Range of time in logs: 10/23/2014 00:01:02.305 - 10/23/2014 09:58:40.745
Selected time for report: 10/23/2014 00:01:02 - 10/23/2014 09:58:40.745
Number of changes in configuration: 2892
Number of changes to accounts, groups, or roles: 0
Number of logins: 47
Number of failed logins: 2
Number of authentications: 2712
Number of failed authentications: 355
Number of users: 8
Number of terminals: 41
Number of host names: 7
Number of executables: 112
Number of files: 536772
Number of AVC's: 5791
Number of MAC events: 3045
Number of failed syscalls: 2502700
Number of anomaly events: 525
Number of responses to anomaly events: 0
Number of crypto events: 14
Number of keys: 36
Number of process IDs: 30494
Number of events: 8526658
real 428m11.405s
user 427m21.833s
sys 0m4.014s
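So that's roughly 8.5 million events in a bit over 7 hours - around 330
events/second on this box, for scale.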
Thank you!
LCB
--
LC (Lenny) Bruzenak
lenny(a)magitekltd.com