Hello,
I would like to briefly present our plan for using audit. We have made a
prototype implementation and discovered some things along the way.
We are building middleware for ATC systems, written in Ada and partially
in Python; the prototype code is in Python.
For that, one problem is to uniquely identify a process that communicated
with the outside world. We have settled on the process start date. That
date can be determined in a way that is stable (using the /proc/stat btime
field, the ELF note for the Hertz value, and then translating the ticks
from /proc/<pid>/stat into a date) and reproducible from outside the
process. Given the pid and start_date, we can reliably check whether a
process is still alive. The method is notably different from what ps does,
which (or so I propose after looking at the source) may output different
start times in different runs.
We have a daemon running that may or may not fork the processes it
monitors. For the communicating ones, we want to be able to tell everybody
in the system (spanning several nodes) that a communication partner is no
more; for the non-communicating ones, we simply want to observe and report
whether e.g. ntpd or some monitoring/working shell script is running.
The identifier hostname/pid/start_date is therefore what we call a "life"
of a process. A process may restart, but the pid won't wrap around within
one tick; that is at least the limiting assumption.
Now, one issue I see is that the times we get from auditd through the
socket from its child daemon may not match the start_date exactly, though
I think they could. Actually, we would prefer to receive the tick at which
a process started instead of a fork event dated with an absolute time,
because then we could apply our own code to calculate the stable time.
Alternatively, it would be nice to know how the time value from auditd
comes into existence. In principle, for every event we should get the tick
rather than a date, or at least both. Ticks are the real kernel time,
aren't they?
Currently we feel we should apply a delta around the times to match them,
and that seems unstable to me. We would prefer the delta to be 0;
otherwise we may e.g. run into pid number wrap-arounds much more easily.
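The workaround we currently have in mind looks roughly like this (the tolerance value and names are illustrative; we would prefer not to need this at all):

```python
TOLERANCE = 1.0  # seconds; we would prefer this to be 0

def matches(event_time, start_date, tolerance=TOLERANCE):
    # Accept an auditd event timestamp as referring to a process with the
    # given start date if the two lie within the tolerance window. A
    # non-zero window is what makes pid reuse dangerous: two distinct
    # lives of the same pid can fall into one window.
    return abs(event_time - start_date) <= tolerance
```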
The other thing is sequence numbers. We see a sequence number in the
output for each audit event. Very nice. But can you confirm where these
sequence numbers are created? Is that done in the kernel, in auditd, or in
its child daemon? The underlying question is how confident we can be that
we didn't miss anything when the sequence numbers don't suggest so. We
would like to use the lossless mode of auditd. Does that simply mean that
auditd may fall behind in the worst case?
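For reference, the gap check we have in mind looks roughly like this, assuming the serial in msg=audit(time:serial) increases by one per event on a given stream (that assumption is exactly what we would like you to confirm):

```python
import re

# The serial is the number after the ':' in msg=audit(time:serial).
_SERIAL_RE = re.compile(r"audit\(\d+\.\d+:(\d+)\)")

class GapDetector:
    def __init__(self):
        self.last = None

    def feed(self, record):
        # Return how many events were missed before this record (0 if none).
        m = _SERIAL_RE.search(record)
        if m is None:
            return 0
        serial = int(m.group(1))
        missed = 0
        if self.last is not None and serial > self.last + 1:
            missed = serial - self.last - 1
        self.last = serial
        return missed
```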
We first looked at auditd 1.2 (RHEL 3), auditd 1.6 (RHEL 5/Ubuntu), and
auditd 1.7 (Debian and self-compiled for RHEL 5.2). The format underwent
important changes, and it seems that 1.7 is much friendlier to parse. Can
you confirm that a type=EOE record delimits every event (and is "event"
even the correct term here, or is it called an audit trace)?
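The parsing assumption we are making for 1.7, sketched (illustrative only; we assume records sharing one serial form one event and that a type=EOE record closes it):

```python
import re

_SERIAL_RE = re.compile(r"audit\([\d.]+:(\d+)\)")

def group_events(records):
    # Yield one list of records per audit event, grouped by serial and
    # emitted when the corresponding type=EOE record arrives.
    pending = {}
    for record in records:
        m = _SERIAL_RE.search(record)
        if m is None:
            continue
        serial = int(m.group(1))
        if record.startswith("type=EOE"):
            yield pending.pop(serial, [])
        else:
            pending.setdefault(serial, []).append(record)
```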
We can't build the RPM due to dependency problems, so I went the hard
way: ./configure --prefix=/opt/auditd-1.7, and that seems to work fine on
our RHEL 5.2. What's not so clear to me is which kernel dependency there
really is. Were there interface changes at all? The changelog didn't
suggest so.
BTW: Release-wise, will RHEL 5.3 include the latest auditd? That is our target
platform for a release next year, and it sure would be nice not to have to
fix up the audit installation.
One thing I observed with 1.7.4-1 from Debian Testing amd64 is that we
never see any clone events on the socket (and no forks either, but we
only know of cron doing those anyway), although all execs and exit_groups
arrive.
The rules we use are:
# First rule - delete all
-D
# Increase the buffers to survive stress events.
# Make this bigger for busy systems
-b 320
# Feel free to add below this line. See auditctl man page
-a entry,always -S clone -S fork -S vfork
-a entry,always -S execve
-a entry,always -S exit_group -S exit
Very strange. It works fine with the self-compiled 1.7 on RHEL 5.2. I
understand that you are not Debian guys; I just wanted to ask briefly
whether you are aware of anything that could cause this. Otherwise I am
going to report it as a bug (to them).
In our rules file, we have grouped only similar-purpose syscalls that we
care about. The goal is to track all newly created processes, their exits,
and the code they run. If you are aware of anything we miss, please point
it out.
Also, is it true (I read that yesterday) that every syscall is slowed
down by every additional rule? Would that mean we are making a mistake by
not having only one line? And is open() performance really affected by
this? Does audit not (yet?) use another tracing interface like SystemTap,
etc., where people try to have zero cost for inactive traces?
Also, on a general basis: do you recommend using the sub-daemon for the
job, or should we rather use libaudit directly? Any insight is welcome
here.
What we would like to achieve is:
1. Monitor every created process and whether it is (or was) relevant to
something. We don't want to miss a process, however briefly it ran.
2. We don't want to poll periodically, but rather only wake up (with
minimal latency) when something interesting happens. We would, however,
want a periodic check that forks are still being reported, so that we can
detect a loss of service from audit.
3. We don't want to lose or miss anything, even if load gets higher,
although we don't require surviving a fork bomb.
Sorry for the overlong email. We just hope you can help us identify how to
make best use of audit for our project.
Best regards,
Kay Hayen