On Monday, January 4, 2021 2:55:25 AM EST Burn Alting wrote:
On Sun, 2021-01-03 at 10:41 -0500, Steve Grubb wrote:
> On Friday, January 1, 2021 4:22:33 PM EST Burn Alting wrote:
> > Sometimes, events recorded in /var/log/audit/audit.log appear some
> > seconds past co- located events which results in
> > auparse:au_check_events() marking these events complete before they
> > are. An example of this can be seen below with the offending event id
> > 44609.
> >
> > This has been plaguing me for a year or two and this morning was the
> > first time I still had access to the raw audit.log files (I monitor a lot
> > of event types and the log files roll over fairly quickly).
> > The example below is from a fully patched Centos 7 but I have also seen
> > this on a patched Fedora 32.
> >
> > Has this been seen before? Do we need to re-evaluate how auparse
> > 'completes' an event (ie 2 seconds is too quick).
>
> I have never seen this. But on the way to disk, auditd only does light
> processing of the event. If the format is enriched, it looks things up
> on a record by record basis. It does not collect events until they are
> complete - it dumps it to disk as soon as it can tack on the extra
> information.
>
> So, the question would be, does this delay happen on the way to disk? Or
> is this an artifact of post processing the logs with an auparse based
> utility? Can this be observed repeatedly on the same raw logs? If so,
> then maybe auparse does have some issue. But if this is a post
> processing issue, then the wall clock doesn't matter because this event
> should have collected up together.
>
> I'd say this merits some investigation.
OK. I think this needs to be addressed on two fronts. There may be more.
A. Within post processing ... a 2 second timeout is not sufficient. I
would suggest we modify auparse.c:au_check_events() to i) perform the
event type checks first, then
ii) increase the timeout of 2 seconds to be a larger value based on
empirical tests.
In the post processing, there are 2 use cases. The first is events that are on
disk. In this usage, the 2 second timeout does not come into effect because
the events are run through probably within nanoseconds or microseconds at the
worst. The only time it would come into effect is if the terminating record
is missing.
The other use case is realtime processing as an audispd plugin. In this use
case the wall clock could matter because records could potentially get lost
due to overflows or another plugin taking too long. This is the use case where
the wall clock matters. And again, it matters when records get lost or
delayed in transit. As long as everything is flowing, it should not factor
into event processing.
B. I will build a temporary auditd daemon to perform some empirical
testing
to see how long events can reside within the daemon. I may need some
advice on this. I assume that the code that sets the timestamp is in
src/auditd.c:send_audit_event().
This is only for audit daemon's internal events. For all "real" events,
it's
set in the kernel.
If so, I will see if I can put orchestration debug code in to monitor
an
event's 'time in daemon' until this point. I will then report on this.
I believe given that AUDIT_PROCTITLE and AUDIT_EOE is fairly widespread,
then the testing switch in A. will not be a big issue (time cost wise). It
will also mean that if we over compensate the timeout that would cause
additional memory cost in auparse() then this is mittigated.
I'd suggest breaking up the event completion tests so that an exact
collection termination reason code could be associated to the event.
With respect to 'There may be more' fronts. Are there other
points in the
'audit ecosystem' that makes use of the '2 second timeout'.
Ausearch/report has its own special copy of the event collection logic. It
should be nearly identical to what auparse does.
I will start work on this, this coming weekend if the above makes
sense.
One other thought, the current shipping code is audit-3.0, doing a diff
between it and audit-2.8.5 for the auparse directory does show some
differences in event collection/grouping/next_event. A lot of the differences
are cosmetic to fix extra whitespace or indentation. But if you skip all that,
there are some real changes that probably were because of bug reports. For
example,
@@ -259,15 +260,6 @@ static event_list_t *au_get_ready_event(
if (lowest && lowest->status == EBS_COMPLETE) {
lowest->status = EBS_EMPTY;
au->au_ready--;
- // Try to consolidate the array so that we iterate
- // over a smaller portion next time
- if (lowest == &lol->array[lol->maxi]) {
- au_lolnode *ptr = lowest;
- while (ptr->status == EBS_EMPTY && lol->maxi > 0)
{
- lol->maxi--;
- ptr = &lol->array[lol->maxi];
- }
- }
return lowest->l;
}
and
@@ -1536,6 +1550,13 @@ static int au_auparse_next_event(auparse
aup_list_create(l);
aup_list_set_event(l, &e);
aup_list_append(l, au->cur_buf, au->list_idx, au-
line_number);
+ // Eat standalone EOE - main event
was already marked
complete
+ if (l->head->type == AUDIT_EOE) {
+ au->cur_buf = NULL;
+ aup_list_clear(l);
+ free(l);
+ continue;
+ }
if (au_lol_append(au->au_lo, l) == NULL) {
free((char *)e.host);
#ifdef LOL_EVENTS_DEBUG01
I don't know if those have an effect on what you are seeing. But that is the
only substantial changes that I can see.
-Steve