So we've been struggling with getting audisp-remote working in a
reliable manner. In summary, it works but the networking seems fragile.
We are using Kerberos authentication with audisp-remote, but that
doesn't seem to be related to the fragility (sadly the Kerberos support
does make it trivial to completely hang the server, but that's another
issue). This is on RHEL 7 which ships with audit-2.8.5, but as far as
I can tell the relevant code hasn't changed much from there to what
is on GitHub.
After staring at the code a lot and doing some experiments, here's what
I believe to be true. I'll gladly take corrections for anything I get
wrong.
- If a connection has _never_ been made successfully by audisp-remote,
it will retry the connection (in theory there's a limit to retries,
but that seems to be per-message; it will retry on every new message).
Fine, that seems reasonable.
- If the connection is lost for almost any reason (see below), the connection
is never retried using the default configuration. There might be some
corner cases where a retry can happen, but in my experience that is rare.
Once it's gone, it never gets retried, and audit messages build up until
the queue overflows.
- In theory if a graceful shutdown is received by audisp-remote (either
a zero-length read or an "ENDING" audit message), then retries can
happen; this is indicated by the "remote_ended" flag in the code. But
in my experience that is rare; during my experiments when I rebooted
our audit server that message was never sent (I guess the audit server
stop was received after the interfaces were shut down). If the audit
server crashes or you have a network failure, you end up getting an
error on a write, the network is marked down, and you end up in the
never-retry state.
- If you turn on heartbeats via heartbeat_timeout, the network connection
_will_ retry when a heartbeat is sent. However, the subtle issue here
is that a heartbeat is only sent when there are no incoming audit messages
within the heartbeat timeout.
The key issue seems to be in this part of the loop in main() (this section is
entered when audisp-remote receives an audit record):

    // See if input fd is also set
    if (FD_ISSET(ifd, &rfd)) {
        do {
            if (remote_fgets(event, sizeof(event), ifd)) {
                if (!transport_ok && remote_ended &&
                    (config.remote_ending_action == FA_RECONNECT ||
                     !connected_once)) {
                    quiet = 1;
                    if (init_transport() == ET_SUCCESS) {
                        remote_ended = 0;
                        connected_once = 1;
                    }
                    quiet = 0;
                }

In short, when a new audit record is received, init_transport()
(which tries to connect to the audit server) is only called _IF_ the
connection is down (transport_ok == 0) _and_ remote_ended is true _and_
either remote_ending_action is set to FA_RECONNECT (the default) _or_
there hasn't been at least one successful connection (connected_once == 0).
The problem is that, at least in our environment, remote_ended is
never set to 1, so when the connection drops it is never retried, and
there aren't any other entry points in the normal event loop that would
ever cause the connection to retry.
The heartbeat code calls relay_event() directly (code that sends audit
events normally calls send_one() which returns if transport_ok is false)
and relay_event() calls either relay_sock_ascii() or relay_sock_managed()
and those two functions will call init_transport() if the network
connection is down. But as mentioned above, you need to make sure that
you try to send a heartbeat every so often; if you have a server generating
audit messages constantly then there won't be a heartbeat if you set the
heartbeat timeout too high.
You _can_ get a network connection retry if you encounter an error
inside of relay_sock_ascii() or relay_sock_managed(); I can't say
that didn't happen with us, but it sure seemed like it wasn't sufficient
and having the transport marked as failed was inevitable.
So, I guess my questions are:
- Is this all accurate?
- Is this how it's SUPPOSED to be? At least for us, network glitches
happen enough that most of our hosts ended up with overflowing
audisp-remote queues. Setting the heartbeat timeout seems to have
resolved that (but it took a little experimentation to figure out
the right value). It just seems surprising that it was easy to get
into a situation where you'd never retry a connection.
--Ken