2 weeks ago I wrote a model to go looking for certain kinds of problems in
kerberos. The results were that it's probably leaking memory. And on the
client side, I don't think it was fully resetting all the kerberos variables
on failure - which may be contributing to the problems.
Well, in my experience that isn't the problem, see below.
> This is on RHEL 7 which ships with audit-2.8.5, but as far as
> I can tell the relevant code hasn't changed much from there to what
> is on GitHub.
There are differences. I'd trust the current code in github more than the old
code.
Weeeelll ....
I took a look. There are 32 commits between 2.8.5 and HEAD for
audisp-remote.c. It looks like they break down as:
- 4 Kerberos/GSS memory leak fixes
- 3 whitespace/typo fixes
- 4 warning fixes
- 4 misc code cleanups
- 6 code changes related to configuration or moving things around
The remaining ones that might affect the network connection:
b6c474b22f6e - audisp-remote: fix hang with disk_low_action=suspend (#254)
We did have this happen once, so yes, definitely an issue. But that wasn't
the major cause of our problems.
3e45aa959d55 - In audisp-remote, fixup remote endpoint disappearin in ascii forma
We use managed format so that isn't an issue.
10dde069d1a - Dont look for stop on exit while draining the queue
That only affects things when audisp-remote is exiting, not during the
main loop.
9debebcc066 - Fixup krb5 broken by T_KRB5 & T_TCP separation
"Maybe" would cause an issue, but ... see below!
In github, the first connection should do unlimited retries.
Sure, but at least _for us_, that isn't the issue. It's when a connection
is lost after the first one.
> - If the connection is lost for almost any reason (see below), the
> connection is never retried using the default configuration. There might
> be some corner cases where a retry can happen, but in my experience that
> is rare. Once it's gone, it never gets retried, and audit messages build
> up until the queue overflows.
The behavior for what to do became a configuration item around 3.0.
Ummm ... so I am trying to understand what you mean there. Yes, I see
there is a configuration item for _startup_ errors added since 2.8.5,
but like I said that's not the problem we encounter.
> - In theory if a graceful shutdown is received by audisp-remote (either
> a zero-length read or a "ENDING" audit message), then retries can
> happen; this is indicated by the "remote_ended" flag in the code.
This would happen if, for example, the aggregating server needed to reboot.
Right, but, like I said in my original message, at least during my
testing that never happened.
It is advisable to use the heartbeat option. This way each end can detect the
other "disappeared" for some reason.
Well, the default configuration is that heartbeats are turned off, so
the general impression I would take away from that is you should only
turn on heartbeats if you have some unusual requirement.
> The key issue seems to be in this part of the loop in main() (this section
> is entered when audisp-remote receives an audit record):
[...]
> In short, when a new audit record is received, init_transport()
> (which tries to connect to the audit server) is only called _IF_ the
> connection is down (transport_ok == 0) _and_ remote_ended is true _and_
> remote_ending_action is set to FA_RECONNECT (the default) _or_ there
> hasn't been at least one successful connection (connected_once == 0).
>
> The problem with that is at least in our environment remote_ended is
> never set to 1, so when the connection drops it is never retried, and
> there aren't any other entry points in the normal event loop that would
> ever cause the connection to retry.
I want to think this has been fixed in the current code. It is one of the
subtle changes since 2.8.5.
I ... do not believe this is true!
Generally when there is a communication problem then transport_ok is set
to 0 and sock is set to -1 (stop_transport() does this).
In the main loop, if sock == -1 it is never set in the fd_set, so you
never try to send anything. Even if you _do_ happen to call send_one(),
it will return if transport_ok == 0. The only time init_transport()
is called in the main loop is if transport_ok == 0 _and_
remote_ended == 1, and like I said we never get remote_ended == 1
even with an auditd server reboot.
So, really ... _if_ heartbeats are _not_ set, I can't see a code path
that would ever result in a reconnect. I'd love to be proven wrong!
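To make that concrete, here is a small stand-alone model of the reconnect
test as I read it. This is a sketch, not a copy of audisp-remote.c: the
struct, the function name should_reconnect(), and the FA_OTHER placeholder
are made up for illustration, and the grouping of the and/or terms is my
assumption based on the description quoted earlier.

```c
#include <stdbool.h>

/* Stand-in for the real failure-action enum. Only FA_RECONNECT is named in
 * this thread; FA_OTHER is a hypothetical placeholder for any other action. */
enum ending_action { FA_OTHER, FA_RECONNECT };

/* Hypothetical bundle of the flags discussed in this thread. */
struct link_state {
	bool transport_ok;     /* connection currently usable */
	bool remote_ended;     /* graceful shutdown seen from the server */
	bool connected_once;   /* at least one successful connection so far */
	enum ending_action remote_ending_action;
};

/* Assumed grouping:
 * down AND ended AND (reconnect action OR never connected). */
static bool should_reconnect(const struct link_state *s)
{
	return !s->transport_ok && s->remote_ended &&
	       (s->remote_ending_action == FA_RECONNECT || !s->connected_once);
}
```

With transport_ok == 0, remote_ended == 0 and connected_once == 1, this
returns false no matter what remote_ending_action is set to, which matches
what we see: the link drops and nothing ever retries.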
This actually should be easy to test: just make sure heartbeats
are not configured and send audisp-remote a HUP signal; that will call
stop_transport(), and then see if the connection is ever reconnected or
not. That should act the same whether or not you're using GSS-API.
If the answer is "you should use heartbeats", well ... fair enough.
But it might be worth making those the default, and maybe make sure
if the transport is down you're trying a heartbeat no matter what the
heartbeat interval (because right now the code as written will only send
a heartbeat _if_ you don't get any audit events within the heartbeat
interval; that means if you're continually getting audit events you'll
never trigger a heartbeat).
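Roughly what I have in mind, as a sketch only. The function names, the
wait_timed_out flag, and the retry_interval rate limit are all hypothetical
and not taken from the audit source; the first function models the
behaviour as I described it above, the second the change I'm suggesting.

```c
#include <stdbool.h>
#include <time.h>

/* Behaviour as described above: a heartbeat goes out only when the wait
 * for input timed out, i.e. no audit event arrived within the interval.
 * A heartbeat_timeout of 0 models "heartbeats disabled". */
static bool heartbeat_due_on_idle(bool wait_timed_out,
				  unsigned heartbeat_timeout)
{
	return heartbeat_timeout > 0 && wait_timed_out;
}

/* Suggested behaviour: additionally probe whenever the transport is down,
 * even with a steady stream of events, but no more often than every
 * retry_interval seconds (the rate limit is my assumption). */
static bool probe_due(bool wait_timed_out, unsigned heartbeat_timeout,
		      bool transport_ok, time_t now, time_t last_probe,
		      time_t retry_interval)
{
	if (heartbeat_due_on_idle(wait_timed_out, heartbeat_timeout))
		return true;
	return !transport_ok && (now - last_probe) >= retry_interval;
}
```

The point of the second function is that a busy client with a dead
connection still gets a retry, which the idle-only check never gives you.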
I know there are people on this list that are using it reliably in
production. But, the problems were worked out mostly in the 3.0 release. The
kerberos code is donated code. I have not personally tested it myself due to
the problems in setting up the infrastructure. But from my review 2 weeks
ago, it looks like it would have problems in any error situation. I committed
some updates today which should make krb5 support better.
I would like to speak to those people who use it reliably in production!
Specifically, do they have heartbeats configured?
As long as I have you ... there is one additional issue I think that
is worth mentioning. If you have GSS configured you can hang an aggregation
server hard by doing:
% telnet aggregation-server 60
The problem is while nearly all of auditd uses a libev event loop, the
function ar_read() calls read() without a timeout, and it blocks and
none of the other connections get serviced. This can happen if you
are doing something like network scanning, or you have a misconfigured
audisp-remote client. I think the only long-term solution there is to
make sure ar_read (or maybe recv_token()) uses the ev event loop;
I know that's not easy.
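As a stopgap illustration (not the libev integration, and not code from
auditd), a read that gives up after a timeout could look something like
the following. The name read_with_timeout() and the choice of ETIMEDOUT
are my own; it just puts a plain POSIX poll() in front of read().

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable,
 * then read. Returns bytes read, 0 on EOF, or -1 with errno set
 * (ETIMEDOUT if nothing arrived in time). */
static ssize_t read_with_timeout(int fd, void *buf, size_t len,
				 int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc;

	/* Retry if interrupted by a signal. Note this restarts the full
	 * timeout each time, which is good enough for a sketch. */
	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc < 0)
		return -1;
	if (rc == 0) {
		errno = ETIMEDOUT;
		return -1;
	}
	return read(fd, buf, len);
}
```

This still stalls the other connections for up to timeout_ms per silent
peer, so it only bounds the hang; it doesn't replace moving the read into
the event loop.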
The non-kerberos code has been heavily tested. You might try that to see if
it works better. But if you are on the old code, there were problems fixed in
the 3.0 release. I think people using it are not using the krb5 code and
create a vpn or ssh tunnel for encryption.
Well, it's a large effort to use a non-vendor RPM here _and_ the STIGs
mandate the use of krb5 with audisp-remote (I know people have asked
for exceptions successfully, but having been involved with that process
I know the fewer exceptions you ask for, the better). Just from my
analysis the core networking code hasn't really changed in any way that
would change the basic problem. Like I said, I am open to being proven
wrong! I'd be interested in hearing from others who have used audisp-remote
successfully in production, Kerberos or not.
--Ken