On Fri, 2010-11-19 at 11:20 -0500, Steve Grubb wrote:
> I didn't answer right away because I didn't have a good answer for you.
> If the storm is large enough to overrun the kernel queue, the rate
> limiting needs to be in the kernel. If auditd is able to handle the
> load, then perhaps you need an analysis plugin that performs whatever
> action you deem best.
Steve,
I understand; it isn't a straightforward thing and I appreciate you
thinking about it. I think I have settled on a workable solution.
I am using the audisp af_unix builtin and I am sampling the AVC events.
I've got a non-blocking mechanism whereby I can count the AVCs from a
small number of senders. Then I can take action against the offenders
(kill them). It's not perfect and has issues, but it might be satisfactory.
I'm still testing this sampling approach, making certain I don't
introduce any blockage points, which would aggravate the issue.
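
To give the shape of it, here is a minimal sketch of the kind of counter
I mean (this is not my actual code; the socket path, threshold, window,
and kill policy are all placeholders):

#!/usr/bin/env python
# Sketch of an AVC-storm counter fed by the audisp af_unix plugin.
# Placeholders: SOCK_PATH must match the plugin's configured socket,
# and THRESHOLD/WINDOW are made-up numbers, not recommendations.

import os
import re
import signal
import socket
import time

SOCK_PATH = "/var/run/audispd_events"  # wherever the af_unix plugin listens
THRESHOLD = 500                        # AVCs per window before acting
WINDOW = 5.0                           # seconds per counting window

pid_re = re.compile(r"\bpid=(\d+)\b")

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(SOCK_PATH)
s.setblocking(False)                   # never block; losing samples is fine

counts = {}
window_start = time.time()
buf = b""

while True:
    try:
        chunk = s.recv(65536)
        if not chunk:
            break                      # audispd closed the socket
        buf += chunk
    except socket.error:
        time.sleep(0.1)                # nothing queued; don't spin
    lines = buf.split(b"\n")
    buf = lines.pop()                  # keep any partial record
    for line in lines:
        record = line.decode("utf-8", "replace")
        if "avc:" not in record:
            continue
        m = pid_re.search(record)
        if m:
            pid = int(m.group(1))
            counts[pid] = counts.get(pid, 0) + 1
    now = time.time()
    if now - window_start >= WINDOW:
        for pid, n in counts.items():
            if n >= THRESHOLD and pid > 1 and pid != os.getpid():
                try:
                    os.kill(pid, signal.SIGKILL)  # the "take action" step
                except OSError:
                    pass               # offender already exited
        counts.clear()
        window_start = now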
And while this may work on a single process sending thousands of AVCs in
a tight loop, it wouldn't work on one which gets respawned, unless I
look at the ppid or do something more clever.
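
One option I may try for that (again just a sketch; it leans on the
comm= field that AVC records carry) is to key the counter on the command
name instead of the raw pid, so respawned instances keep charging the
same counter:

import re

comm_re = re.compile(r'\bcomm="([^"]+)"')
pid_re = re.compile(r"\bpid=(\d+)\b")

def storm_key(record):
    """Charge AVC counts to the command name instead of the pid, so a
    respawning offender keeps accumulating against one counter."""
    m = comm_re.search(record)
    if m:
        return m.group(1)
    m = pid_re.search(record)          # fall back to the pid
    return "pid:" + m.group(1) if m else "pid:?"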
> What is the general source of the problem right now? Was it just that
> the app was doing something that policy didn't know it could do? Or
> were there attacks under way, with someone trying something bad? Or was
> it just an admin mistake where something didn't have the right label?
> Each of these has a different solution.
Mostly the first scenario you mention - that the 3rd-party application
hit an execution path we had not seen in testing. But of course it
doesn't have to be a 3rd-party app. Even ones we create can run amok
with AVCs if all code paths are not exercised under all data conditions.
Basically untestable in finite time by humans.
:)
Some things you never know the code will do. For example, in one
error-recovery case I believe some process (or a library it uses)
decides to go look at a different running process and figure out which
connections it has. It never gets an answer because, of course, policy
doesn't allow it to see the /proc details or some such thing; so it
generates AVCs and sits in a loop until it gets an answer (forever).
Or things which normally work fine on targeted-policy systems get
confused on MLS systems, because they cannot connect to their server
when they are invoked for a process running at a higher/lower/incomparable
MLS level. Then they retry a few million times or so...
Or a process decides to see which files it can access in a big data
store. Every file it cannot access, for MAC-level (MLS) reasons,
generates an AVC. A few hundred isn't a big deal; a few million is.
Funny things happen to systems when you subject them to the real world
and real users.
:)
> I think this is a complex problem and controls might be needed at
> several spots. I'd be open to hearing ideas on this too. I've also been
> wondering if the audit daemon might want to use control groups as a
> means of keeping itself scheduled on very busy systems. But I'd like to
> hear other people's thoughts.
I agree on the complexity. At the very least, though, I'd think it would
be good to add a syslog-like function whereby it can coalesce same-event
audits and then submit one event like "1000 similar events like this".
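
As a rough sketch of what I mean (assuming we treat records as "similar"
once the msg=audit(timestamp:serial) stamp is stripped; the
normalization rule is the part that would need real thought):

import re

# Strip the msg=audit(TIMESTAMP:SERIAL) stamp so otherwise-identical
# records compare equal; everything left defines "similar".
stamp_re = re.compile(r"msg=audit\([0-9.]+:[0-9]+\)")

def summarize(records):
    """Collapse runs of similar audit records, syslog-style: emit the
    first occurrence of each, then one '<n> similar events' line."""
    counts = {}
    firsts = []
    for rec in records:
        key = stamp_re.sub("msg=audit(...)", rec)
        if key not in counts:
            counts[key] = 0
            firsts.append((key, rec))
        counts[key] += 1
    out = []
    for key, first in firsts:
        out.append(first)
        if counts[key] > 1:
            out.append("%d similar events like the one above" % counts[key])
    return out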
Likely 1000 isn't even enough. At one point we were getting well over
1500 AVCs/second over a period of days. On a weekend, of course. :)
We actually were able to process that volume; I have no data on the
number of drops.
It tends to add right up: 1500/second is roughly 130 million events a
day. And this is just one sending host (there are others, but they are
not as busy). If I had several like it, the aggregating machine would be
overrun. And as processors/hardware get faster, I assume the peak AVC
rates will climb too.
In my case, the concern is that a valuable event will be dropped off the
queue because floods like the ones I described take all the resources.
Even though I have increased the audispd queue size and the priorities,
at some point saturation will inevitably occur.
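
(For reference, the knobs I mean live in /etc/audisp/audispd.conf; the
values below are illustrative, not my exact settings:)

# /etc/audisp/audispd.conf (excerpt)
q_depth = 1024            # event queue depth
overflow_action = syslog  # what to do if the queue fills anyway
priority_boost = 8        # extra scheduling priority for audispd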
Thanks again!
LCB
--
LC (Lenny) Bruzenak
lenny@magitekltd.com