LC Bruzenak wrote:
Has anyone been thinking about how to store/maintain the aggregated
audit data long-term?
In my setup, I will be sending data from several machines to one central
log host.
After a while, the volume of log data will grow large. With hundreds of
files, rotation takes more time and the audit-viewer "select source"
option becomes tedious. Most of my searches involve time/host/user.
Using the prelude plugin helps a lot, because it highlights what is
otherwise hidden in the data pool. But pulling a given record out of a
selection of log files isn't currently intuitive.
I would think we'd either put these into an RDB or structure them in a
time-based directory layout, something like year/month/week ... or
maybe something else entirely. I'm also thinking about ease of
backup/restore with incoming records. I'd hate to shut down all the
sending clients just to back up or restore my audit data, so that part
will need to operate asynchronously.
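To make the directory idea concrete, here is a rough Python sketch (the
base path and the week-of-month split are placeholders I made up, not a
proposal for the real layout):

    import os
    from datetime import datetime, timezone

    BASE = "/var/log/audit-central"   # hypothetical aggregation root

    def log_dir(host, ts=None):
        """Map a record timestamp to a year/month/week/host directory."""
        ts = ts or datetime.now(timezone.utc)
        week = (ts.day - 1) // 7 + 1          # week of month, 1..5
        return os.path.join(BASE, f"{ts.year:04d}", f"{ts.month:02d}",
                            f"week{week}", host)

    path = log_dir("node01")
    os.makedirs(path, exist_ok=True)   # new records land here; closed
                                       # weeks can be backed up without
                                       # stopping the senders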
Before striking out on my own I thought I'd ask the list and see if
there are any such plans already in the works.
Yes, we plan on addressing many of these issues in IPA, not just for
kernel audit data but for all log data (e.g. the Apache error log,
Kerberos access log, SMTP logs, etc.). The basic idea is that there
will be a central server which accepts log data from individual nodes.
The log data can be signed for authenticity and will be robustly
transported via AMQP with failover and guaranteed delivery. The log
data will be compressed. You can specify which logs you want collected,
their collection interval, and record-level filtering.
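None of this is final, but as a rough sketch of the transport idea,
something like the following (using the Python pika AMQP client; the
host, queue name, and signing key are made up for illustration) would
give durable, broker-confirmed delivery with the payload compressed and
HMAC-signed on the node:

    import gzip, hashlib, hmac, json
    import pika

    SHARED_KEY = b"per-node-secret"          # hypothetical signing key

    conn = pika.BlockingConnection(
        pika.ConnectionParameters(host="logserver.example.com"))
    ch = conn.channel()
    ch.queue_declare(queue="audit-logs", durable=True)  # survives restarts
    ch.confirm_delivery()                    # broker acks each publish

    record = {"node": "node01", "log": "audit.log",
              "data": "type=USER_LOGIN uid=500 res=success"}
    payload = gzip.compress(json.dumps(record).encode())
    sig = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()

    ch.basic_publish(
        exchange="",
        routing_key="audit-logs",
        body=payload,
        properties=pika.BasicProperties(
            delivery_mode=2,                 # persist message to disk
            headers={"signature": sig}))     # server verifies authenticity
    conn.close()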
Once on the server, the log metadata is entered into a "catalogue" (a
relational database) which, along with the metadata, stores where the
actual log data can be found on disk. The disk files will be optimized
for compression and access. The catalogue manager will be able to
reconstruct any portion of a log file (stream) from any node within a
time interval; this can be used for external analysis tools, compliance
reporting, etc. The catalogue will be capable of intelligently
archiving old log data and restoring it back into a "live catalogue".
This is what is planned for v2 of IPA, which is anticipated to be about
1 year from now.
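To give a feel for the catalogue idea (the schema and names below are
illustrative, not the actual design), the RDB would hold only metadata
plus the on-disk location, so reconstructing a stream for one node over
a time interval reduces to a simple query:

    import sqlite3

    db = sqlite3.connect("catalogue.db")
    db.execute("""CREATE TABLE IF NOT EXISTS log_chunks (
                      node       TEXT,
                      log_name   TEXT,    -- e.g. 'audit.log'
                      start_time INTEGER, -- epoch seconds covered
                      end_time   INTEGER,
                      path       TEXT     -- compressed chunk on disk
                  )""")

    def reconstruct(node, log_name, t0, t1):
        """Yield the on-disk chunks covering [t0, t1] for one node."""
        rows = db.execute(
            """SELECT path FROM log_chunks
               WHERE node = ? AND log_name = ?
                 AND end_time >= ? AND start_time <= ?
               ORDER BY start_time""",
            (node, log_name, t0, t1))
        for (path,) in rows:
            yield path   # caller decompresses and concatenates these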
In v3 of IPA, the audit catalogue will support search and reporting on
*all* the log data in the catalogue (not just audit.log, but all log
data). In v3, when data arrives at the catalogue it will be indexed for
fast search and retrieval. Search will be based on tokens and key/value
pairs and will accept constraints on nodes, time intervals, users, etc.
(Note that a relational database will NOT be used to support searching;
rather, searches will be performed via optimized reverse indexes on
textual tokens. The RDB will be used only for managing the collection
of log files.)
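As a toy illustration of token-based search over a reverse index (this
is not the planned implementation, just the shape of the idea):

    from collections import defaultdict

    index = defaultdict(set)   # token -> set of record ids
    records = {}               # record id -> (node, time, text)

    def add_record(rec_id, node, when, text):
        records[rec_id] = (node, when, text)
        for token in text.split():       # key=value tokens stay intact
            index[token].add(rec_id)

    def search(tokens, node=None, t0=None, t1=None):
        # Intersect the posting sets, then apply node/time constraints.
        hits = set.intersection(*(index[t] for t in tokens))
        for rec_id in sorted(hits):
            n, when, text = records[rec_id]
            if node is not None and n != node:
                continue
            if t0 is not None and when < t0:
                continue
            if t1 is not None and when > t1:
                continue
            yield rec_id, text

    add_record(1, "node01", 1000, "type=USER_LOGIN uid=500 res=success")
    add_record(2, "node02", 2000, "type=USER_LOGIN uid=500 res=failed")
    print(list(search(["type=USER_LOGIN", "uid=500"], node="node01")))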
A note about vocabulary: in "IPA land", when we say "audit data", an
"audit catalogue", or "audit search", the term "audit" refers to any
log data, of which kernel audit data is just one subset.
As a suggestion, the prewikka viewer seems like a workable model. I
realize that viewer is built around the IDS structure, but as an event
search tool it is pretty good and mostly complete. Having network access
to it is also a nice feature.
So right now I think that feeding the events into a DB and then using
a tool with the same capabilities as the prewikka viewer would be a
viable option. Others? Ideas?
Thanks in advance,
LCB.
--
John Dennis <jdennis@redhat.com>