Re: Kernel audit output is inconsistent, hard to parse

Wednesday, 30 January 2008

On Tue, 2008-01-29 at 17:56 -0500, John Dennis wrote:
...
 The format of audit messages from the kernel is a mess. The
 bottom line is one cannot parse the audit messages without special
 case knowledge of each audit message because the data formatting does
 not follow any regular rules.

 I don't know how it got this way, but it really needs to be fixed.

 The primary offense is string formatting, specifically the use or
 non-use of the functions audit_log_hex() audit_log_untrustedstring(),
 and audit_log_n_untrustedstring(); depending on circumstances.

 The net result is a field value might be one of the following cases:

 1) a string without quotes (maybe a string, maybe an int, etc.)

 2) a string enclosed in quotes (implies a string with no escaped chars)

 3) a string which is represented as a sequence of hex values (not
     enclosed in quotes, but how do you distinguish this from case 1?) 
I'm certainly not going to argue this isn't problomatic.  I believe that
case #1 should go away.  Maybe when I get back from Chile I'll get a
chance to look at that.  Some fields just aren't ever strings, so I
don't see a need to waste time/space writing " around something I know
is always an int.  But if the field can be a string it should ALWAYS be
either case 2 or 3.
...

 Given the name=value formatting it is absolutely impossible to
 correctly interpret the value component unless you know how
 audit_log_format was invoked to generate the name=value pair. This
 will be dependent on the kernel version, the field name, and the audit
 record type. To be specific, during parsing only case 2 is
 unambiguous. You cannot determine between case 1 and case
 3. Heuristics based on each character being in the hexadecimal
 character set fail for a significant subset of data, thus you don't
 know if the value is a string encoded in hexadecimal which needs to be
 decoded or a string which happens to be composed of hexadecimal
 characters but is not encoded. 
It "shouldn't" depend on the kernel version, but I'm sure it does. 
I'm
going to work on eliminating case 1, which makes a lot of this mute.

...
 The answer is to make the output parsable without special case
 knowledge. It would appear many of these problems were introduced with
 the functions audit_log_hex() audit_log_untrustedstring(), and
 audit_log_n_untrustedstring() which attempt to correct for a double
 quote, white space, or non-printable character in the output
 string. However these are not used uniformly nor do they follow any
 common approach for string representations in user land (why not?). 
What approach would you like when I try to audit binary data?  You want
me to start adding \x before every byte?  Forget it, waste of space.

...
 All field values without exception need to be enclosed in quotes to
 delimit the value. Special characters inside the quotes need to
 be escaped, following some standard convention. Please, lets not
 invent a new encoding, this problem has already been solved
 elsewhere many times before! 
I don't want to waste time/effort having to write res="1" when I can
just say res=1.  I know you are complaining that you have to look at
message type, field name, and kernel version.  My opinion, you should
just have to look at field name.  We know res!="happy go lucky" so lets
just print that int and move along.  I certainly agree though that res
needs to be consistant across all message types.
...

 Also note the function audit_log_n_untrustedstring() in audit.c has a
 bug and ignores the len parameter (it iterates till it finds a NULL
 terminator even though it's supposed to stop after n chars). 
Check the audit tree, I pretty much rewrote all of these and will be in
linus's tree as soon as al pushes them...

...
 Suggested Fix:
 --------------

 Most of these problems can easily be fixed if there is exactly one
 central place to format an audit field value. The function
 audit_log_vformat() could very easily ensure consistent formatting via
 % format specifiers in the format string, e.g.:

      audit_log_format("n=%d path=%s", n, path);

 Building audit output piecewise would be deprecated, e.g. these types
 of sequences would be eliminated.

      audit_log_format(ab, " n=%d", n);
      audit_log_format(ab, " name=");
      audit_log_foo();

 and replaced with:

      audit_log_format(ab, " n=%d name=%s", foo_to_string(foo));

 Whenever audit_log_vformat() encounters a % format specifier it
 formats to a string, then it converts the string to an escaped quoted
 string, and then inserts the escaped quoted string into the buffer
 (e.g. n="123" name="foo bar\n" )

 This way the formatting is consistent, easy to apply, and is never
 special cased by the caller.

 There are no performance penalties of any note, calling a routine to
 escape only needs to be done when the format specifier is %s. Currently
 this is already done for a subset of output strings, so all we're doing
 is removing the responsibility for escaping from the caller and doing
 it consistently instead of in a subset of cases. 
I've got no problems eliminating all of the %s in the code and making
everything go through audit_log_string()
...

 I don't really care what the encoding is. I only care that it is an
 encoding with wide support. Backslash quoting is very popular,
 familiar and has many implementations. The MIME quoted-printable
 transfer encoding would be another option but might pose some problems
 with line endings. I think backslash quoting would be a good choice. I
 suspect everyone reading this message already knows exactly how to
 interpret a string with backslash escapes. 
My take, if the field name is something that can be a string you either
get field="string" or you get field=737472696e67.  Either way its easy
to recognize.  Quotes means string, no quotes means hex.

...
 Auparse is not the answer:
 --------------------------

 Auparse is not the answer to irregular kernel audit message
 formatting. First of all it forces auparse to have special case logic
 which is not 100% robust and is tied to the kernel source code
 version. 
Like I said, I wish this wasn't true.  Do you have an example so I can
learn from past mistakes?

...
 --------------

 If we do fix the format of audit messages we might as well fix some
 other inconsistencies at the same time.

 1) The initial part of AVC messages do not follow the standard
     name=value formatting used everywhere else in audit.

     a) It includes the string "avc:" which is redundant with the audit
     record type (e.g. type=AVC), the string "avc:" should be removed,
     it serves no purpose and only makes parsing much harder because of
     the inconsistency.

     b) denied|granted are bare words without a field name, it should be
     seresult="denied", once again to avoid special case parsing.

     c) The list of operations are enclosed in curly braces {} without a
     field name, this should be seperms=xxx, where xxx is a list. The
     use of curly braces to encode a list in audit data is unique. We
     should define how any audit message should encode a list of values
     and use that consistently for all audit data. While one could
     define a syntax such as "[value1, value2]" or some such, it might
     be informative to look at how other transfer mechanisms such as
     structured markup and ldap handle this case. They both utilize the
     concept of multi-valued attributes. Thus there is no list
     structure, but an attribute is allowed to repeat itself and in the
     process implicitly creates a list of values for the attribute. Thus
     {read write} might be represented as seperms="read"
     seperms="write". This regularity makes parsing much easier, it
     avoids special case syntax. 
While I have no problem with changing this, I'm not going to rewrite all
of the documentation on how to read selinux messages nor am I going to
rewrite all of the tools that parse/use selinux messages.  SELinux is a
user of the audit system and in my opinion should get to use it wants to
use it.  SELinux has had this syntax long before audit was a twinkle in
Rik Faith's eye.  What are you going to do when 'unknown kernel module
X' decides to use the audit subsystem and they don't follow your
name/value pair rules?

Audit tools needs to learn to deal with messages that are not from the
audit system (external audit users like selinux and unknown kernel
modules).  Most likely by seeing the type=field and realizing this isn't
from audit so I may not understand.  Tools that parse selinux messages
need to fully parse and understand selinux messages.  Tools that parse
audit messages need to fully parse and understand audit messages.  I
think this is 2 different operations.

...
 The formatting of name/value pairs in the kernel must be fixed, it
is
 simply impossible to correctly parse in it's current state. 
name/value pairs from the audit system itself I completely agree with.
Trying to force an external user of the audit system into the 'audit
way' doesn't make sense.  Why is the audit system trying to parse those
message anyway?  Its not actually a message from the audit system, its a
message from selinux.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

Re: Kernel audit output is inconsistent, hard to parse