On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <rgb(a)redhat.com> wrote:
Hi David,
I wanted to respond to this thread to attempt some constructive feedback,
better late than never. I had a look at your fsopen/fsmount() patchset(s) to
support this patchset which was interesting, but doesn't directly affect my
work. The primary patch of interest to the audit kernel folks (Paul Moore and
me) is this patch while the rest of the patchset is interesting, but not likely
to directly affect us. This patch has most of what we need to solve our
problem.
Paul and I agree that audit is going to have a difficult time identifying
containers or even namespaces without some change to the kernel. The audit
subsystem in the kernel needs at least a basic clue about which container
caused an event to be able to report this at the appropriate level and ignore
it at other levels to avoid a DoS.
While there is some increased risk of "death by audit", this is really
only an issue once we start supporting multiple audit daemons; simply
associating auditable events with the container that triggered them
shouldn't add any additional overhead (I hope). For a number of use
cases, a single auditd running outside the containers, but recording
all their events with some type of container attribution will be
sufficient. This is step #1.
However, we will obviously want to go a bit further and support
multiple audit daemons on the system to allow containers to
record/process their own events (side note: the non-container auditd
instance will still see all the events). There are a number of ways
we could tackle this, both via in-kernel and in-userspace record
routing, each with their own pros/cons. However, how this works is
going to be dependent on how we identify containers and track their
audit events: the bits from step #1. For this reason I'm not really
interested in worrying about the multiple auditd problem just yet;
it's obviously important, and something to keep in mind while working
up a solution, but it isn't something we should focus on right now.
We also agree that there will need to be some sort of trigger from
userspace to
indicate the creation of a container and its allocated resources and we're not
really picky how that is done, such as a clone flag, a syscall or a sysfs write
(or even a read, I suppose), but there will need to be some permission
restrictions, obviously. (I'd like to see capabilities used for this by adding
a specific container bit to the capabilities bitmask.)
To be clear, from an audit perspective I think the only thing we would
really care about controlling access to is the creation and assignment
of a new audit container ID/token, not necessarily the container
itself. It's a small point, but an important one I think.
I doubt we will be able to accomodate all definitions or concepts of
a
container in a timely fashion. We'll need to start somewhere with a minimum
definition so that we can get traction and actually move forward before another
compelling shared kernel microservice method leaves our entire community
behind. I'd like to declare that a container is a full set of cloned
namespaces, but this is inefficient, overly constricting and unnecessary for
our needs. If we could agree on a minimum definition of a container (which may
have only one specific cloned namespace) then we have something on which to
build. I could even see a container being defined by a trigger sent from
userspace about a process (task) from which all its children are considered to
be within that container, subject to further nesting.
I really would prefer if we could avoid defining the term "container".
Even if we manage to get it right at this particular moment, we will
surely be made fools a year or two from now when things change. At
the very least lets avoid a rigid definition of container, I'll
concede that we will probably need to have some definition simply so
we can implement something, I just don't want the design or
implementation to depend on a particular definition.
This comment is jumping ahead a bit, but from an audit perspective I
think we handle this by emitting an audit record whenever a container
ID is created which describes it as the kernel sees it; as of now that
probably means a list of namespace IDs. Richard mentions this in his
email, I just wanted to make it clear that I think we should see this
as a flexible mechanism. At the very least we will likely see a few
more namespaces before the world moves on from containers.
In the simplest usable model for audit, if a container (definition
implies and)
starts a PID namespace, then the container ID could simply be the container's
"init" process PID in the initial PID namespace. This assumes that as soon as
that process vanishes, that entire container and all its children are killed
off (which you've done). There may be some container orchestration systems
that don't use a unique PID namespace per container and that imposing this will
cause them challenges.
I don't follow how this would cause challenges if the containers do
not use a unique PID namespace; you are suggesting using the PID from
in the context of the initial PID namespace, yes?
Regardless, I do worry that using a PID could potentially be a bit
racy once we start jumping between kernel and userspace (audit
configuration, logs, etc.).
If containers have at minimum a unique mount namespace then the root
path
dentry inode device and inode number could be used, but there are likely better
identifiers. Again, there may be container orchestrators that don't use a
unique mount namespace per container and that imposing this will cause
challenges.
I expect there are similar examples for each of the other namespaces.
The PID case is a bit unique as each process is going to have a unique
PID regardless of namespaces, but even that has some drawbacks as
discussed above. As for the other namespaces, I agree that we can't
rely on them (see my earlier comments).
If we could pick one namespace type for consensus for which each
container has
a unique instance of that namespace, we could use the dev/ino tuple from that
namespace as had originally been suggested by Aristeu Rozanski more than 4
years ago as part of the set of namespace IDs. I had also attempted to
solve this problem by using the namespace' proc inode, then switched over to
generate a unique kernel serial number for each namespace and then went back to
namespace proc dev/ino once Al Viro implemented nsfs:
v1
https://lkml.org/lkml/2014/4/22/662
v2
https://lkml.org/lkml/2014/5/9/637
v3
https://lkml.org/lkml/2014/5/20/287
v4
https://lkml.org/lkml/2014/8/20/844
v5
https://lkml.org/lkml/2014/10/6/25
v6
https://lkml.org/lkml/2015/4/17/48
v7
https://lkml.org/lkml/2015/5/12/773
These patches don't use a container ID, but track all namespaces in use for an
event. This has the benefit of punting this tracking to userspace for some
other tool to analyse and determine to which container an event belongs.
This will use a lot of bandwidth in audit log files when a single
container ID that doesn't require nesting information to be complete
would be a much more efficient use of audit log bandwidth.
Relying on a particular namespace to identify a containers is a
non-starter from my perspective for all the reasons previously
discussed.
If we rely only on the setting of arbitrary container names from
userspace,
then we must provide a map or tree back to the initial audit domain for that
running kernel to be able to differentiate between potentially identical
container names assigned in a nested container system. If we assign a
container serial number sequentially (atomic64_inc) from the kernel on request
from userspace like the sessionID and log the creation with all nsIDs and the
parent container serial number and/or container name, the nesting is clear due
to lack of ambiguity in potential duplicate names in nesting. If a container
serial number is used, the tree of inheritance of nested containers can be
rebuilt from the audit records showing what containers were spawned from what
parent.
I believe we are going to need a container ID to container definition
(namespace, etc.) mapping mechanism regardless of if the container ID
is provided by userspace or a kernel generated serial number. This
mapping should be recorded in the audit log when the container ID is
created/defined.
As was suggested in one of the previous threads, if there are any
events not
associated with a task (incoming network packets) we log the namespace ID and
then only concern ourselves with its container serial number or container name
once it becomes associated with a task at which point that tracking will be
more important anyways.
Agreed. After all, a single namespace can be shared between multiple
containers. For those security officers who need to track individual
events like this they will have the container ID mapping information
in the logs as well so they should be able to trace the unassociated
event to a set of containers.
I'm not convinced that a userspace or kernel generated UUID is
that useful
since they are large, not human readable and may not be globally unique given
the "pets vs cattle" direction we are going with potentially identical
conditions in hosts or containers spawning containers, but I see no need to
restrict them.
From a kernel perspective I think an int should suffice; after all,
you can't have more containers then you have processes. If the
container engine requires something more complex, it can use the int
as input to its own mapping function.
How do we deal with setns()? Once it is determined that action is
permitted,
given the new combinaiton of namespaces and potential membership in a different
container, record the transition from one container to another including all
namespaces if the latter are a different subset than the target container
initial set.
That is a fun one, isn't it? I think this is where the container
ID-to-definition mapping comes into play. If setns() changes the
process such that the existing container ID is no longer valid then we
need to do a new lookup in the table to see if another container ID is
valid; if no established container ID mappings are valid, the
container ID becomes "undefined".
--
paul moore
www.paul-moore.com