Re: [PATCH 2/9] Implement containers as kernel objects

Wednesday, 16 August 2017

On Mon, Aug 14, 2017 at 1:47 AM, Richard Guy Briggs <rgb(a)redhat.com> wrote:
...
 Hi David,

 I wanted to respond to this thread to attempt some constructive feedback,
 better late than never.  I had a look at your fsopen/fsmount() patchset(s) to
 support this patchset which was interesting, but doesn't directly affect my
 work.  The primary patch of interest to the audit kernel folks (Paul Moore and
 me) is this patch while the rest of the patchset is interesting, but not likely
 to directly affect us.  This patch has most of what we need to solve our
 problem.

 Paul and I agree that audit is going to have a difficult time identifying
 containers or even namespaces without some change to the kernel.  The audit
 subsystem in the kernel needs at least a basic clue about which container
 caused an event to be able to report this at the appropriate level and ignore
 it at other levels to avoid a DoS. 
While there is some increased risk of "death by audit", this is really
only an issue once we start supporting multiple audit daemons; simply
associating auditable events with the container that triggered them
shouldn't add any additional overhead (I hope).  For a number of use
cases, a single auditd running outside the containers, but recording
all their events with some type of container attribution will be
sufficient.  This is step #1.

However, we will obviously want to go a bit further and support
multiple audit daemons on the system to allow containers to
record/process their own events (side note: the non-container auditd
instance will still see all the events).  There are a number of ways
we could tackle this, both via in-kernel and in-userspace record
routing, each with their own pros/cons.  However, how this works is
going to be dependent on how we identify containers and track their
audit events: the bits from step #1.  For this reason I'm not really
interested in worrying about the multiple auditd problem just yet;
it's obviously important, and something to keep in mind while working
up a solution, but it isn't something we should focus on right now.

...
 We also agree that there will need to be some sort of trigger from userspace to
 indicate the creation of a container and its allocated resources and we're not
 really picky how that is done, such as a clone flag, a syscall or a sysfs write
 (or even a read, I suppose), but there will need to be some permission
 restrictions, obviously.  (I'd like to see capabilities used for this by adding
 a specific container bit to the capabilities bitmask.) 
To be clear, from an audit perspective I think the only thing we would
really care about controlling access to is the creation and assignment
of a new audit container ID/token, not necessarily the container
itself.  It's a small point, but an important one I think.

...
 I doubt we will be able to accommodate all definitions or concepts of a
 container in a timely fashion.  We'll need to start somewhere with a minimum
 definition so that we can get traction and actually move forward before another
 compelling shared kernel microservice method leaves our entire community
 behind.  I'd like to declare that a container is a full set of cloned
 namespaces, but this is inefficient, overly constricting and unnecessary for
 our needs.  If we could agree on a minimum definition of a container (which may
 have only one specific cloned namespace) then we have something on which to
 build.  I could even see a container being defined by a trigger sent from
 userspace about a process (task) from which all its children are considered to
 be within that container, subject to further nesting. 
I really would prefer if we could avoid defining the term "container".
Even if we manage to get it right at this particular moment, we will
surely be made fools a year or two from now when things change.  At
the very least let's avoid a rigid definition of container.  I'll
concede that we will probably need to have some definition simply so
we can implement something; I just don't want the design or
implementation to depend on a particular definition.

This comment is jumping ahead a bit, but from an audit perspective I
think we handle this by emitting an audit record whenever a container
ID is created which describes it as the kernel sees it; as of now that
probably means a list of namespace IDs.  Richard mentions this in his
email, I just wanted to make it clear that I think we should see this
as a flexible mechanism.  At the very least we will likely see a few
more namespaces before the world moves on from containers.

...
 In the simplest usable model for audit, if a container (definition implies and)
 starts a PID namespace, then the container ID could simply be the container's
 "init" process PID in the initial PID namespace.  This assumes that as soon as
 that process vanishes, that entire container and all its children are killed
 off (which you've done).  There may be some container orchestration systems
 that don't use a unique PID namespace per container and that imposing this will
 cause them challenges. 
I don't follow how this would cause challenges if the containers do
not use a unique PID namespace; you are suggesting using the PID as
seen in the context of the initial PID namespace, yes?

Regardless, I do worry that using a PID could potentially be a bit
racy once we start jumping between kernel and userspace (audit
configuration, logs, etc.).

...
 If containers have at minimum a unique mount namespace then the root path
 dentry inode device and inode number could be used, but there are likely better
 identifiers.  Again, there may be container orchestrators that don't use a
 unique mount namespace per container and that imposing this will cause
 challenges.

 I expect there are similar examples for each of the other namespaces. 
The PID case is a bit unique as each process is going to have a unique
PID regardless of namespaces, but even that has some drawbacks as
discussed above.  As for the other namespaces, I agree that we can't
rely on them (see my earlier comments).

...
 If we could pick one namespace type for consensus for which each container has
 a unique instance of that namespace, we could use the dev/ino tuple from that
 namespace as had originally been suggested by Aristeu Rozanski more than 4
 years ago as part of the set of namespace IDs.  I had also attempted to
 solve this problem by using the namespace's proc inode, then switched over to
 generate a unique kernel serial number for each namespace and then went back to
 namespace proc dev/ino once Al Viro implemented nsfs:
         v1      https://lkml.org/lkml/2014/4/22/662
         v2      https://lkml.org/lkml/2014/5/9/637
         v3      https://lkml.org/lkml/2014/5/20/287
         v4      https://lkml.org/lkml/2014/8/20/844
         v5      https://lkml.org/lkml/2014/10/6/25
         v6      https://lkml.org/lkml/2015/4/17/48
         v7      https://lkml.org/lkml/2015/5/12/773

 These patches don't use a container ID, but track all namespaces in use for an
 event.  This has the benefit of punting this tracking to userspace for some
 other tool to analyse and determine to which container an event belongs.
 This will use a lot of bandwidth in audit log files when a single
 container ID that doesn't require nesting information to be complete
 would be a much more efficient use of audit log bandwidth. 
Relying on a particular namespace to identify a container is a
non-starter from my perspective for all the reasons previously
discussed.
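
For readers unfamiliar with the namespace dev/ino tuple mentioned above, here is a minimal userspace sketch of how it can be read on kernels that provide nsfs. It only illustrates the identifier and is not code from any of the patchsets. The names `ns_id` and `read_ns_id` are hypothetical, and the sketch assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical holder for a namespace identifier as a (dev, ino) tuple. */
struct ns_id {
    dev_t dev;
    ino_t ino;
};

/*
 * Fill *out with the dev/ino of /proc/self/ns/<type> (e.g. "mnt", "pid").
 * Returns 0 on success, -1 if the path does not fit or stat() fails.
 */
static int read_ns_id(const char *type, struct ns_id *out)
{
    char path[64];
    struct stat st;
    int n;

    assert(type != NULL && out != NULL);

    n = snprintf(path, sizeof(path), "/proc/self/ns/%s", type);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* stat() follows the magic link to the namespace's nsfs inode. */
    if (stat(path, &st) != 0)
        return -1;

    out->dev = st.st_dev;
    out->ino = st.st_ino;
    return 0;
}
```

Two tasks that share a namespace see the same tuple for it, which is what makes it usable as a namespace ID in a log record, and also why it cannot identify a container on its own.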

...
 If we rely only on the setting of arbitrary container names from userspace,
 then we must provide a map or tree back to the initial audit domain for that
 running kernel to be able to differentiate between potentially identical
 container names assigned in a nested container system.  If we assign a
 container serial number sequentially (atomic64_inc) from the kernel on request
 from userspace like the sessionID and log the creation with all nsIDs and the
 parent container serial number and/or container name, the nesting is clear due
 to lack of ambiguity in potential duplicate names in nesting.  If a container
 serial number is used, the tree of inheritance of nested containers can be
 rebuilt from the audit records showing what containers were spawned from what
 parent. 
I believe we are going to need a container ID to container definition
(namespace, etc.) mapping mechanism regardless of whether the container
ID is provided by userspace or is a kernel generated serial number.
This mapping should be recorded in the audit log when the container ID
is created/defined.
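
As a rough illustration of the two ideas above (a sequentially allocated serial number and a creation record that maps the ID to its definition), here is a userspace sketch. Nothing here is taken from the patchset: the struct layout, the function names, the `contid=`/`parent=`/`ns=` field names and the use of 0 for "no parent" are all assumptions made for the example, and C11 `atomic_fetch_add` stands in for the kernel's atomic64 helpers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_NS_PER_RECORD 8   /* hypothetical cap on namespaces per record */

/* Hypothetical container definition as the kernel might see it. */
struct container_def {
    uint64_t id;                        /* serial number assigned at creation */
    uint64_t parent_id;                 /* 0 = no parent (assumption) */
    size_t   nr_ns;                     /* number of valid entries in ns_ino */
    uint64_t ns_ino[MAX_NS_PER_RECORD]; /* namespace inode numbers */
};

static _Atomic uint64_t next_container_id = 1;

/* Userspace stand-in for an atomic64_inc-style allocation in the kernel. */
static uint64_t alloc_container_id(void)
{
    return atomic_fetch_add(&next_container_id, 1);
}

/*
 * Format a made-up creation record into buf.
 * Returns the record length, or -1 on error or truncation.
 */
static int format_container_record(const struct container_def *c,
                                   char *buf, size_t len)
{
    int off;

    assert(c != NULL && buf != NULL && c->nr_ns <= MAX_NS_PER_RECORD);

    off = snprintf(buf, len, "contid=%" PRIu64 " parent=%" PRIu64 " ns=",
                   c->id, c->parent_id);
    for (size_t i = 0; i < c->nr_ns; i++) {
        if (off < 0 || (size_t)off >= len)
            return -1;
        off += snprintf(buf + off, len - (size_t)off, "%s%" PRIu64,
                        i ? "," : "", c->ns_ino[i]);
    }
    if (off < 0 || (size_t)off >= len)
        return -1;
    return off;
}
```

Because every record carries the parent's serial number, a log reader could rebuild the nesting tree from the creation records alone, without relying on container names being unique.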

...
 As was suggested in one of the previous threads, if there are any events not
 associated with a task (incoming network packets) we log the namespace ID and
 then only concern ourselves with its container serial number or container name
 once it becomes associated with a task at which point that tracking will be
 more important anyways. 
Agreed.  After all, a single namespace can be shared between multiple
containers.  Security officers who need to track individual events
like this will have the container ID mapping information in the logs
as well, so they should be able to trace the unassociated event to a
set of containers.
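
The tracing step described above could look roughly like this on the log analysis side: given the namespace ID logged with an unassociated event, collect every container whose recorded definition includes that namespace. The table layout and all names are hypothetical, invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NS_PER_ENTRY 4   /* hypothetical cap for this sketch */

/* Hypothetical in-memory copy of the ID-to-definition records in the log. */
struct container_entry {
    uint64_t id;
    size_t   nr_ns;
    uint64_t ns[MAX_NS_PER_ENTRY];
};

/*
 * Store up to out_len IDs of containers that use namespace ns_id in out.
 * Returns the total number of matching containers, which may exceed out_len.
 */
static size_t containers_using_ns(const struct container_entry *tbl, size_t n,
                                  uint64_t ns_id, uint64_t *out, size_t out_len)
{
    size_t found = 0;

    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < tbl[i].nr_ns && j < MAX_NS_PER_ENTRY; j++) {
            if (tbl[i].ns[j] == ns_id) {
                if (found < out_len)
                    out[found] = tbl[i].id;
                found++;
                break;   /* count each container once */
            }
        }
    }
    return found;
}
```

The result is a set rather than a single ID, which matches the point that a namespace can be shared between multiple containers.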

...
 I'm not convinced that a userspace or kernel generated UUID is that useful
 since they are large, not human readable and may not be globally unique given
 the "pets vs cattle" direction we are going with potentially identical
 conditions in hosts or containers spawning containers, but I see no need to
 restrict them. 
...
From a kernel perspective I think an int should suffice; after all,
you can't have more containers than you have processes.  If the
container engine requires something more complex, it can use the int
as input to its own mapping function.

...
 How do we deal with setns()?  Once it is determined that action is permitted,
 given the new combination of namespaces and potential membership in a different
 container, record the transition from one container to another including all
 namespaces if the latter are a different subset than the target container
 initial set. 
That is a fun one, isn't it?  I think this is where the container
ID-to-definition mapping comes into play.  If setns() changes the
process such that the existing container ID is no longer valid then we
need to do a new lookup in the table to see if another container ID is
valid; if no established container ID mappings are valid, the
container ID becomes "undefined".
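
A minimal sketch of that lookup, written as ordinary userspace C so it can be tried in isolation. Everything here is an assumption for illustration: the fixed set of three namespace types, the use of 0 in a definition to mean "this namespace is not part of the definition", the value chosen for the "undefined" ID, and the first-match policy when several definitions are valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NS_TYPES 3           /* hypothetical: e.g. mnt, pid, net */
#define CONTID_UNDEFINED 0   /* hypothetical "undefined" container ID */

/* Hypothetical ID-to-definition mapping entry. */
struct container_map {
    uint64_t id;
    uint64_t ns[NS_TYPES];   /* 0 = namespace type not part of the definition */
};

/* Return 1 if the task's namespaces satisfy the container definition. */
static int definition_matches(const struct container_map *m,
                              const uint64_t task_ns[NS_TYPES])
{
    for (size_t i = 0; i < NS_TYPES; i++) {
        if (m->ns[i] != 0 && m->ns[i] != task_ns[i])
            return 0;
    }
    return 1;
}

/*
 * Decide the container ID after a permitted setns(): keep the current ID
 * if it is still valid, otherwise take the first valid mapping, otherwise
 * fall back to "undefined".
 */
static uint64_t contid_after_setns(const struct container_map *tbl, size_t n,
                                   uint64_t cur_id,
                                   const uint64_t task_ns[NS_TYPES])
{
    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].id == cur_id && definition_matches(&tbl[i], task_ns))
            return cur_id;
    }
    for (size_t i = 0; i < n; i++) {
        if (definition_matches(&tbl[i], task_ns))
            return tbl[i].id;
    }
    return CONTID_UNDEFINED;
}
```

A real implementation would also have to decide what to do when more than one established mapping is valid; the sketch simply takes the first.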

-- 
paul moore
www.paul-moore.com
