Re: RFC: Audit Kernel Container IDs

Wednesday, 13 September 2017

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:
...
 Containers are a userspace concept.  The kernel knows nothing of
 them.
I am looking at this RFC from a userspace perspective, particularly from
the loader's point of view and the unshare syscall and the semantics that
arise from the use of it.

At a high level what you are doing is providing a way to group, without
hierarchy, processes and namespaces. The processes can move between
containers if they have CAP_CONTAINER_ADMIN and can open and write to
a special proc file.

* With unshare a thread may dissociate part of its execution context and
  therefore see a distinct mount namespace. When you say "process" in this
  particular RFC do you exclude the fact that a thread might be in a
  distinct container from the rest of the threads in the process?
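
A minimal sketch of how the thread case could be observed from userspace,
assuming only that stat(2) on the /proc/[pid]/task/[tid]/ns/ entries
yields the device and inode number of each thread's namespace. The
function name and the proc_root parameter are illustrative assumptions,
not part of the RFC.

```python
import os


def thread_namespace_ids(ns="mnt", pid=None, proc_root="/proc"):
    """Map each thread ID of a process to the (st_dev, st_ino) tuple
    identifying the namespace of type `ns` that the thread occupies."""
    if pid is None:
        pid = os.getpid()
    task_dir = os.path.join(proc_root, str(pid), "task")
    ids = {}
    for tid in os.listdir(task_dir):
        try:
            # os.stat() follows the symlink to the namespace inode itself.
            st = os.stat(os.path.join(task_dir, tid, "ns", ns))
        except FileNotFoundError:
            continue  # the thread exited between listdir() and stat()
        ids[int(tid)] = (st.st_dev, st.st_ino)
    return ids


def threads_share_namespace(ids):
    """True when every thread in `ids` reports the same namespace."""
    return len(set(ids.values())) <= 1
```

If one thread has called unshare(CLONE_NEWNS), threads_share_namespace()
returns False for that process, which is the situation the question
above is about.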

...
 The Linux audit system needs a way to be able to track the container
 provenance of events and actions.  Audit needs the kernel's help to do
 this. 
* Why does the Linux audit system need to track container provenance?

  - How does it help to provide better audit messages?

  - Would it be enough to list the namespaces that a process occupies?

* Why does it need the kernel's help?

  - Is there a race condition that is only fixable with kernel support?

  - Or is it easier with kernel help but not required?

Providing background on these questions would help clarify the
design requirements.
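
For reference, listing the namespaces a process occupies is already
possible from userspace without new kernel support. A minimal sketch,
assuming the usual /proc/[pid]/ns/ layout; the function names and the
"name=dev:ino" output format are illustrative assumptions and not an
audit record format.

```python
import os


def namespace_ids(pid="self", proc_root="/proc"):
    """Return {namespace name: (st_dev, st_ino)} for one process."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    ids = {}
    for name in sorted(os.listdir(ns_dir)):
        # stat() follows the symlink, so this is the nsfs device and inode.
        st = os.stat(os.path.join(ns_dir, name))
        ids[name] = (st.st_dev, st.st_ino)
    return ids


def format_namespace_ids(ids):
    """Render the tuples as 'name=dev:ino' pairs (hypothetical format)."""
    return " ".join(
        f"{name}={dev:x}:{ino}" for name, (dev, ino) in sorted(ids.items())
    )
```

What this cannot do is avoid the gap between an event occurring and
userspace reading /proc, which is why the race question above matters.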

...
 Since the concept of a container is entirely a userspace concept, a
 trigger signal from the userspace container orchestration system
 initiates this.  This will define a point in time and a set of resources
 associated with a particular container with an audit container ID. 
Please don't use the word 'signal'; I suggest 'register', since you are
writing to a filesystem.

...
 The trigger is a pseudo filesystem (proc, since PID tree already exists)
 write of a u64 representing the container ID to a file representing a
 process that will become the first process in a new container.
 This might place restrictions on mount namespaces required to define a
 container, or at least careful checking of namespaces in the kernel to
 verify permissions of the orchestrator so it can't change its own
 container ID.
 A bind mount of nsfs may be necessary in the container orchestrator's
 mntNS.

 Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo
 filesystem to have this action permitted.  At that time, record the
 child container's user-supplied 64-bit container identifier along with 
What is a "child container?" Containers don't have any hierarchy.

I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents
your continued operation as we have today?
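
A minimal sketch of what registration might look like from the
orchestrator's side. The RFC does not name the proc file or say how the
u64 is encoded, so the caller-supplied path and the decimal ASCII
encoding below are both assumptions, as is the function name.

```python
import os

U64_MAX = 2**64 - 1


def register_container_id(path, container_id):
    """Write a u64 container ID to a pseudo-filesystem file.

    `path` is hypothetical: the RFC only says it is a proc file
    representing the target process. Decimal ASCII is also assumed.
    """
    if not isinstance(container_id, int) or not 0 <= container_id <= U64_MAX:
        raise ValueError("container ID must fit in an unsigned 64-bit integer")
    # Any error from open() or write() is raised to the caller as OSError,
    # so a refused write is visible rather than silently ignored.
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, str(container_id).encode("ascii"))
    finally:
        os.close(fd)
```

A caller that cannot register would see an OSError here and could carry
on unregistered, which is the behaviour I am asking about above.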

...
 the child container's first process (which may become the container's
 "init" process) process ID (referenced from the initial PID namespace),
 all namespace IDs (in the form of a nsfs device number and inode number
 tuple) in a new auxiliary record AUDIT_CONTAINER with a qualifying
 op=$action field. 
What kind of requirement is there on the first tid/pid registering
the container ID? What if the 8th tid/pid does the registration?
Would that mean that the first process of the container did not
register? It seems like you are suggesting that the registration
by the 8th tid/pid causes a cascading registration process,
registering all tid/pids in the same grouping? Is that true?

...
 Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
 container ID present on an auditable action or event.

 Forked and cloned processes inherit their parent's container ID,
 referenced in the process' audit_context struct. 
So a cloned process with CLONE_NEWNS has the same container ID
as the parent process that called clone, at least until the clone
has time to change to a new container ID?

Do you foresee any case where someone might need a semantic that is
slightly different? For example wanting to set the container ID on
clone?
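
A toy model of the inheritance semantics as I read them, to show the
window I am asking about. Everything here (ToyKernel, Task, the event
tuples) is a hypothetical userspace simulation, not kernel code and not
an audit record format.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    pid: int
    container_id: Optional[int] = None


class ToyKernel:
    """Simulates container ID inheritance and records (op, pid, id) events."""

    def __init__(self):
        self._next_pid = itertools.count(1)
        self.events = []

    def spawn(self, parent=None):
        # A forked or cloned task starts with its parent's container ID.
        inherited = parent.container_id if parent else None
        child = Task(next(self._next_pid), inherited)
        self._log("clone", child)
        return child

    def register(self, task, container_id):
        # Stands in for the write to the proc file by the orchestrator.
        task.container_id = container_id
        self._log("register", task)

    def act(self, task, op):
        # Any auditable action is tagged with the task's current ID.
        self._log(op, task)

    def _log(self, op, task):
        self.events.append((op, task.pid, task.container_id))


kernel = ToyKernel()
orchestrator = kernel.spawn()
kernel.register(orchestrator, 1)
child = kernel.spawn(orchestrator)
kernel.act(child, "open")   # attributed to container 1
kernel.register(child, 2)
kernel.act(child, "open")   # attributed to container 2
```

In this model the child's first "open" is logged against the parent's
container, because registration happens only after clone returns.
Setting the ID at clone time would close that window.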

...
 Log the creation of every namespace, inheriting/adding its spawning
 process' containerID(s), if applicable.  Include the spawning and
 spawned namespace IDs (device and inode number tuples).
 [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
 Note: At this point it appears only network namespaces may need to track
 container IDs apart from processes since incoming packets may cause an
 auditable event before being associated with a process. 
OK.

...
 Log the destruction of every namespace when it is no longer used by any
 process, include the namespace IDs (device and inode number tuples).
 [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

 Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
 the parent and child namespace IDs for any changes to a process'
 namespaces. [setns(2)]
 Note: It may be possible to combine AUDIT_NS_* record formats and
 distinguish them with an op=$action field depending on the fields
 required for each message type.

 A process can be moved from one container to another by using the
 container assignment method outlined above a second time. 
OK.

...
 When a container ceases to exist because the last process in that
 container has exited and hence the last namespace has been destroyed and
 its refcount dropping to zero, log the fact.
 (This latter is likely needed for certification accountability.)  A
 container object may need a list of processes and/or namespaces. 
OK.

...
 A namespace cannot directly migrate from one container to another but
 could be assigned to a newly spawned container.  A namespace can be
 moved from one container to another indirectly by having that namespace
 used in a second process in another container and then ending all the
 processes in the first container. 
OK.

...
 Feedback please. 
-- 
Cheers,
Carlos.
