RFC: Audit Kernel Container IDs

audit 2.7.8 released

ausearch --text : missing...

Richard Guy Briggs

Wednesday, 13 September 2017 Wed, 13 Sep '17

12:13 p.m.

Containers are a userspace concept. The kernel knows nothing of them. The Linux audit system needs a way to be able to track the container provenance of events and actions. Audit needs the kernel's help to do this. Since the concept of a container is entirely a userspace concept, a trigger signal from the userspace container orchestration system initiates this. This will define a point in time and a set of resources associated with a particular container with an audit container ID. The trigger is a pseudo filesystem (proc, since PID tree already exists) write of a u64 representing the container ID to a file representing a process that will become the first process in a new container. This might place restrictions on mount namespaces required to define a container, or at least careful checking of namespaces in the kernel to verify permissions of the orchestrator so it can't change its own container ID. A bind mount of nsfs may be necessary in the container orchestrator's mntNS. Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo filesystem to have this action permitted. At that time, record the child container's user-supplied 64-bit container identifier along with the child container's first process (which may become the container's "init" process) process ID (referenced from the initial PID namespace), all namespace IDs (in the form of a nsfs device number and inode number tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying op=$action field. Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid container ID present on an auditable action or event. Forked and cloned processes inherit their parent's container ID, referenced in the process' audit_context struct. Log the creation of every namespace, inheriting/adding its spawning process' containerID(s), if applicable. Include the spawning and spawned namespace IDs (device and inode number tuples). [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] Note: At this point it appears only network namespaces may need to track container IDs apart from processes since incoming packets may cause an auditable event before being associated with a process. Log the destruction of every namespace when it is no longer used by any process, include the namespace IDs (device and inode number tuples). [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) the parent and child namespace IDs for any changes to a process' namespaces. [setns(2)] Note: It may be possible to combine AUDIT_NS_* record formats and distinguish them with an op=$action field depending on the fields required for each message type. A process can be moved from one container to another by using the container assignment method outlined above a second time. When a container ceases to exist because the last process in that container has exited and hence the last namespace has been destroyed and its refcount dropping to zero, log the fact. (This latter is likely needed for certification accountability.) A container object may need a list of processes and/or namespaces. A namespace cannot directly migrate from one container to another but could be assigned to a newly spawned container. A namespace can be moved from one container to another indirectly by having that namespace used in a second process in another container and then ending all the processes in the first container. Feedback please. - RGB -- Richard Guy Briggs <rgb(a)redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635

Show replies by date

Carlos O'Donell

Wednesday, 13 September Wed, 13 Sep

2:33 p.m.

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote:

...

Containers are a userspace concept. The kernel knows nothing of them.

I am looking at this RFC from a userspace perspective, particularly from the loader's point of view and the unshare syscall and the semantics that arise from the use of it. At a high level what you are doing is providing a way to group, without hierarchy, processes and namespaces. The processes can move between container's if they have CAP_CONTAINER_ADMIN and can open and write to a special proc file. * With unshare a thread may dissociate part of its execution context and therefore see a distinct mount namespace. When you say "process" in this particular RFC do you exclude the fact that a thread might be in a distinct container from the rest of the threads in the process?

...

The Linux audit system needs a way to be able to track the container provenance of events and actions. Audit needs the kernel's help to do this.

* Why does the Linux audit system need to tracker container provenance? - How does it help to provide better audit messages? - Is it be enough to list the namespace that a process occupies? * Why does it need the kernel's help? - Is there a race condition that is only fixable with kernel support? - Or is it easier with kernel help but not required? Providing background on these questions would help clarify the design requirements.

...

Since the concept of a container is entirely a userspace concept, a trigger signal from the userspace container orchestration system initiates this. This will define a point in time and a set of resources associated with a particular container with an audit container ID.

Please don't use the word 'signal', I suggest 'register' since you are writing to a filesystem.

...

The trigger is a pseudo filesystem (proc, since PID tree already exists) write of a u64 representing the container ID to a file representing a process that will become the first process in a new container. This might place restrictions on mount namespaces required to define a container, or at least careful checking of namespaces in the kernel to verify permissions of the orchestrator so it can't change its own container ID. A bind mount of nsfs may be necessary in the container orchestrator's mntNS. Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo filesystem to have this action permitted. At that time, record the child container's user-supplied 64-bit container identifier along with

What is a "child container?" Containers don't have any hierarchy. I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents your continued operation as we have today?

...

the child container's first process (which may become the container's "init" process) process ID (referenced from the initial PID namespace), all namespace IDs (in the form of a nsfs device number and inode number tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying op=$action field.

What kind of requirement is there on the first tid/pid registering the container ID? What if the 8th tid/pid does the registration? Would that mean that the first process of the container did not register? It seems like you are suggesting that the registration by the 8th tid/pid causes a cascading registration progress, registering all tid/pids in the same grouping? Is that true?

...

Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid container ID present on an auditable action or event. Forked and cloned processes inherit their parent's container ID, referenced in the process' audit_context struct.

So a cloned process with CLONE_NEWNS has the came container ID as the parent process that called clone, at least until the clone has time to change to a new container ID? Do you forsee any case where someone might need a semantic that is slightly different? For example wanting to set the container ID on clone?

...

Log the creation of every namespace, inheriting/adding its spawning process' containerID(s), if applicable. Include the spawning and spawned namespace IDs (device and inode number tuples). [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] Note: At this point it appears only network namespaces may need to track container IDs apart from processes since incoming packets may cause an auditable event before being associated with a process.

OK.

...

Log the destruction of every namespace when it is no longer used by any process, include the namespace IDs (device and inode number tuples). [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) the parent and child namespace IDs for any changes to a process' namespaces. [setns(2)] Note: It may be possible to combine AUDIT_NS_* record formats and distinguish them with an op=$action field depending on the fields required for each message type. A process can be moved from one container to another by using the container assignment method outlined above a second time.

OK.

...

When a container ceases to exist because the last process in that container has exited and hence the last namespace has been destroyed and its refcount dropping to zero, log the fact. (This latter is likely needed for certification accountability.) A container object may need a list of processes and/or namespaces.

OK.

...

A namespace cannot directly migrate from one container to another but could be assigned to a newly spawned container. A namespace can be moved from one container to another indirectly by having that namespace used in a second process in another container and then ending all the processes in the first container.

OK.

...

Feedback please.

-- Cheers, Carlos.

Richard Guy Briggs

Thursday, 14 September Thu, 14 Sep

12:30 a.m.

On 2017-09-13 14:33, Carlos O'Donell wrote:

...

On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > Containers are a userspace concept. The kernel knows nothing of them. I am looking at this RFC from a userspace perspective, particularly from the loader's point of view and the unshare syscall and the semantics that arise from the use of it. At a high level what you are doing is providing a way to group, without hierarchy, processes and namespaces. The processes can move between container's if they have CAP_CONTAINER_ADMIN and can open and write to a special proc file. * With unshare a thread may dissociate part of its execution context and therefore see a distinct mount namespace. When you say "process" in this particular RFC do you exclude the fact that a thread might be in a distinct container from the rest of the threads in the process? > The Linux audit system needs a way to be able to track the container > provenance of events and actions. Audit needs the kernel's help to do > this. * Why does the Linux audit system need to tracker container provenance?

- ability to filter unwanted, irrelevant or unimportant messages before they fill queue so important messages don't get lost. This is a certification requirement. - ability to make security claims about containers, require tracking of actions within those containers to ensure compliance with established security policies. - ability to route messages from events to relevant audit daemon instance or host audit daemon instance or both, as required or determined by user-initiated rules

...

- How does it help to provide better audit messages? - Is it be enough to list the namespace that a process occupies?

We started with that approach back more than 4 years ago and found it helped, but didn't go far enough in terms of quick and inexpensive record filtering and left some doubt about provenance of events in the case of non-user context events (incoming network packets).

...

* Why does it need the kernel's help? - Is there a race condition that is only fixable with kernel support?

This was a concern, but relatively minor compared with the other benefits.

...

- Or is it easier with kernel help but not required?

It is much easier and much less expensive.

...

Providing background on these questions would help clarify the design requirements.

Here are some references that should help provide some background: https://github.com/linux-audit/audit-kernel/issues/32 RFE: add namespace IDs to audit records https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualizati... SPEC Virtualization Manager Guest Lifecycle Events https://lwn.net/Articles/699819/ Audit, namespaces, and containers https://lwn.net/Articles/723561/ Containers as kernel objects (my reply, with references: https://lkml.org/lkml/2017/8/14/15 ) https://bugzilla.redhat.com/show_bug.cgi?id=1045666 audit: add namespace IDs to log records

...

> Since the concept of a container is entirely a userspace concept, a > trigger signal from the userspace container orchestration system > initiates this. This will define a point in time and a set of resources > associated with a particular container with an audit container ID. Please don't use the word 'signal', I suggest 'register' since you are writing to a filesystem.

Ok, that's a very reasonable request. 'signal' has a previous meaning.

...

> The trigger is a pseudo filesystem (proc, since PID tree already exists) > write of a u64 representing the container ID to a file representing a > process that will become the first process in a new container. > This might place restrictions on mount namespaces required to define a > container, or at least careful checking of namespaces in the kernel to > verify permissions of the orchestrator so it can't change its own > container ID. > A bind mount of nsfs may be necessary in the container orchestrator's > mntNS. > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > filesystem to have this action permitted. At that time, record the > child container's user-supplied 64-bit container identifier along with What is a "child container?" Containers don't have any hierarchy.

Maybe some don't, but that's not likely to last long given the abstraction and nesting of orchestration tools. This must be nestable.

...

I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents your continued operation as we have today?

Correct. It won't prevent processes that otherwise have permissions today from creating all the namespaces it wishes.

...

> the child container's first process (which may become the container's > "init" process) process ID (referenced from the initial PID namespace), > all namespace IDs (in the form of a nsfs device number and inode number > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying > op=$action field. What kind of requirement is there on the first tid/pid registering the container ID? What if the 8th tid/pid does the registration? Would that mean that the first process of the container did not register? It seems like you are suggesting that the registration by the 8th tid/pid causes a cascading registration progress, registering all tid/pids in the same grouping? Is that true?

Ah, good question, I forgot to address that fact. The intent is that either threaded processes after initiating threading will not have permission to execute this, or all the processes in the thread group will be forced into the same container. I don't have a strong opinion on whether or not it must be the lead thread process that must be the one to receive that registration, but I suspect that would be wise.

...

> Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid > container ID present on an auditable action or event. > > Forked and cloned processes inherit their parent's container ID, > referenced in the process' audit_context struct. So a cloned process with CLONE_NEWNS has the came container ID as the parent process that called clone, at least until the clone has time to change to a new container ID?

Yes.

...

Do you forsee any case where someone might need a semantic that is slightly different? For example wanting to set the container ID on clone?

I could envision that situation and I think that might be workable but for the synchronicity of having one initiated by a specific syscall and the other initiated by a /proc write.

...

> Log the creation of every namespace, inheriting/adding its spawning > process' containerID(s), if applicable. Include the spawning and > spawned namespace IDs (device and inode number tuples). > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] > Note: At this point it appears only network namespaces may need to track > container IDs apart from processes since incoming packets may cause an > auditable event before being associated with a process. OK. > Log the destruction of every namespace when it is no longer used by any > process, include the namespace IDs (device and inode number tuples). > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] > > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) > the parent and child namespace IDs for any changes to a process' > namespaces. [setns(2)] > Note: It may be possible to combine AUDIT_NS_* record formats and > distinguish them with an op=$action field depending on the fields > required for each message type. > > A process can be moved from one container to another by using the > container assignment method outlined above a second time. OK. > When a container ceases to exist because the last process in that > container has exited and hence the last namespace has been destroyed and > its refcount dropping to zero, log the fact. > (This latter is likely needed for certification accountability.) A > container object may need a list of processes and/or namespaces. OK. > A namespace cannot directly migrate from one container to another but > could be assigned to a newly spawned container. A namespace can be > moved from one container to another indirectly by having that namespace > used in a second process in another container and then ending all the > processes in the first container. OK. > Feedback please.

Thank you sir!

...

Carlos.

- RGB -- Richard Guy Briggs <rgb(a)redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635

Richard Guy Briggs

Friday, 15 September Fri, 15 Sep

5:19 a.m.

On 2017-09-14 01:30, Richard Guy Briggs wrote:

...

On 2017-09-13 14:33, Carlos O'Donell wrote: > On 09/13/2017 12:13 PM, Richard Guy Briggs wrote: > > Containers are a userspace concept. The kernel knows nothing of them. > > I am looking at this RFC from a userspace perspective, particularly from > the loader's point of view and the unshare syscall and the semantics that > arise from the use of it. > > At a high level what you are doing is providing a way to group, without > hierarchy, processes and namespaces. The processes can move between > container's if they have CAP_CONTAINER_ADMIN and can open and write to > a special proc file.

I should clarify: It wasn't intended that a process can see or modify its own or a peer's special proc container file to be able to set it or discover its value. This was only meant for its orchestrator or delegated agents to do. This can't be left only to CAP_CONTAINER_ADMIN. This may require a container to have its own mount namespace if the trigger mechanism is a proc file write. Other methods (additional namespaces?) may be needed to restrict it for other trigger methods (syscall?).

...

> * With unshare a thread may dissociate part of its execution context and > therefore see a distinct mount namespace. When you say "process" in this > particular RFC do you exclude the fact that a thread might be in a > distinct container from the rest of the threads in the process? > > > The Linux audit system needs a way to be able to track the container > > provenance of events and actions. Audit needs the kernel's help to do > > this. > > * Why does the Linux audit system need to tracker container provenance? - ability to filter unwanted, irrelevant or unimportant messages before they fill queue so important messages don't get lost. This is a certification requirement. - ability to make security claims about containers, require tracking of actions within those containers to ensure compliance with established security policies. - ability to route messages from events to relevant audit daemon instance or host audit daemon instance or both, as required or determined by user-initiated rules > - How does it help to provide better audit messages? > > - Is it be enough to list the namespace that a process occupies? We started with that approach back more than 4 years ago and found it helped, but didn't go far enough in terms of quick and inexpensive record filtering and left some doubt about provenance of events in the case of non-user context events (incoming network packets). > * Why does it need the kernel's help? > > - Is there a race condition that is only fixable with kernel support? This was a concern, but relatively minor compared with the other benefits. > - Or is it easier with kernel help but not required? It is much easier and much less expensive. > Providing background on these questions would help clarify the > design requirements. Here are some references that should help provide some background: https://github.com/linux-audit/audit-kernel/issues/32 RFE: add namespace IDs to audit records https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualizati... SPEC Virtualization Manager Guest Lifecycle Events https://lwn.net/Articles/699819/ Audit, namespaces, and containers https://lwn.net/Articles/723561/ Containers as kernel objects (my reply, with references: https://lkml.org/lkml/2017/8/14/15 ) https://bugzilla.redhat.com/show_bug.cgi?id=1045666 audit: add namespace IDs to log records > > Since the concept of a container is entirely a userspace concept, a > > trigger signal from the userspace container orchestration system > > initiates this. This will define a point in time and a set of resources > > associated with a particular container with an audit container ID. > > Please don't use the word 'signal', I suggest 'register' since you are > writing to a filesystem. Ok, that's a very reasonable request. 'signal' has a previous meaning. > > The trigger is a pseudo filesystem (proc, since PID tree already exists) > > write of a u64 representing the container ID to a file representing a > > process that will become the first process in a new container. > > This might place restrictions on mount namespaces required to define a > > container, or at least careful checking of namespaces in the kernel to > > verify permissions of the orchestrator so it can't change its own > > container ID. > > A bind mount of nsfs may be necessary in the container orchestrator's > > mntNS. > > > > Require a new CAP_CONTAINER_ADMIN to be able to write to the pseudo > > filesystem to have this action permitted. At that time, record the > > child container's user-supplied 64-bit container identifier along with > > What is a "child container?" Containers don't have any hierarchy. Maybe some don't, but that's not likely to last long given the abstraction and nesting of orchestration tools. This must be nestable.

This is why we can't rely only on CAP_CONTAINER_ADMIN to restrict the ability for self-modification or self-discovery.

...

> I assume that if you don't have CAP_CONTAINER_ADMIN, that nothing prevents > your continued operation as we have today? Correct. It won't prevent processes that otherwise have permissions today from creating all the namespaces it wishes. > > the child container's first process (which may become the container's > > "init" process) process ID (referenced from the initial PID namespace), > > all namespace IDs (in the form of a nsfs device number and inode number > > tuple) in a new auxilliary record AUDIT_CONTAINER with a qualifying > > op=$action field. > > What kind of requirement is there on the first tid/pid registering > the container ID? What if the 8th tid/pid does the registration? > Would that mean that the first process of the container did not > register? It seems like you are suggesting that the registration > by the 8th tid/pid causes a cascading registration progress, > registering all tid/pids in the same grouping? Is that true? Ah, good question, I forgot to address that fact. The intent is that either threaded processes after initiating threading will not have permission to execute this, or all the processes in the thread group will be forced into the same container. I don't have a strong opinion on whether or not it must be the lead thread process that must be the one to receive that registration, but I suspect that would be wise. > > Issue a new auxilliary record AUDIT_CONTAINER_INFO for each valid > > container ID present on an auditable action or event. > > > > Forked and cloned processes inherit their parent's container ID, > > referenced in the process' audit_context struct. > > So a cloned process with CLONE_NEWNS has the came container ID > as the parent process that called clone, at least until the clone > has time to change to a new container ID? Yes.

And as pointed to above, it isn't the process itself that is able to change to a new container, but its orchestrator to move/assign it.

...

> Do you forsee any case where someone might need a semantic that is > slightly different? For example wanting to set the container ID on > clone? I could envision that situation and I think that might be workable but for the synchronicity of having one initiated by a specific syscall and the other initiated by a /proc write.

The ability to clone while providing a containerID would work really well, but I'm hesitant to extend or duplicate the clone call. This actually sounds like a potentially sane way of approaching it.

...

> > Log the creation of every namespace, inheriting/adding its spawning > > process' containerID(s), if applicable. Include the spawning and > > spawned namespace IDs (device and inode number tuples). > > [AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)] > > Note: At this point it appears only network namespaces may need to track > > container IDs apart from processes since incoming packets may cause an > > auditable event before being associated with a process. > > OK. > > > Log the destruction of every namespace when it is no longer used by any > > process, include the namespace IDs (device and inode number tuples). > > [AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)] > > > > Issue a new auxilliary record AUDIT_NS_CHANGE listing (opt: op=$action) > > the parent and child namespace IDs for any changes to a process' > > namespaces. [setns(2)] > > Note: It may be possible to combine AUDIT_NS_* record formats and > > distinguish them with an op=$action field depending on the fields > > required for each message type. > > > > A process can be moved from one container to another by using the > > container assignment method outlined above a second time. > > OK. > > > When a container ceases to exist because the last process in that > > container has exited and hence the last namespace has been destroyed and > > its refcount dropping to zero, log the fact. > > (This latter is likely needed for certification accountability.) A > > container object may need a list of processes and/or namespaces. > > OK. > > > A namespace cannot directly migrate from one container to another but > > could be assigned to a newly spawned container. A namespace can be > > moved from one container to another indirectly by having that namespace > > used in a second process in another container and then ending all the > > processes in the first container. > > OK. > > > Feedback please. Thank you sir! > Carlos. - RGB -- Richard Guy Briggs <rgb(a)redhat.com> Sr. S/W Engineer, Kernel Security, Base Operating Systems Remote, Ottawa, Red Hat Canada IRC: rgb, SunRaycer Voice: +1.647.777.2635, Internal: (81) 32635 -- Linux-audit mailing list Linux-audit(a)redhat.com https://www.redhat.com/mailman/listinfo/linux-audit

Eric W. Biederman

Thursday, 14 September Thu, 14 Sep

12:33 p.m.

Richard Guy Briggs <rgb(a)redhat.com> writes:

...

Why a u64? Why a proc filesystem write and not a magic audit message? I don't like the fact that the proc filesystem entry is likely going to be readable and abusable by non-audit contexts? Why the ability to change the containerid? What is the use case you are thinking of there? Eric

Richard Guy Briggs

1:07 p.m.

On 2017-09-14 12:33, Eric W. Biederman wrote:

...

Richard Guy Briggs <rgb(a)redhat.com> writes: > The trigger is a pseudo filesystem (proc, since PID tree already exists) > write of a u64 representing the container ID to a file representing a > process that will become the first process in a new container. > This might place restrictions on mount namespaces required to define a > container, or at least careful checking of namespaces in the kernel to > verify permissions of the orchestrator so it can't change its own > container ID. Why a u64?

u32 will roll too quickly. UUID is large enough that it adds significantly to audit record bandwidth. I'd prefer u64, but can look at the difference of accommodating a UUID...

...

Why a proc filesystem write and not a magic audit message?

A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use sparingly. Given that orchestrators will already require it to send the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden. I was originally leaning towards an audit message trigger or a syscall.

...

I don't like the fact that the proc filesystem entry is likely going to be readable and abusable by non-audit contexts?

This proposal wasn't going to start with that link being readable, but its filesystem structure and link names would be, perhaps giving away too much already. I think we will need to find a way for the orchestrator or one of its authorized agents to read this information while blocking reads from unauthorized agents, otherwise this would be of very limited use.

...

Why the ability to change the containerid? What is the use case you are thinking of there?

This was covered in the end of the conversation with Paul Moore (that maybe you got tired reading?) I'd originally proposed having it write once, but Paul figured there was no good reason to restrict it and leave that decision up to the orchestrator. The use case would be adding other processes to a container, but it could be argued all additional processes should be spawned by the first process in a container.

...

Eric

Eric W. Biederman

Monday, 18 September Mon, 18 Sep

9:45 p.m.

Richard Guy Briggs <rgb(a)redhat.com> writes:

...

On 2017-09-14 12:33, Eric W. Biederman wrote: > Richard Guy Briggs <rgb(a)redhat.com> writes: > > > The trigger is a pseudo filesystem (proc, since PID tree already exists) > > write of a u64 representing the container ID to a file representing a > > process that will become the first process in a new container. > > This might place restrictions on mount namespaces required to define a > > container, or at least careful checking of namespaces in the kernel to > > verify permissions of the orchestrator so it can't change its own > > container ID. > > Why a u64? u32 will roll too quickly. UUID is large enough that it adds significantly to audit record bandwidth. I'd prefer u64, but can look at the difference of accommodating a UUID...

I was imagining a string might be better. As for the purposes of audit it is just a byte string you regurgitate.

...

> Why a proc filesystem write and not a magic audit message? A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use sparingly. Given that orchestrators will already require it to send the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden. I was originally leaning towards an audit message trigger or a syscall. > I don't like the fact that the proc filesystem entry is likely going to > be readable and abusable by non-audit contexts? This proposal wasn't going to start with that link being readable, but its filesystem structure and link names would be, perhaps giving away too much already. I think we will need to find a way for the orchestrator or one of its authorized agents to read this information while blocking reads from unauthorized agents, otherwise this would be of very limited use.

Something that is set only for future audit messages seems reasonable. Once you start reading this from something other than audit messages I get neverous, that people will use this beyond audit for things it is not intended for.

...

> Why the ability to change the containerid? What is the use case you are > thinking of there? This was covered in the end of the conversation with Paul Moore (that maybe you got tired reading?)

I have not had time to review everything. As I was busy preparing for my wedding and am now in the middle of my honeymoon.

...

I'd originally proposed having it write once, but Paul figured there was no good reason to restrict it and leave that decision up to the orchestrator. The use case would be adding other processes to a container, but it could be argued all additional processes should be spawned by the first process in a container.

I see two cases here: a) Nested containers b) Inject processes via something like nsenter into a container. In case a) you have to figure out what to do with nested containers and that does seem to be a legitimate case for a double write. Arguably with the restriction that you must specify a more nested label. In case b) which you seem to be referring to it would be a process created by the container manager outside the container that has no container label. At which point there is not a need for a double write. So my recommendation is to not support double writes until you support nested containers. Eric

Richard Guy Briggs

11:15 p.m.

On 2017-09-18 21:45, Eric W. Biederman wrote:

...

Richard Guy Briggs <rgb(a)redhat.com> writes: > On 2017-09-14 12:33, Eric W. Biederman wrote: >> Richard Guy Briggs <rgb(a)redhat.com> writes: >> >> > The trigger is a pseudo filesystem (proc, since PID tree already exists) >> > write of a u64 representing the container ID to a file representing a >> > process that will become the first process in a new container. >> > This might place restrictions on mount namespaces required to define a >> > container, or at least careful checking of namespaces in the kernel to >> > verify permissions of the orchestrator so it can't change its own >> > container ID. >> >> Why a u64? > > u32 will roll too quickly. UUID is large enough that it adds > significantly to audit record bandwidth. I'd prefer u64, but can look > at the difference of accommodating a UUID... I was imagining a string might be better. As for the purposes of audit it is just a byte string you regurgitate.

Yes, so looking at u128 vs dhowells' proposal, it would be 16 bytes vs 24 bytes, which really isn't that much difference... What length of string length were you envisioning?

...

>> Why a proc filesystem write and not a magic audit message? > > A magic audit message requires CAP_AUDIT_WRITE, which we'd like to use > sparingly. Given that orchestrators will already require it to send > the mandatory AUDIT_VIRT_*, this doesn't seem like an unreasonable burden. > > I was originally leaning towards an audit message trigger or a syscall. > >> I don't like the fact that the proc filesystem entry is likely going to >> be readable and abusable by non-audit contexts? > > This proposal wasn't going to start with that link being readable, but > its filesystem structure and link names would be, perhaps giving away > too much already. > > I think we will need to find a way for the orchestrator or one of its > authorized agents to read this information while blocking reads from > unauthorized agents, otherwise this would be of very limited use. Something that is set only for future audit messages seems reasonable. Once you start reading this from something other than audit messages I get neverous, that people will use this beyond audit for things it is not intended for.

Understandably. At the same time, if we implement something that is more broadly useful and solves a number of other challenges others are facing, how can we make it available while limiting the potential for abuse?

...

>> Why the ability to change the containerid? What is the use case you are >> thinking of there? > > This was covered in the end of the conversation with Paul Moore (that > maybe you got tired reading?) I have not had time to review everything. As I was busy preparing for my wedding and am now in the middle of my honeymoon.

I'm very sorry, my bad! You had given me a heads up about this and I appologise for causing a stir during your special time.

...

> I'd originally proposed having it write > once, but Paul figured there was no good reason to restrict it and leave > that decision up to the orchestrator. The use case would be adding > other processes to a container, but it could be argued all additional > processes should be spawned by the first process in a container. I see two cases here: a) Nested containers b) Inject processes via something like nsenter into a container. In case a) you have to figure out what to do with nested containers and that does seem to be a legitimate case for a double write. Arguably with the restriction that you must specify a more nested label.

Is this technically a double write if it is an inheritance? That should be solvable with a flag.

...

In case b) which you seem to be referring to it would be a process created by the container manager outside the container that has no container label. At which point there is not a need for a double write.

Looking at the potential for nesting, if the orchestrator is already in a container, then it would already have a label, but if we refer to the flag solution above, then it is still the first write.

...

So my recommendation is to not support double writes until you support nested containers.

I think this is a reasonable restriction. Thanks for your time. Sorry to disturb your holiday.

...

Eric

3099

days inactive

3105

days old

linux-audit@lists.linux-audit.osci.io

Manage subscription

7 comments

3 participants

tags (0)

participants (3)

Carlos O'Donell
Eric W. Biederman
Richard Guy Briggs

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

RFC: Audit Kernel Container IDs