On 15/04/22, Richard Guy Briggs wrote:
On 15/04/20, Eric W. Biederman wrote:
> Richard Guy Briggs <rgb(a)redhat.com> writes:
>
> > The purpose is to track namespace instances in use by logged processes from
> > the perspective of init_*_ns by logging the namespace IDs (device ID and
> > namespace inode - offset).
>
> In broad strokes the user interface appears correct.
>
> Things that I see that concern me:
>
> - After Al's most recent changes these inodes no longer live in the proc
> superblock so the device number reported in these patches is
> incorrect.
Ok, found the patchset you're talking about:
3d3d35b kill proc_ns completely
e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
f77c801 bury struct proc_ns in fs/proc
33c4294 copy address of proc_ns_ops into ns_common
6344c43 new helpers: ns_alloc_inum/ns_free_inum
6496452 make proc_ns_operations work with struct ns_common * instead of void *
3c04118 switch the rest of proc_ns_operations to working with &...->ns
ff24870 netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
58be2825 make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
435d5f4 common object embedded into various struct ....ns
Ok, I've got some minor jigging to do to get inum too...
Do I even need to report the device number anymore since I am concluding
s_dev is never set (or always zero) in the nsfs filesystem by
mount_pseudo() and isn't even mountable? In fact, I never needed to
report the device since proc ida/idr and inodes are kernel-global and
namespace-oblivious.
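
For reference, the device and inode behind a namespace link can be read from
userspace with a plain stat(). A minimal sketch, assuming a Linux /proc mounted
in the usual place; the helper names ns_ident() and ns_print() are made up for
illustration and are not part of the patchset:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Look up the (device, inode) pair behind a path such as /proc/self/ns/net.
 * stat() follows the symlink, so the values describe the namespace inode
 * itself.  Returns 0 on success, -1 on failure (errno is set by stat()). */
int ns_ident(const char *path, unsigned long long *dev, unsigned long long *ino)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    *dev = (unsigned long long)st.st_dev;
    *ino = (unsigned long long)st.st_ino;
    return 0;
}

/* Print one namespace of the calling process, with the device shown as
 * hexadecimal major:minor like the dev= field in the proposed records. */
int ns_print(const char *name)
{
    char path[64];
    unsigned long long dev, ino;

    snprintf(path, sizeof(path), "/proc/self/ns/%s", name);
    if (ns_ident(path, &dev, &ino) != 0)
        return -1;
    printf("%sns: dev=%02x:%02x ino=%llu\n",
           name, major(dev), minor(dev), ino);
    return 0;
}
```

Calling ns_print("net") prints the network namespace device and inode of the
calling process, which is a quick way to check what a given kernel reports for
the device.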
> - I am nervous about audit logs being flooded with users creating lots
> of namespaces. But that is more your lookout than mine.
There was a thought to create a filter to en/disable this logging...
It is an auxiliary record to syscalls, so it can be ignored by userspace tools.
> - unshare is not logging when it creates new namespaces.
They are all covered:
sys_unshare > unshare_userns > create_user_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_mnt_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_utsname > clone_uts_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_ipcs > get_ipc_ns
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_pid_ns > create_pid_namespace
sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces > copy_net_ns
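
To make the mapping concrete, here is a small userspace sketch that takes an
unshare()/clone() flags mask and lists which of the paths above it would go
through. The struct and function names are hypothetical; only the CLONE_NEW*
flags and the kernel function names (the first namespace-specific function in
each chain above) come from the kernel:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <stddef.h>

/* Fallbacks in case the libc headers do not expose the flags. */
#ifndef CLONE_NEWNS
#define CLONE_NEWNS   0x00020000
#endif
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS  0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC  0x08000000
#endif
#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER 0x10000000
#endif
#ifndef CLONE_NEWPID
#define CLONE_NEWPID  0x20000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET  0x40000000
#endif

/* Hypothetical table type: one row per namespace kind. */
struct ns_path {
    unsigned long flag;   /* CLONE_NEW* bit that requests the namespace */
    const char *name;     /* short name as used in the audit fields */
    const char *creator;  /* kernel function reached from sys_unshare */
};

static const struct ns_path ns_paths[] = {
    { CLONE_NEWUSER, "user", "create_user_ns" },
    { CLONE_NEWNS,   "mnt",  "copy_mnt_ns"    },
    { CLONE_NEWUTS,  "uts",  "copy_utsname"   },
    { CLONE_NEWIPC,  "ipc",  "copy_ipcs"      },
    { CLONE_NEWPID,  "pid",  "copy_pid_ns"    },
    { CLONE_NEWNET,  "net",  "copy_net_ns"    },
};

/* Store pointers to the rows selected by flags in out[] (at most max of
 * them, in table order) and return how many were stored. */
size_t ns_paths_for_flags(unsigned long flags,
                          const struct ns_path **out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < sizeof(ns_paths) / sizeof(ns_paths[0]); i++) {
        if ((flags & ns_paths[i].flag) && n < max)
            out[n++] = &ns_paths[i];
    }
    return n;
}
```

For example, a mask of CLONE_NEWUTS | CLONE_NEWNET selects the copy_utsname and
copy_net_ns rows, so two creation records would be expected for that call.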
> As small numbers are nice and these inodes all live in their own
> superblock now we should be able to remove the games with
> PROC_DYNAMIC_FIRST and just use small numbers for these inodes
> everywhere.
That is compelling if I can untangle the proc inode allocation code from the
ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
the place of the existing PROC_*_INIT_INO.
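
Roughly what I have in mind, as a userspace model only. This is not kernel
code: the real thing would sit on an ida with its own lock, and the value of
the first dynamic number below is an arbitrary assumption. The model_ prefix is
there to make clear these are not the real helpers:

```c
#include <stdint.h>

#define MODEL_NS_DYNAMIC_FIRST 8u   /* assumed value, illustration only */
#define MODEL_NS_MAX_IDS       64u  /* size of this toy ID space */

/* Bit n set means ID (MODEL_NS_DYNAMIC_FIRST + n) is in use.
 * No locking: this model is single-threaded. */
static uint64_t model_ns_bitmap;

/* Hand out the lowest free ID.  Returns 0 and stores the ID in *inum,
 * or returns -1 when the toy ID space is exhausted. */
int model_ns_alloc_inum(unsigned int *inum)
{
    for (unsigned int n = 0; n < MODEL_NS_MAX_IDS; n++) {
        uint64_t bit = UINT64_C(1) << n;

        if (!(model_ns_bitmap & bit)) {
            model_ns_bitmap |= bit;
            *inum = MODEL_NS_DYNAMIC_FIRST + n;
            return 0;
        }
    }
    return -1;
}

/* Release an ID so it can be handed out again; out-of-range IDs
 * are ignored. */
void model_ns_free_inum(unsigned int inum)
{
    if (inum < MODEL_NS_DYNAMIC_FIRST ||
        inum >= MODEL_NS_DYNAMIC_FIRST + MODEL_NS_MAX_IDS)
        return;
    model_ns_bitmap &= ~(UINT64_C(1) << (inum - MODEL_NS_DYNAMIC_FIRST));
}
```

The point is only that the numbers stay small and get reused, with the fixed
initial namespace numbers living below the first dynamic one.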
> I have answered your comments below.
More below...
> > 1/10 exposes proc's ns entries structure which lists a number of useful
> > operations per namespace type for other subsystems to use.
> >
> > 2/10 proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
> >
> > 3/10 provides an example of usage for audit_log_task_info() which is used by
> > syscall audits, among others. audit_log_task() and
> > audit_common_recv_message() would be other potential use cases.
> >
> > Proposed output format:
> > This differs slightly from Aristeu's patch because of the label conflict with
> > "pid=" due to including it in existing records rather than it being a separate
> > record. It has now returned to being a separate record. The proc device
> > major/minor are listed in hexadecimal and namespace IDs are the proc inode
> > minus the base offset.
> > type=NS_INFO msg=audit(1408577535.306:82): dev=00:03 netns=3 utsns=-3 ipcns=-4 pidns=-1 userns=-2 mntns=0
> >
> > 4/10 change audit startup from __initcall to subsys_initcall to get it started
> > earlier to be able to receive initial namespace log messages.
> >
> > 5/10 tracks the creation and deletion of namespaces, listing the type of
> > namespace instance, proc device ID, related namespace id if there is one and
> > the newly minted namespace ID.
> >
> > Proposed output format for initial namespace creation:
> > type=AUDIT_NS_INIT_UTS msg=audit(1408577534.868:5): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_utsns=(none) utsns=-3 res=1
> > type=AUDIT_NS_INIT_USER msg=audit(1408577534.868:6): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_userns=(none) userns=-2 res=1
> > type=AUDIT_NS_INIT_PID msg=audit(1408577534.868:7): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_pidns=(none) pidns=-1 res=1
> > type=AUDIT_NS_INIT_MNT msg=audit(1408577534.868:8): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_mntns=(none) mntns=0 res=1
> > type=AUDIT_NS_INIT_IPC msg=audit(1408577534.868:9): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_ipcns=(none) ipcns=-4 res=1
> > type=AUDIT_NS_INIT_NET msg=audit(1408577533.500:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel dev=00:03 old_netns=(none) netns=2 res=1
> >
> > And a CLONE action would result in:
> > type=AUDIT_NS_INIT_NET msg=audit(1408577535.306:81): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 old_netns=2 netns=3 res=1
> >
> > While deleting a namespace would result in:
> > type=AUDIT_NS_DEL_MNT msg=audit(1408577552.221:85): pid=481 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 dev=00:03 mntns=4 res=1
> >
> > 6/10 accepts a PID from userspace and requests logging an AUDIT_NS_INFO record
> > type (CAP_AUDIT_CONTROL required).
> >
> > 7/10 is a macro for CLONE_NEW_* flags.
> >
> > 8/10 adds auditing on creation of namespace(s) in fork.
> >
> > 9/10 adds auditing a change of namespace on setns.
> >
> > 10/10 attaches an AUDIT_NS_INFO record to AUDIT_VIRT_CONTROL records
> > (CAP_AUDIT_WRITE required).
> >
> >
> > v5 -> v6:
> > Switch to using namespace ID based on namespace proc inode minus base offset
> > Added proc device ID to qualify proc inode reference
> > Eliminate exposed /proc interface
> >
> > v4 -> v5:
> > Clean up prototypes for dependencies on CONFIG_NAMESPACES.
> > Add AUDIT_NS_INFO record type to AUDIT_VIRT_CONTROL record.
> > Log AUDIT_NS_INFO with PID.
> > Move /proc/<pid>/ns_* patches to end of patchset to deprecate them.
> > Log on changing ns (setns).
> > Log on creating new namespaces when forking.
> > Added a macro for CLONE_NEW*.
> >
> > v3 -> v4:
> > Separate out the NS_INFO message from the SYSCALL message.
> > Moved audit_log_namespace_info() out of audit_log_task_info().
> > Use a separate message type per namespace type for each of INIT/DEL.
> > Make ns= easier to search across NS_INFO and NS_INIT/DEL_XXX msg types.
> > Add /proc/<pid>/ns/ documentation.
> > Fix dynamic initial ns logging.
> >
> > v2 -> v3:
> > Use atomic64_t in ns_serial to simplify it.
> > Avoid function duplication in proc, keying on dentry.
> > Squash down audit patch to avoid rcu sleep issues.
> > Add tracking for creation and deletion of namespace instances.
> >
> > v1 -> v2:
> > Avoid rollover by switching from an int to a long long.
> > Change rollover behaviour from simply avoiding zero to raising a BUG.
> > Expose serial numbers in /proc/<pid>/ns/*_snum.
> > Expose ns_entries and use it in audit.
> >
> >
> > Notes:
> > As for CAP_AUDIT_READ, a patchset has been accepted upstream to check
> > capabilities of userspace processes that try to join netlink broadcast groups.
> >
> > This set does not try to solve the non-init namespace audit messages and
> > auditd problem yet. That will come later, likely with additional auditd
> > instances running in another namespace with a limited ability to influence the
> > master auditd. I echo Eric B's idea that messages destined for different
> > namespaces would have to be tailored for that namespace with references that
> > make sense (such as the right pid number reported to that pid namespace, and
> > not leaking info about parents or peers).
> >
> > Questions:
> > Is there a way to link serial numbers of namespaces involved in migration of a
> > container to another kernel? It sounds like what is needed is a part of a
> > management application that is able to pull the audit records from constituent
> > hosts to build an audit trail of a container.
>
> I honestly don't know how much we are going to care about namespace ids
> during migration. So far this is not a problem that has come up.
Not for CRIU, but it will be an issue for a container auditor that aggregates
information from individually audited hosts.
> I don't think migration becomes a practical concern (other than
> interface wise) until we achieve a non-init namespace auditd. The easy way
> to handle migration would be to log a setns of every process from their
> old namespaces to their new namespaces. As you appear to have a setns
> event defined.
Again, this would be taken care of by a layer above that is container-aware
across multiple hosts.
> How to handle the more general case beyond audit remains unclear. I
> think it will be a little while yet before we start dealing with
> migrating applications that care. When we do we will either need to
> generate some kind of hot-plug event that userspace can respond to and
> discover all of the appropriate file-system nodes have changed, or we
> will need to build a mechanism in the kernel to preserve these numbers.
I don't expect to need to preserve these numbers. The higher layer application
will be able to do that translation.
> I really don't know which solution we will wind up with in the kernel at
> this point.
>
> > What additional events should list this information?
>
> At least unshare.
Already covered as noted above. If it is a brand new namespace, it will show
the old one as "(none)" (or maybe zero now that we are looking at renumbering
the NS inodes). If it is an unshared one, it will show the old one from which
it was unshared.
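
For anyone post-processing these records, a hedged sketch of pulling the
old/new namespace fields out of a record line, including the "(none)" case.
The function name ns_field() and its return convention are invented for this
example, and the field layout is only the proposed format from this posting,
which may still change:

```c
#include <stdlib.h>
#include <string.h>

/* Find "key=value" in a space-separated audit record.
 * Returns  1 and stores the number in *val when the value is numeric,
 *          0 when the value is the literal "(none)",
 *         -1 when the key is absent or the value is not a number.
 * The key must start the record or follow a space, so that "netns"
 * does not match inside "old_netns". */
int ns_field(const char *rec, const char *key, long *val)
{
    size_t klen = strlen(key);
    const char *p = rec;

    if (klen == 0)
        return -1;

    while ((p = strstr(p, key)) != NULL) {
        int at_boundary = (p == rec) || (p[-1] == ' ');

        if (at_boundary && p[klen] == '=') {
            const char *v = p + klen + 1;
            char *end;

            if (strncmp(v, "(none)", 6) == 0)
                return 0;
            *val = strtol(v, &end, 10);
            return (end != v) ? 1 : -1;
        }
        p += klen;  /* keep scanning after this non-matching hit */
    }
    return -1;
}
```

With the CLONE example record from the cover letter, ns_field(rec, "old_netns",
&v) yields 2 and ns_field(rec, "netns", &v) yields 3, while the initial
namespace records return 0 for the old_* key.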
> > Does this present any problematic information leaks? Only CAP_AUDIT_CONTROL
> > (and now CAP_AUDIT_READ) in init_user_ns can get to this information in
> > the init namespace at the moment from audit.
>
> Good question. Today access to this information is generally guarded
> with CAP_SYS_PTRACE.
>
> I suspect for some of audit's tracing features like this one we should
> also use CAP_SYS_PTRACE so that we have a consistent set of checks for
> getting information about applications.
I assume CAP_SYS_PTRACE is orthogonal to CAP_AUDIT_{CONTROL,READ} and that
CAP_SYS_PTRACE would need to be insufficient to get that information.
Thanks for your thoughtful feedback, Eric.
> Eric
>
> > Richard Guy Briggs (10):
> > namespaces: expose ns_entries
> > proc_ns: define PROC_*_INIT_INO in terms of PROC_DYNAMIC_FIRST
> > audit: log namespace ID numbers
> > audit: initialize at subsystem time rather than device time
> > audit: log creation and deletion of namespace instances
> > audit: dump namespace IDs for pid on receipt of AUDIT_NS_INFO
> > sched: add a macro to ref all CLONE_NEW* flags
> > fork: audit on creation of new namespace(s)
> > audit: log on switching namespace (setns)
> > audit: emit AUDIT_NS_INFO record with AUDIT_VIRT_CONTROL record
> >
> > fs/namespace.c | 13 +++
> > fs/proc/generic.c | 3 +-
> > fs/proc/namespaces.c | 2 +-
> > include/linux/audit.h | 20 +++++
> > include/linux/proc_ns.h | 10 ++-
> > include/uapi/linux/audit.h | 21 +++++
> > include/uapi/linux/sched.h | 6 ++
> > ipc/namespace.c | 12 +++
> > kernel/audit.c | 169 +++++++++++++++++++++++++++++++++++++-
> > kernel/auditsc.c | 2 +
> > kernel/fork.c | 3 +
> > kernel/nsproxy.c | 4 +
> > kernel/pid_namespace.c | 13 +++
> > kernel/user_namespace.c | 13 +++
> > kernel/utsname.c | 12 +++
> > net/core/net_namespace.c | 12 +++
> > security/integrity/ima/ima_api.c | 2 +
> > 17 files changed, 309 insertions(+), 8 deletions(-)
- RGB
--
Richard Guy Briggs <rbriggs(a)redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545