On Sun, Nov 15, 2020 at 5:39 AM Christian Brauner
<christian.brauner(a)ubuntu.com> wrote:
When interacting with user namespace and non-user namespace aware
filesystem capabilities the vfs will perform various security checks to
determine whether or not the filesystem capabilities can be used by the
caller (e.g. during exec), or even whether they need to be removed. The
main infrastructure for this resides in the capability codepaths but they
are called through the LSM security infrastructure even though they are not
technically an LSM or optional. This extends the existing security hooks
security_inode_removexattr(), security_inode_killpriv(),
security_inode_getsecurity() to pass down the mount's user namespace and
makes them aware of idmapped mounts.
In order to actually get filesystem capabilities from disk the capability
infrastructure exposes the get_vfs_caps_from_disk() helper. For user
namespace aware filesystem capabilities a root uid is stored alongside the
capabilities.
In order to determine whether the caller can make use of the filesystem
capability or whether it needs to be ignored it is translated according to
the superblock's user namespace. If it can be translated to uid 0 according
to that id mapping the caller can use the filesystem capabilities stored on
disk. If we are accessing the inode that holds the filesystem capabilities
through an idmapped mount we need to map the root uid according to the
mount's user namespace.
Afterwards the checks are identical to non-idmapped mounts. Reading
filesystem caps from disk enforces that the root uid associated with the
filesystem capability must have a mapping in the superblock's user
namespace and that the caller is either in the same user namespace or is a
descendant of the superblock's user namespace. For filesystems that are
mountable inside user namespace the container can just mount the filesystem
and won't usually need to idmap it. If it does create an idmapped mount it
can mark it with a user namespace it has created and which is therefore a
descendant of the s_user_ns. For filesystems that are not mountable inside
user namespaces the descendant rule is trivially true because the s_user_ns
will be the initial user namespace.
If the initial user namespace is passed all operations are a nop so
non-idmapped mounts will not see a change in behavior and will also not see
any performance impact.
Cc: Christoph Hellwig <hch(a)lst.de>
Cc: David Howells <dhowells(a)redhat.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: linux-fsdevel(a)vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner(a)ubuntu.com>
...
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 8dba8f0983b5..ddb9213a3e81 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1944,7 +1944,7 @@ static inline int audit_copy_fcaps(struct audit_names *name,
if (!dentry)
return 0;
- rc = get_vfs_caps_from_disk(dentry, &caps);
+ rc = get_vfs_caps_from_disk(&init_user_ns, dentry, &caps);
if (rc)
return rc;
@@ -2495,7 +2495,8 @@ int __audit_log_bprm_fcaps(struct linux_binprm *bprm,
ax->d.next = context->aux;
context->aux = (void *)ax;
- get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
+ get_vfs_caps_from_disk(mnt_user_ns(bprm->file->f_path.mnt),
+ bprm->file->f_path.dentry, &vcaps);
As audit currently records information in the context of the
initial/host namespace I'm guessing we don't want the mnt_user_ns()
call above; it seems like &init_user_ns would be the right choice
(similar to audit_copy_fcaps()), yes?
ax->fcap.permitted = vcaps.permitted;
ax->fcap.inheritable = vcaps.inheritable;
--
paul moore
www.paul-moore.com