On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
Hey everyone,
I vanished for a little while to focus on this work here so sorry for
not being available by mail for a while.
For quite a long time we have had issues with sharing mounts between
multiple unprivileged containers with different id mappings, sharing a
rootfs between multiple containers with different id mappings, and also
sharing regular directories and filesystems between users with different
uids and gids. The latter use-cases have become even more important with
the availability and adoption of systemd-homed (cf. [1]) to implement
portable home directories.
The solutions we have tried and proposed so far include the introduction
of fsid mappings, a tiny overlay based filesystem, and an approach to
call override creds in the vfs. None of these solutions have covered all
of the above use-cases.
The solution proposed here has its origins in multiple discussions
at Linux Plumbers 2017, during and after the end of the containers
microconference.
To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
James, and myself. A variant of the solution proposed here has also been
discussed, again to the best of my knowledge, after a Linux conference
in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
after Linux Plumbers.
I've taken the time to finally implement a working version of this
solution over the last weeks to the best of my abilities. Tycho has
signed up for this slightly crazy endeavour as well and he has helped
with the conversion of the xattr codepaths.
The core idea is to make idmappings a property of struct vfsmount
instead of tying it to a process being inside of a user namespace which
has been the case for all other proposed approaches.
It means that idmappings become a property of bind-mounts, i.e. each
bind-mount can have a separate idmapping. This has the obvious advantage
that idmapped mounts can be created inside of the initial user
namespace, i.e. on the host itself instead of requiring the caller to be
located inside of a user namespace. This enables such use-cases as e.g.
making a usb stick available in multiple locations with different
idmappings (see the vfat port that is part of this patch series).
The vfsmount struct gains a new struct user_namespace member. The
idmapping of the user namespace becomes the idmapping of the mount. A
caller that is either privileged with respect to the user namespace of
the superblock of the underlying filesystem or a caller that is
privileged with respect to the user namespace a mount has been idmapped
with can create a new bind-mount and mark it with a user namespace. The
user namespace the mount will be marked with can be specified by passing
a file descriptor referring to the user namespace as an argument to the
new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes should be observed. All mapping
operations are nops for the initial user namespace.
When a file/inode is accessed through an idmapped mount the i_uid and
i_gid of the inode will be remapped according to the user namespace the
mount has been marked with. When a new object is created based on the
fsuid and fsgid of the caller they will similarly be remapped according
to the user namespace of the mount they are created from.
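As a rough illustration of the remapping described above, here is a small userspace model in C. It is a sketch under stated assumptions, not code from the patch series: the type and function names are hypothetical, the extents use the "inside outside count" form of /proc/<pid>/uid_map, and the direction of each mapping (stored ids are read as ids inside the mount's user namespace, the caller's fsuid is mapped back when an object is created) is one reading of the description.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID UINT32_MAX

/* One extent of an idmapping, as in /proc/<pid>/uid_map. */
struct id_extent {
	uint32_t inside;   /* first id as seen inside the user namespace */
	uint32_t outside;  /* first id it corresponds to outside of it */
	uint32_t count;    /* number of consecutive ids covered */
};

/*
 * Hypothetical stand-in for the user namespace a mount is marked with.
 * nr == 0 models the initial user namespace: every mapping is a nop.
 */
struct mnt_idmap_model {
	const struct id_extent *extents;
	size_t nr;
};

/* Map an id from inside the namespace to outside of it. */
uint32_t map_id_down(const struct mnt_idmap_model *m, uint32_t id)
{
	if (m->nr == 0)
		return id;
	for (size_t i = 0; i < m->nr; i++) {
		const struct id_extent *e = &m->extents[i];

		if (id >= e->inside && id - e->inside < e->count)
			return e->outside + (id - e->inside);
	}
	return INVALID_ID; /* no extent covers this id */
}

/* Map an id from outside the namespace to inside of it. */
uint32_t map_id_up(const struct mnt_idmap_model *m, uint32_t id)
{
	if (m->nr == 0)
		return id;
	for (size_t i = 0; i < m->nr; i++) {
		const struct id_extent *e = &m->extents[i];

		if (id >= e->outside && id - e->outside < e->count)
			return e->inside + (id - e->outside);
	}
	return INVALID_ID;
}

/* Owner reported when an inode is reached through the mount (assumed direction). */
uint32_t owner_seen_through_mount(const struct mnt_idmap_model *m,
				  uint32_t stored_uid)
{
	return map_id_down(m, stored_uid);
}

/* Owner stored for an object created by a caller with this fsuid (assumed direction). */
uint32_t owner_stored_for_new_object(const struct mnt_idmap_model *m,
				     uint32_t fsuid)
{
	return map_id_up(m, fsuid);
}
```

In this model, a mount marked with the mapping "0 10000 1000" reports a file stored with uid 0 as owned by 10000, and a file created by a caller with fsuid 10000 is stored with uid 0. The same file reached through a mount that carries the initial user namespace is still reported as uid 0.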
This means the user namespace of the mount needs to be passed down into
a few relevant inode_operations. This mostly includes inode operations
that create filesystem objects or change file attributes.
That's really quite ... messy.
Maybe I'm missing something, but if you have the user_ns to be used
for the VFS operation we are about to execute then why can't we use
the same model as current_fsuid/current_fsgid() for passing the
filesystem credentials down to the filesystem operations? i.e.
attach it to the current->cred->fs_userns, and then the filesystem
code that actually needs to know the current userns can call
current_fs_user_ns() instead of current_user_ns(). i.e.
	#define current_fs_user_ns() \
		(current->cred->fs_userns ? current->cred->fs_userns \
					  : current->cred->user_ns)
At this point, the filesystem will now always have the correct
userns it is supposed to use for mapping the uid/gid, right?
Also, if we are passing work off to worker threads, duplicating
the current creds will capture this information and won't leave
random landmines where stuff doesn't work as it should because the
worker thread is unaware of the userns that it is supposed to be
doing filesystem operations under...
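To make the suggestion concrete, a minimal userspace model of it could look like the following. Everything here is a hypothetical stand-in: the structs are heavily reduced versions of the kernel types, fs_userns is the proposed field and does not exist in struct cred, the accessor takes the cred explicitly because there is no `current` in userspace, and duplicating creds is modelled as a plain struct copy rather than the kernel's reference-counted cred handling.

```c
#include <stddef.h>

/* Reduced stand-in for the kernel's struct user_namespace. */
struct user_namespace {
	const char *name;
};

/* Reduced stand-in for struct cred with the proposed extra field. */
struct cred {
	struct user_namespace *user_ns;   /* namespace the task lives in */
	struct user_namespace *fs_userns; /* proposed: NULL unless set for a VFS operation */
};

/* Model of the suggested current_fs_user_ns(): fall back to user_ns if unset. */
struct user_namespace *cred_fs_user_ns(const struct cred *cred)
{
	return cred->fs_userns ? cred->fs_userns : cred->user_ns;
}

/* Work item that carries its own copy of the submitter's creds. */
struct fs_work {
	struct cred cred;
};

/* Capture the submitter's creds at queueing time. */
struct fs_work queue_fs_work(const struct cred *submitter)
{
	struct fs_work work = { .cred = *submitter };

	return work;
}

/* The worker resolves the namespace from the captured creds, not its own. */
struct user_namespace *run_fs_work(const struct fs_work *work)
{
	return cred_fs_user_ns(&work->cred);
}
```

The point of the model is that the work item keeps resolving to the namespace that was set when it was queued, even if the submitter's fs_userns is cleared again before the worker runs.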
Cheers,
Dave.
--
Dave Chinner
david(a)fromorbit.com