Re: [PATCH 00/34] fs: idmapped mounts

Thursday, 29 October 2020

On Thu, Oct 29, 2020 at 01:32:18AM +0100, Christian Brauner wrote:
...
 Hey everyone,

 I vanished for a little while to focus on this work here so sorry for
 not being available by mail for a while.

 For quite a long time we have had issues with sharing mounts between
 multiple unprivileged containers with different id mappings, sharing a
 rootfs between multiple containers with different id mappings, and also
 sharing regular directories and filesystems between users with different
 uids and gids. The latter use-cases have become even more important with
 the availability and adoption of systemd-homed (cf. [1]) to implement
 portable home directories.

 The solutions we have tried and proposed so far include the introduction
 of fsid mappings, a tiny overlay based filesystem, and an approach to
 call override creds in the vfs. None of these solutions have covered all
 of the above use-cases.

 The solution proposed here has its origins in multiple discussions
 during Linux Plumbers 2017 during and after the end of the containers
 microconference.
 To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
 James, and myself. A variant of the solution proposed here has also been
 discussed, again to the best of my knowledge, after a Linux conference
 in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
 after Linux Plumbers.
 I've taken the time to finally implement a working version of this
 solution over the last weeks to the best of my abilities. Tycho has
 signed up for this slightly crazy endeavour as well and he has helped
 with the conversion of the xattr codepaths.

 The core idea is to make idmappings a property of struct vfsmount
 instead of tying it to a process being inside of a user namespace which
 has been the case for all other proposed approaches.
 It means that idmappings become a property of bind-mounts, i.e. each
 bind-mount can have a separate idmapping. This has the obvious advantage
 that idmapped mounts can be created inside of the initial user
 namespace, i.e. on the host itself instead of requiring the caller to be
 located inside of a user namespace. This enables such use-cases as e.g.
 making a usb stick available in multiple locations with different
 idmappings (see the vfat port that is part of this patch series).

 The vfsmount struct gains a new struct user_namespace member. The
 idmapping of the user namespace becomes the idmapping of the mount. A
 caller that is either privileged with respect to the user namespace of
 the superblock of the underlying filesystem or a caller that is
 privileged with respect to the user namespace a mount has been idmapped
 with can create a new bind-mount and mark it with a user namespace. The
 user namespace the mount will be marked with can be specified by passing
 a file descriptor referring to the user namespace as an argument to the
 new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag.
 By default vfsmounts are marked with the initial user namespace and no
 behavioral or performance changes should be observed. All mapping
 operations are nops for the initial user namespace.
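
For readers who want to see roughly what this looks like from userspace,
here is a minimal sketch. It follows the interface as it later landed
upstream in Linux 5.12 (a struct mount_attr with a userns_fd field, applied
to a detached clone obtained from open_tree()), which may differ from this
RFC. The helper name idmapped_clone() and the local struct sketch_mount_attr
are made up for illustration, and the fallback syscall numbers are the
generic ones, which do not apply to every architecture.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values as in the upstream UAPI headers.
 * The syscall numbers are the generic ones (assumption: not alpha/mips). */
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef MOUNT_ATTR_IDMAP
#define MOUNT_ATTR_IDMAP 0x00100000
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif

/* Local mirror of the upstream struct mount_attr layout (hypothetical name). */
struct sketch_mount_attr {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/*
 * Clone the mount at `path` as a detached tree and mark the clone with the
 * user namespace referred to by `userns_fd`. Returns the fd of the detached
 * mount, or -1 on any error (missing privileges, old kernel, bad path, ...).
 */
int idmapped_clone(const char *path, int userns_fd)
{
	struct sketch_mount_attr attr = {
		.attr_set  = MOUNT_ATTR_IDMAP,
		.userns_fd = (uint64_t)userns_fd,
	};
	int tree_fd;

	tree_fd = (int)syscall(SYS_open_tree, AT_FDCWD, path,
			       OPEN_TREE_CLONE | O_CLOEXEC);
	if (tree_fd < 0)
		return -1;

	if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0) {
		close(tree_fd);
		return -1;
	}
	return tree_fd;
}
```

The returned fd would then be attached somewhere with move_mount(); the
userns_fd would typically come from opening /proc/<pid>/ns/user of a process
in the target user namespace.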

 When a file/inode is accessed through an idmapped mount the i_uid and
 i_gid of the inode will be remapped according to the user namespace the
 mount has been marked with. When a new object is created based on the
 fsuid and fsgid of the caller they will similarly be remapped according
 to the user namespace of the mount they are created from.
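
The remapping described above can be sketched in plain C. This is a toy
model and not code from the patch series: it assumes a uid_map-style extent
table (first id, first mapped id, count), and the names id_extent, idmap,
stored_to_mount() and mount_to_stored() are hypothetical. Which direction
applies on access and which on creation is likewise an assumption made for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One contiguous range: [first, first+count) maps to [lower_first, ...). */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

struct idmap {
	const struct id_extent *extents;
	size_t nr;
};

/* Id stored in the inode -> id seen through the mount (access path). */
uint32_t stored_to_mount(const struct idmap *map, uint32_t id)
{
	for (size_t i = 0; i < map->nr; i++) {
		const struct id_extent *e = &map->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return INVALID_ID; /* no mapping for this id */
}

/* Caller's fsuid/fsgid -> id to store in a new object (creation path). */
uint32_t mount_to_stored(const struct idmap *map, uint32_t id)
{
	for (size_t i = 0; i < map->nr; i++) {
		const struct id_extent *e = &map->extents[i];

		if (id >= e->lower_first && id - e->lower_first < e->count)
			return e->first + (id - e->lower_first);
	}
	return INVALID_ID;
}
```

With a single extent { 0, 100000, 65536 }, a file stored with uid 1000 is
seen as uid 101000 through the mount, and a caller with fsuid 101000 creates
files that are stored with uid 1000. An identity extent
{ 0, 0, 4294967295 } models the initial user namespace, where every lookup
returns its input.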

 This means the user namespace of the mount needs to be passed down into
 a few relevant inode_operations. This mostly includes inode operations
 that create filesystem objects or change file attributes. Some of them
 such as ->getattr() don't even need to change since they pass down a
 struct path and thus the struct vfsmount is already available. Other
 inode operations need to be adapted to pass down the user namespace the
 vfsmount has been marked with. Al was nice enough to point out that he
 will not tolerate struct vfsmount being passed to filesystems and that I
 should pass down the user namespace directly; which is what I did.
 The inode struct itself is never altered whenever the i_uid and i_gid
 need to be mapped, i.e. i_uid and i_gid are only remapped at the time of
 the check. An inode once initialized (during lookup or object creation)
 is never altered when accessed through an idmapped mount.
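
To make the "only remapped at the time of the check" point concrete, here
is a small userspace sketch, not code from the series. The types
sketch_inode and sketch_userns and the function caller_owns() are
hypothetical stand-ins, and the mapping is reduced to a single range for
brevity. The inode is taken by const pointer, so the check cannot modify it.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_UID ((uint32_t)-1)

/* Stand-in for the few inode fields that matter here. */
struct sketch_inode {
	uint32_t i_uid;
};

/* Stand-in for a user namespace: maps [0, count) to [base, base+count). */
struct sketch_userns {
	uint32_t base;
	uint32_t count;
};

/* Compute the uid as seen through the mount; the inode is left untouched. */
static uint32_t mapped_uid(const struct sketch_userns *mnt_userns,
			   const struct sketch_inode *inode)
{
	if (inode->i_uid >= mnt_userns->count)
		return SKETCH_INVALID_UID;
	return mnt_userns->base + inode->i_uid;
}

/* Ownership check: the mapping is applied here and nowhere else. */
bool caller_owns(const struct sketch_userns *mnt_userns,
		 const struct sketch_inode *inode, uint32_t fsuid)
{
	uint32_t uid = mapped_uid(mnt_userns, inode);

	return uid != SKETCH_INVALID_UID && uid == fsuid;
}
```

The same inode can therefore be checked through two mounts with different
mappings and give two different answers, while inode->i_uid stays the same
throughout.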

 To limit the amount of noise in this first iteration we have not changed
 the existing inode operations but rather introduced a few new struct
 inode operation methods such as ->mkdir_mapped which pass down the user
 namespace of the mount they have been called from. Should this solution
 be worth pursuing we have no problem adapting the existing inode
 operations instead.

 In order to support idmapped mounts, filesystems need to be changed and
 mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. In this first
 iteration I tried to illustrate this by changing three different
 filesystems with different levels of complexity. Of course with some bias
 towards urgent use-cases and filesystems I was at least a little more
 familiar with. However, Tycho and I (and others) have no problem
 converting each filesystem one-by-one. This first iteration includes fat
 (msdos and vfat), ext4, and overlayfs (both with idmapped lower and
 upper directories and idmapped merged directories). I'm sure I haven't
 gotten everything right for all three of them in the first version of
 this patch.

Thanks for this patchset. It's been a long time coming.

I'm curious how much the new fs mount APIs help in most cases, and whether
focusing on those could solve the problem for everything other than bind
mounts. Specifically, the idea of doing fsopen (creation of the fs_context)
under the user namespace in question, and relying on a user with
CAP_SYS_ADMIN to call fsmount [1]. I think this is especially valuable for
places like overlayfs that use the entire cred object, as opposed to just the
uid / gid. I imagine that soon, most filesystems will support the new mount
APIs, and not set the global flag if they don't need to.

How popular is the "vfsmount (bind mounts) needs different uid mappings" use 
case?

The other thing I worry about is the "What UID are you really?" game that's
been a thing recently. For example, you can have a different user namespace
UID mapping for your network namespace that netfilter checks [2], a different
one for your mount namespace, and a different one that the process is
actually in. This proliferation of different mappings makes auditing, and
doing things like writing perf tooling, more difficult (since I think
bpf_get_current_uid_gid still uses the initial user namespace [3]).

[1]: https://lore.kernel.org/linux-nfs/20201016123745.9510-4-sargun@sargun.me/...
[2]: https://elixir.bootlin.com/linux/v5.9.1/source/net/netfilter/xt_owner.c#L37
[3]: https://elixir.bootlin.com/linux/v5.9.1/source/kernel/bpf/helpers.c#L196
