On 15/04/24, Eric W. Biederman wrote:
Richard Guy Briggs <rgb(a)redhat.com> writes:
> On 15/04/22, Richard Guy Briggs wrote:
>> On 15/04/20, Eric W. Biederman wrote:
>> > Richard Guy Briggs <rgb(a)redhat.com> writes:
>> >
>> > > The purpose is to track namespace instances in use by logged processes
from the
>> > > perspective of init_*_ns by logging the namespace IDs (device ID and
namespace
>> > > inode - offset).
>> >
>> > In broad strokes the user interface appears correct.
>> >
>> > Things that I see that concern me:
>> >
>> > - After Als most recent changes these inodes no longer live in the proc
>> > superblock so the device number reported in these patches is
>> > incorrect.
>>
>> Ok, found the patchset you're talking about:
>> 3d3d35b kill proc_ns completely
>> e149ed2 take the targets of /proc/*/ns/* symlinks to separate fs
>> f77c801 bury struct proc_ns in fs/proc
>> 33c4294 copy address of proc_ns_ops into ns_common
>> 6344c43 new helpers: ns_alloc_inum/ns_free_inum
>> 6496452 make proc_ns_operations work with struct ns_common * instead of void *
>> 3c04118 switch the rest of proc_ns_operations to working with &...->ns
>> ff24870 netns: switch ->get()/->put()/->install()/->inum() to
working with &net->ns
>> 58be2825 make mntns ->get()/->put()/->install()/->inum() work with
&mnt_ns->ns
>> 435d5f4 common object embedded into various struct ....ns
>>
>> Ok, I've got some minor jigging to do to get inum too...
>
> Do I even need to report the device number anymore since I am concluding
> s_dev is never set (or always zero) in the nsfs filesystem by
> mount_pseudo() and isn't even mountable?
We still need the dev. We do have a device number get_anon_bdev fills it in.
Fine, it has a device number. There appears to be only one of these
allocated per kernel. I can get it from &nsfs->fs_supers (and take the
first instance given by hlist_for_each_entry and verify there are no
others). Why do I need it, again?
> In fact, I never needed to
> report the device since proc ida/idr and inodes are kernel-global and
> namespace-oblivious.
This is the bit I really want to keep to be forward looking. If we
every need to preserve the inode numbers across a migration we could
have different super blocks with different inode numbers for the same
namespace.
I don't quite follow your argument here, but can accept that in the
future we might add other namespace devices. I wonder if we might do
that augmentation later and leave out the device number for now...
>> > - I am nervous about audit logs being flooded with
users creating lots
>> > of namespaces. But that is more your lookout than mine.
>>
>> There was a thought to create a filter to en/disable this logging...
>> It is an auxiliary record to syscalls, so they can be ignored by userspace
tools.
>>
>> > - unshare is not logging when it creates new namespaces.
>>
>> They are all covered:
>> sys_unshare > unshare_userns > create_user_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces >
copy_mnt_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces >
copy_utsname > clone_uts_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces >
copy_ipcs > get_ipc_ns
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces >
copy_pid_ns > create_pid_namespace
>> sys_unshare > unshare_nsproxy_namespaces > create_new_namespaces >
copy_net_ns
Then why the special change to fork? That was not reflected on
the unshare path as far as I could see.
Fork can specify more than one CLONE flag at once, so collecting them
all in one statementn seemed helpful. setns can only set one at a time.
>> > As small numbers are nice and these inodes all live in
their own
>> > superblock now we should be able to remove the games with
>> > PROC_DYNAMIC_FIRST and just use small numbers for these inodes
>> > everywhere.
>>
>> That is compelling if I can untangle the proc inode allocation code from the
>> ida/idr. Should be as easy as defining a new ns_alloc_inum (and ns_free_inum)
>> to use instead of proc_alloc_inum with its own ns_inum_ida and ns_inum_lock,
>> then defining a NS_DYNAMIC_FIRST and defining NS_{IPC,UTS,USER,PID}_INIT_INO in
>> the place of the existing PROC_*_INIT_INO.
Something like that. Just a new ida/idr allocator specific to that
superblock.
Yeah. It is somewhere on my todo, but I have been prioritizing getting
the bugs that look potentially expoloitable fixed in the mount
namespace. Al made things nice for one case but left a mess for a bunch
of others.
>> > I honestly don't know how much we are going to care about namespace
ids
>> > during migration. So far this is not a problem that has come up.
>>
>> Not for CRIU, but it will be an issue for a container auditor that aggregates
>> information from individually auditted hosts.
>>
>> > I don't think migration becomes a practical concern (other than
>> > interface wise) until achieve a non-init namespace auditd. The easy way
>> > to handle migration would be to log a setns of every process from their
>> > old namespaces to their new namespaces. As you appear to have a setns
>> > event defined.
>>
>> Again, this would be taken care of by a layer above that is container-aware
>> across multiple hosts.
>> > How to handle the more general case beyond audit remains unclear. I
>> > think it will be a little while yet before we start dealing with
>> > migrating applications that care. When we do we will either need to
>> > generate some kind of hot-plug event that userspace can respond to and
>> > discover all of the appropriate file-system nodes have changed, or we
>> > will need to build a mechanism in the kernel to preserve these numbers.
>>
>> I don't expect to need to preserve these numbers. The higher layer
application
>> will be able to do that translation.
We need to be very aware of what is happening.
The situation I am concerned about looks something like.
Program A:
fd1 = open(/proc/self/ns/net);
fstat(fd1, &stat1)
... later ...
fd2 = open(/var/run/netns/johnny);
fstat(fd2, &stat2);
if ((stat1.st_dev == stat2.st_dev) &&
(stat1.st_ino == stat2.st_ino)) {
/* Same netns do something... */
}
What happens when we migrate Program A with it's cached stat data of
of a network namespace file?
This requires either a hotplug event that Program A listens to or that
the inode number and device number are preserved across migration.
Exactly what we do depends on where we are when it comes up. But this
is not something some layer about the program can abstract it all out so
we don't need to worry about it.
Ok, understood, we can't just punt this one to a higher layer...
So this comes back to a question above, which is how do we determine
which device it is from? Sounds like we need something added to
ns_common or one of the 6 namespace types structs.
Eric
- RGB
--
Richard Guy Briggs <rbriggs(a)redhat.com>
Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545