[PATCH] audit: always enable syscall auditing when supported and audit is enabled
by Paul Moore
To the best of our knowledge, everyone who enables audit at compile
time also enables syscall auditing; this patch simplifies the Kconfig
menus by removing the option to disable syscall auditing when audit
is selected and the target arch supports it.
Signed-off-by: Paul Moore <pmoore(a)redhat.com>
---
init/Kconfig | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)
diff --git a/init/Kconfig b/init/Kconfig
index c24b6f7..d4663b1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -299,20 +299,15 @@ config AUDIT
help
Enable auditing infrastructure that can be used with another
kernel subsystem, such as SELinux (which requires this for
- logging of avc messages output). Does not do system-call
- auditing without CONFIG_AUDITSYSCALL.
+ logging of avc messages output). System call auditing is included
+ on architectures which support it.
config HAVE_ARCH_AUDITSYSCALL
bool
config AUDITSYSCALL
- bool "Enable system-call auditing support"
+ def_bool y
depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
- default y if SECURITY_SELINUX
- help
- Enable low-overhead system-call auditing infrastructure that
- can be used independently or with another kernel subsystem,
- such as SELinux.
config AUDIT_WATCH
def_bool y
[PATCH ALT4 V3 1/2] audit: show fstype:pathname for entries with anonymous parents
by Richard Guy Briggs
Tracefs or debugfs were causing hundreds to thousands of null PATH
records to be associated with the init_module and finit_module SYSCALL
records on a few modules when the following rule was in place for
startup:
-a always,exit -F arch=x86_64 -S init_module -F key=mod-load
This happens because the parent inode is not found in the task's
audit_names list and is hence treated as anonymous. This gives us no
information other than a numerical device number that may no longer be
visible upon log inspection, and an inode number.
Fill in the filesystem type, filesystem magic number and full pathname
from the filesystem mount point on previously null PATH records from
entries that have an anonymous parent from the child dentry using
dentry_path_raw().
Make the dentry argument of __audit_inode_child() non-const so that we
can take a reference to it in the case of an anonymous parent with
dget() and dget_parent() to be able to later print a partial path from
the host filesystem rather than null.
Since all we are given is the inode of the parent and the dentry of the
child, finding the path from the mount point to the root of the
filesystem is more challenging: it would involve searching all
vfsmounts from "/" until a matching dentry is found for that
filesystem's root dentry. Even if one is found, there may be more than
one mount point. At this point the gain seems marginal, since
knowing the filesystem type and path is already a significant help in
tracking down the source of the PATH records and being able to address them.
Sample output:
type=PROCTITLE msg=audit(1488317694.446:143): proctitle=2F7362696E2F6D6F6470726F6265002D71002D2D006E66737634
type=PATH msg=audit(1488317694.446:143): item=797 name=tracefs(74726163):/events/nfs4/nfs4_setclientid/format inode=15969 dev=00:09 mode=0100444 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tracefs_t:s0 nametype=CREATE
type=PATH msg=audit(1488317694.446:143): item=796 name=tracefs(74726163):/events/nfs4/nfs4_setclientid inode=15964 dev=00:09 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tracefs_t:s0 nametype=PARENT
...
type=PATH msg=audit(1488317694.446:143): item=1 name=tracefs(74726163):/events/nfs4 inode=15571 dev=00:09 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tracefs_t:s0 nametype=CREATE
type=PATH msg=audit(1488317694.446:143): item=0 name=tracefs(74726163):/events inode=119 dev=00:09 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:tracefs_t:s0 nametype=PARENT
type=UNKNOWN[1330] msg=audit(1488317694.446:143): name="nfsv4"
type=SYSCALL msg=audit(1488317694.446:143): arch=c000003e syscall=313 success=yes exit=0 a0=1 a1=55d5a35ce106 a2=0 a3=1 items=798 ppid=6 pid=528 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="modprobe" exe="/usr/bin/kmod" subj=system_u:system_r:insmod_t:s0 key="mod-load"
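The proctitle field in the sample above is hex-encoded, with NUL bytes separating the arguments. A minimal sketch of decoding it, for readers who want to check which command triggered the records, follows. The function name decode_hex_field() is hypothetical and not part of the audit tools, which do their own decoding when interpreting records.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Decode a hex-encoded audit field (e.g. proctitle=...) into out.
 * NUL separators between argv entries become spaces so the result
 * prints as one command line.  Returns 0 on success, -1 on malformed
 * input or if out is too small.
 */
int decode_hex_field(const char *hex, char *out, size_t outlen)
{
    size_t n, i;

    assert(hex != NULL && out != NULL);
    n = strlen(hex);
    if (n % 2 != 0 || n / 2 + 1 > outlen)
        return -1;
    for (i = 0; i < n / 2; i++) {
        int hi = hex_nibble(hex[2 * i]);
        int lo = hex_nibble(hex[2 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (char)((hi << 4) | lo);
        if (out[i] == '\0')
            out[i] = ' ';
    }
    out[n / 2] = '\0';
    return 0;
}
```

Applied to the proctitle value above, this yields "/sbin/modprobe -q -- nfsv4", which matches the name="nfsv4" record in the same event.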
See: https://github.com/linux-audit/audit-kernel/issues/8
Test case: https://github.com/linux-audit/audit-testsuite/issues/42
Signed-off-by: Richard Guy Briggs <rgb(a)redhat.com>
---
v3:
fix audit_buffer leak and dname error allocation leak in audit_log_name
only put audit_name->dentry if it is being replaced
v2:
minor cosmetic changes and support fs filter patch
---
include/linux/audit.h | 8 ++++----
kernel/audit.c | 19 +++++++++++++++++++
kernel/audit.h | 1 +
kernel/auditsc.c | 8 +++++++-
4 files changed, 31 insertions(+), 5 deletions(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 2150bdc..1ef4ec8 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -240,7 +240,7 @@ extern void __audit_inode(struct filename *name, const struct dentry *dentry,
unsigned int flags);
extern void __audit_file(const struct file *);
extern void __audit_inode_child(struct inode *parent,
- const struct dentry *dentry,
+ struct dentry *dentry,
const unsigned char type);
extern void __audit_seccomp(unsigned long syscall, long signr, int code);
extern void __audit_ptrace(struct task_struct *t);
@@ -305,7 +305,7 @@ static inline void audit_inode_parent_hidden(struct filename *name,
AUDIT_INODE_PARENT | AUDIT_INODE_HIDDEN);
}
static inline void audit_inode_child(struct inode *parent,
- const struct dentry *dentry,
+ struct dentry *dentry,
const unsigned char type) {
if (unlikely(!audit_dummy_context()))
__audit_inode_child(parent, dentry, type);
@@ -486,7 +486,7 @@ static inline void __audit_inode(struct filename *name,
unsigned int flags)
{ }
static inline void __audit_inode_child(struct inode *parent,
- const struct dentry *dentry,
+ struct dentry *dentry,
const unsigned char type)
{ }
static inline void audit_inode(struct filename *name,
@@ -500,7 +500,7 @@ static inline void audit_inode_parent_hidden(struct filename *name,
const struct dentry *dentry)
{ }
static inline void audit_inode_child(struct inode *parent,
- const struct dentry *dentry,
+ struct dentry *dentry,
const unsigned char type)
{ }
static inline void audit_core_dumps(long signr)
diff --git a/kernel/audit.c b/kernel/audit.c
index 59e60e0..d6e6e4e 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -72,6 +72,7 @@
#include <linux/freezer.h>
#include <linux/pid_namespace.h>
#include <net/netns/generic.h>
+#include <linux/dcache.h>
#include "audit.h"
@@ -2047,6 +2048,10 @@ void audit_copy_inode(struct audit_names *name, const struct dentry *dentry,
name->gid = inode->i_gid;
name->rdev = inode->i_rdev;
security_inode_getsecid(inode, &name->osid);
+ if (name->dentry) {
+ dput(name->dentry);
+ name->dentry = NULL;
+ }
audit_copy_fcaps(name, dentry);
}
@@ -2088,6 +2093,20 @@ void audit_log_name(struct audit_context *context, struct audit_names *n,
audit_log_n_untrustedstring(ab, n->name->name,
n->name_len);
}
+ } else if (n->dentry) {
+ char *fullpath;
+ const char *fullpathp = NULL;
+
+ fullpath = kmalloc(PATH_MAX, GFP_KERNEL);
+ if (fullpath)
+ fullpathp = dentry_path_raw(n->dentry, fullpath, PATH_MAX);
+ if (IS_ERR(fullpathp)) {
+ fullpathp = NULL;
+ kfree(fullpath);
+ }
+ audit_log_format(ab, " name=%s(0x%lx):%s",
+ n->dentry->d_sb->s_type->name ?: "?",
+ n->dentry->d_sb->s_magic, fullpathp ?: "?");
} else
audit_log_format(ab, " name=(null)");
diff --git a/kernel/audit.h b/kernel/audit.h
index b331d9b..c01defb 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -85,6 +85,7 @@ struct audit_names {
unsigned long ino;
dev_t dev;
+ struct dentry *dentry;
umode_t mode;
kuid_t uid;
kgid_t gid;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 4a42db5..11848df 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -74,6 +74,7 @@
#include <linux/string.h>
#include <linux/uaccess.h>
#include <uapi/linux/limits.h>
+#include <linux/dcache.h>
#include "audit.h"
@@ -881,6 +882,8 @@ static inline void audit_free_names(struct audit_context *context)
list_del(&n->list);
if (n->name)
putname(n->name);
+ if (n->dentry)
+ dput(n->dentry);
if (n->should_free)
kfree(n);
}
@@ -1861,7 +1864,7 @@ void __audit_file(const struct file *file)
* unsuccessful attempts.
*/
void __audit_inode_child(struct inode *parent,
- const struct dentry *dentry,
+ struct dentry *dentry,
const unsigned char type)
{
struct audit_context *context = current->audit_context;
@@ -1917,6 +1920,7 @@ void __audit_inode_child(struct inode *parent,
if (!n)
return;
audit_copy_inode(n, NULL, parent);
+ n->dentry = dget_parent(dentry);
}
if (!found_child) {
@@ -1938,6 +1942,8 @@ void __audit_inode_child(struct inode *parent,
audit_copy_inode(found_child, dentry, inode);
else
found_child->ino = AUDIT_INO_UNSET;
+ if (!found_parent)
+ found_child->dentry = dget(dentry);
}
EXPORT_SYMBOL_GPL(__audit_inode_child);
--
1.7.1
[PATCH V3] filter: add filesystem filter with fstype
by Richard Guy Briggs
Tracefs or debugfs were causing hundreds to thousands of PATH records to
be associated with the init_module and finit_module SYSCALL records on a
few modules when the following rule was in place for startup:
-a always,exit -F arch=x86_64 -S init_module -F key=mod-load
Add the new "filesystem" filter list anchored in __audit_inode_child() to
filter out PATH records from uninteresting filesystem types, "fstype",
keying on their kernel hexadecimal 4-octet magic identifier.
An example rule would look like:
-a never,filesystem -F fstype=0x74726163 -F key=ignore_tracefs
-a never,filesystem -F fstype=0x64626720 -F key=ignore_debugfs
Note: "always,filesystem" will log the PATH record anyway and add latency.
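The fstype values in the example rules are the filesystems' 4-octet magic numbers, and for these two filesystems the octets happen to be printable ASCII. A small sketch that unpacks a magic number into its octets shows where the values come from. The helper name fs_magic_to_ascii() is hypothetical and is not part of libaudit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Unpack a 32-bit filesystem magic number into its four octets,
 * most significant first, and NUL-terminate the result.
 * out must have room for 5 bytes.
 */
void fs_magic_to_ascii(uint32_t magic, char out[5])
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 4; i++)
        out[i] = (char)((magic >> (24 - 8 * i)) & 0xff);
    out[4] = '\0';
}
```

With the values from the rules above, 0x74726163 unpacks to "trac" (tracefs) and 0x64626720 to "dbg " with a trailing space (debugfs). Other filesystems' magic numbers are not necessarily printable, so the hexadecimal form remains the reliable way to write the rule.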
See: https://github.com/linux-audit/audit-kernel/issues/8
See: https://github.com/linux-audit/audit-userspace/issues/15
Test case: https://github.com/linux-audit/audit-testsuite/issues/42
Signed-off-by: Richard Guy Briggs <rgb(a)redhat.com>
---
v3:
Update feature bitmap macros to reflect filter name change.
v2:
Change filter name from "path" to "filesystem".
Rebase onto other patches accepted upstream.
docs/audit_add_rule_data.3 | 3 +++
lib/errormsg.h | 5 +++++
lib/fieldtab.h | 2 ++
lib/flagtab.h | 10 ++++++----
lib/libaudit.c | 26 ++++++++++++++++++++++++--
lib/libaudit.h | 10 ++++++++++
lib/private.h | 1 +
src/auditctl-listing.c | 6 ++++--
src/auditctl.c | 16 ++++++++++++++--
9 files changed, 69 insertions(+), 10 deletions(-)
diff --git a/docs/audit_add_rule_data.3 b/docs/audit_add_rule_data.3
index a0802c0..1e7540c 100644
--- a/docs/audit_add_rule_data.3
+++ b/docs/audit_add_rule_data.3
@@ -22,6 +22,9 @@ AUDIT_FILTER_EXIT - Apply rule at syscall exit. This is the main filter that is
.TP
\(bu
AUDIT_FILTER_TYPE - Apply rule at audit_log_start. This is the exclude filter which discards any records that match.
+.TP
+\(bu
+AUDIT_FILTER_FS - Apply rule when adding PATH auxiliary records to SYSCALL events. This is the filesystem filter. This is used to ignore PATH records that are not of interest.
.LP
.PP
diff --git a/lib/errormsg.h b/lib/errormsg.h
index 91d8252..ef54589 100644
--- a/lib/errormsg.h
+++ b/lib/errormsg.h
@@ -20,6 +20,7 @@
* Authors:
* Zhang Xiliang <zhangxiliang(a)cn.fujitsu.com>
* Steve Grubb <sgrubb(a)redhat.com>
+ * Richard Guy Briggs <rgb(a)redhat.com>
*/
struct msg_tab {
@@ -66,6 +67,8 @@ struct msg_tab {
#define EAU_FIELDNOFILTER 31
#define EAU_FILTERMISSING 32
#define EAU_COMPINCOMPAT 33
+#define EAU_FIELDUNAVAIL 34
+#define EAU_FILTERNOSUPPORT 35
static const struct msg_tab err_msgtab[] = {
{ -EAU_OPMISSING, 2, "-F missing operation for" },
{ -EAU_FIELDUNKNOWN, 2, "-F unknown field:" },
@@ -100,5 +103,7 @@ static const struct msg_tab err_msgtab[] = {
{ -EAU_FIELDNOFILTER, 1, "must be used with exclude, user, or exit filter" },
{ -EAU_FILTERMISSING, 0, "filter is missing from rule" },
{ -EAU_COMPINCOMPAT, 2, "-C incompatible comparison" },
+ { -EAU_FIELDUNAVAIL, 1, "field is not valid for the filter" },
+ { -EAU_FILTERNOSUPPORT, 1, "filter is not supported by kernel" },
};
#endif
diff --git a/lib/fieldtab.h b/lib/fieldtab.h
index 0c5e39d..c425d5b 100644
--- a/lib/fieldtab.h
+++ b/lib/fieldtab.h
@@ -18,6 +18,7 @@
*
* Authors:
* Steve Grubb <sgrubb(a)redhat.com>
+ * Richard Guy Briggs <rgb(a)redhat.com>
*/
_S(AUDIT_PID, "pid" )
@@ -56,6 +57,7 @@ _S(AUDIT_WATCH, "path" )
_S(AUDIT_PERM, "perm" )
_S(AUDIT_DIR, "dir" )
_S(AUDIT_FILETYPE, "filetype" )
+_S(AUDIT_FSTYPE, "fstype" )
_S(AUDIT_OBJ_UID, "obj_uid" )
_S(AUDIT_OBJ_GID, "obj_gid" )
_S(AUDIT_FIELD_COMPARE, "field_compare" )
diff --git a/lib/flagtab.h b/lib/flagtab.h
index 4b04692..7a618e0 100644
--- a/lib/flagtab.h
+++ b/lib/flagtab.h
@@ -18,8 +18,10 @@
*
* Authors:
* Steve Grubb <sgrubb(a)redhat.com>
+ * Richard Guy Briggs <rgb(a)redhat.com>
*/
-_S(AUDIT_FILTER_TASK, "task" )
-_S(AUDIT_FILTER_EXIT, "exit" )
-_S(AUDIT_FILTER_USER, "user" )
-_S(AUDIT_FILTER_EXCLUDE, "exclude" )
+_S(AUDIT_FILTER_TASK, "task" )
+_S(AUDIT_FILTER_EXIT, "exit" )
+_S(AUDIT_FILTER_USER, "user" )
+_S(AUDIT_FILTER_EXCLUDE, "exclude" )
+_S(AUDIT_FILTER_FS, "filesystem")
diff --git a/lib/libaudit.c b/lib/libaudit.c
index 18cd384..58134a2 100644
--- a/lib/libaudit.c
+++ b/lib/libaudit.c
@@ -19,6 +19,7 @@
* Authors:
* Steve Grubb <sgrubb(a)redhat.com>
* Rickard E. (Rik) Faith <faith(a)redhat.com>
+ * Richard Guy Briggs <rgb(a)redhat.com>
*/
#include "config.h"
@@ -85,6 +86,7 @@ int _audit_permadded = 0;
int _audit_archadded = 0;
int _audit_syscalladded = 0;
int _audit_exeadded = 0;
+int _audit_filterfsadded = 0;
unsigned int _audit_elf = 0U;
static struct libaudit_conf config;
@@ -1466,6 +1468,23 @@ int audit_rule_fieldpair_data(struct audit_rule_data **rulep, const char *pair,
}
}
+ /* FS filter can be used only with FSTYPE field */
+ if (flags == AUDIT_FILTER_FS) {
+ uint32_t features = audit_get_features();
+ if ((features & AUDIT_FEATURE_BITMAP_FILTER_FS) == 0) {
+ return -EAU_FILTERNOSUPPORT;
+ } else {
+ switch(field) {
+ case AUDIT_FSTYPE:
+ _audit_filterfsadded = 1;
+ case AUDIT_FILTERKEY:
+ break;
+ default:
+ return -EAU_FIELDUNAVAIL;
+ }
+ }
+ }
+
rule->fields[rule->field_count] = field;
rule->fieldflags[rule->field_count] = op;
switch (field)
@@ -1580,7 +1599,8 @@ int audit_rule_fieldpair_data(struct audit_rule_data **rulep, const char *pair,
}
if (field == AUDIT_FILTERKEY &&
!(_audit_syscalladded || _audit_permadded ||
- _audit_exeadded))
+ _audit_exeadded ||
+ _audit_filterfsadded))
return -EAU_KEYDEP;
vlen = strlen(v);
if (field == AUDIT_FILTERKEY &&
@@ -1715,7 +1735,7 @@ int audit_rule_fieldpair_data(struct audit_rule_data **rulep, const char *pair,
return -EAU_EXITONLY;
/* fallthrough */
default:
- if (field == AUDIT_INODE) {
+ if (field == AUDIT_INODE || field == AUDIT_FSTYPE) {
if (!(op == AUDIT_NOT_EQUAL ||
op == AUDIT_EQUAL))
return -EAU_OPEQNOTEQ;
@@ -1727,6 +1747,8 @@ int audit_rule_fieldpair_data(struct audit_rule_data **rulep, const char *pair,
if (!isdigit((char)*(v)))
return -EAU_FIELDVALNUM;
+ if (field == AUDIT_FSTYPE && flags != AUDIT_FILTER_FS)
+ return -EAU_FIELDUNAVAIL;
if (field == AUDIT_INODE)
rule->values[rule->field_count] =
strtoul(v, NULL, 0);
diff --git a/lib/libaudit.h b/lib/libaudit.h
index e5c7a4d..70646cd 100644
--- a/lib/libaudit.h
+++ b/lib/libaudit.h
@@ -277,6 +277,9 @@ extern "C" {
#define AUDIT_KEY_SEPARATOR 0x01
/* These are used in filter control */
+#ifndef AUDIT_FILTER_FS
+#define AUDIT_FILTER_FS 0x06 /* FS record filter in __audit_inode_child */
+#endif
#define AUDIT_FILTER_EXCLUDE AUDIT_FILTER_TYPE
#define AUDIT_FILTER_MASK 0x07 /* Mask to get actual filter */
#define AUDIT_FILTER_UNSET 0x80 /* This value means filter is unset */
@@ -305,6 +308,9 @@ extern "C" {
#ifndef AUDIT_FEATURE_BITMAP_LOST_RESET
#define AUDIT_FEATURE_BITMAP_LOST_RESET 0x00000020
#endif
+#ifndef AUDIT_FEATURE_BITMAP_FILTER_FS
+#define AUDIT_FEATURE_BITMAP_FILTER_FS 0x00000040
+#endif
/* Defines for interfield comparison update */
#ifndef AUDIT_OBJ_UID
@@ -324,6 +330,10 @@ extern "C" {
#define AUDIT_SESSIONID 25
#endif
+#ifndef AUDIT_FSTYPE
+#define AUDIT_FSTYPE 26
+#endif
+
#ifndef AUDIT_COMPARE_UID_TO_OBJ_UID
#define AUDIT_COMPARE_UID_TO_OBJ_UID 1
#endif
diff --git a/lib/private.h b/lib/private.h
index cde1906..bd5e8b3 100644
--- a/lib/private.h
+++ b/lib/private.h
@@ -139,6 +139,7 @@ extern int _audit_permadded;
extern int _audit_archadded;
extern int _audit_syscalladded;
extern int _audit_exeadded;
+extern int _audit_filterfsadded;
extern unsigned int _audit_elf;
#ifdef __cplusplus
diff --git a/src/auditctl-listing.c b/src/auditctl-listing.c
index 3bc8e71..50bc0b8 100644
--- a/src/auditctl-listing.c
+++ b/src/auditctl-listing.c
@@ -91,7 +91,8 @@ static int is_watch(const struct audit_rule_data *r)
if (((r->flags & AUDIT_FILTER_MASK) != AUDIT_FILTER_USER) &&
((r->flags & AUDIT_FILTER_MASK) != AUDIT_FILTER_TASK) &&
- ((r->flags & AUDIT_FILTER_MASK) != AUDIT_FILTER_EXCLUDE)) {
+ ((r->flags & AUDIT_FILTER_MASK) != AUDIT_FILTER_EXCLUDE) &&
+ ((r->flags & AUDIT_FILTER_MASK) != AUDIT_FILTER_FS)) {
for (i = 0; i < (AUDIT_BITMASK_SIZE-1); i++) {
if (r->mask[i] != (uint32_t)~0) {
all = 0;
@@ -139,7 +140,8 @@ static int print_syscall(const struct audit_rule_data *r, unsigned int *sc)
/* Rules on the following filters do not take a syscall */
if (((r->flags & AUDIT_FILTER_MASK) == AUDIT_FILTER_USER) ||
((r->flags & AUDIT_FILTER_MASK) == AUDIT_FILTER_TASK) ||
- ((r->flags &AUDIT_FILTER_MASK) == AUDIT_FILTER_EXCLUDE))
+ ((r->flags &AUDIT_FILTER_MASK) == AUDIT_FILTER_EXCLUDE) ||
+ ((r->flags &AUDIT_FILTER_MASK) == AUDIT_FILTER_FS))
return 0;
/* See if its all or specific syscalls */
diff --git a/src/auditctl.c b/src/auditctl.c
index 04765f4..b99c957 100644
--- a/src/auditctl.c
+++ b/src/auditctl.c
@@ -19,6 +19,7 @@
* Authors:
* Steve Grubb <sgrubb(a)redhat.com>
* Rickard E. (Rik) Faith <faith(a)redhat.com>
+ * Richard Guy Briggs <rgb(a)redhat.com>
*/
#include "config.h"
@@ -74,6 +75,7 @@ static int reset_vars(void)
_audit_permadded = 0;
_audit_archadded = 0;
_audit_exeadded = 0;
+ _audit_filterfsadded = 0;
_audit_elf = 0;
add = AUDIT_FILTER_UNSET;
del = AUDIT_FILTER_UNSET;
@@ -151,6 +153,8 @@ static int lookup_filter(const char *str, int *filter)
*filter = AUDIT_FILTER_EXIT;
else if (strcmp(str, "user") == 0)
*filter = AUDIT_FILTER_USER;
+ else if (strcmp(str, "filesystem") == 0)
+ *filter = AUDIT_FILTER_FS;
else if (strcmp(str, "exclude") == 0) {
*filter = AUDIT_FILTER_EXCLUDE;
exclude = 1;
@@ -760,6 +764,13 @@ static int setopt(int count, int lineno, char *vars[])
audit_msg(LOG_ERR,
"Error: syscall auditing being added to user list");
return -1;
+ } else if (((add & (AUDIT_FILTER_MASK|AUDIT_FILTER_UNSET)) ==
+ AUDIT_FILTER_FS || (del &
+ (AUDIT_FILTER_MASK|AUDIT_FILTER_UNSET)) ==
+ AUDIT_FILTER_FS)) {
+ audit_msg(LOG_ERR,
+ "Error: syscall auditing being added to filesystem list");
+ return -1;
} else if (exclude) {
audit_msg(LOG_ERR,
"Error: syscall auditing cannot be put on exclude list");
@@ -936,8 +947,9 @@ static int setopt(int count, int lineno, char *vars[])
break;
case 'k':
if (!(_audit_syscalladded || _audit_permadded ||
- _audit_exeadded) || (add==AUDIT_FILTER_UNSET &&
- del==AUDIT_FILTER_UNSET)) {
+ _audit_exeadded ||
+ _audit_filterfsadded) ||
+ (add==AUDIT_FILTER_UNSET && del==AUDIT_FILTER_UNSET)) {
audit_msg(LOG_ERR,
"key option needs a watch or syscall given prior to it");
retval = -1;
--
1.7.1
ausearch --text : missing information
by Maupertuis Philippe
Hi,
I was toying with the audit PCI configuration.
I opened a root session with sudo in which I just typed C-r nss to retrieve the command "less /etc/nsswitch.conf" from the bash_history.
The text format, as shown below, does not handle the tty_audit information correctly.
Is this a known limitation?
Ausearch format text
On yppcil51s.sys.meshcore.net at 10:23:34 21/08/17 fr18358, acting as root, successfully changed-identity-of /usr/bin/sudo using setresuid
On yppcil51s.sys.meshcore.net at 10:24:08 21/08/17 fr18358, acting as root, typed
On yppcil51s.sys.meshcore.net at 10:24:08 21/08/17 fr18358, acting as root, did-unknown
On yppcil51s.sys.meshcore.net at 10:24:14 21/08/17 fr18358, acting as root, successfully ended-session /dev/pts/0
Ausearch -I format raw
----
node=yppcil51s.sys.meshcore.net type=PROCTITLE msg=audit(21/08/17 10:23:34.400:20501) : proctitle=sudo -i
node=yppcil51s.sys.meshcore.net type=SYSCALL msg=audit(21/08/17 10:23:34.400:20501) : arch=x86_64 syscall=setresuid success=yes exit=0 a0=root a1=root a2=root a3=0x7fab09de8300 items=0 ppid=20742 pid=20743 auid=fr18358 uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=pts0 ses=1287 comm=sudo exe=/usr/bin/sudo key=10.2.5.b-elevated-privs-session
----
node=yppcil51s.sys.meshcore.net type=USER_TTY msg=audit(21/08/17 10:24:08.661:20503) : pid=20743 uid=root auid=fr18358 ses=1287 data="less /etc/nsswitch.conf"
----
node=yppcil51s.sys.meshcore.net type=TTY msg=audit(21/08/17 10:24:08.661:20502) : tty pid=20743 uid=root auid=fr18358 ses=1287 major=136 minor=0 comm=bash data=<^R>,"nss",<ret>
----
node=yppcil51s.sys.meshcore.net type=USER_END msg=audit(21/08/17 10:24:14.479:20506) : pid=20742 uid=root auid=fr18358 ses=1287 msg='op=PAM:session_close grantors=pam_keyinit,pam_limits acct=root exe=/usr/bin/sudo hostname=? addr=? terminal=/dev/pts/0 res=success'
ausearch format raw
node=yppcil51s.sys.meshcore.net type=SYSCALL msg=audit(1503303814.394:20497): arch=c000003e syscall=117 success=yes exit=0 a0=0 a1=ffffffff a2=ffffffff a3=7fab09de8300 items=0 ppid=20717 pid=20742 auid=3318358 uid=0 gid=20599 euid=0 suid=0 fsuid=0 egid=20599 sgid=20599 fsgid=20599 tty=pts0 ses=1287 comm="sudo" exe="/usr/bin/sudo" key="10.2.5.b-elevated-privs-session"ARCH=x86_64 SYSCALL=setresuid AUID="fr18358" UID="root" GID="nobody" EUID="root" SUID="root" FSUID="root" EGID="nobody" SGID="nobody" FSGID="nobody"
node=yppcil51s.sys.meshcore.net type=PROCTITLE msg=audit(1503303814.394:20497): proctitle=7375646F002D69
node=yppcil51s.sys.meshcore.net type=SYSCALL msg=audit(1503303814.400:20501): arch=c000003e syscall=117 success=yes exit=0 a0=0 a1=0 a2=0 a3=7fab09de8300 items=0 ppid=20742 pid=20743 auid=3318358 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=1287 comm="sudo" exe="/usr/bin/sudo" key="10.2.5.b-elevated-privs-session"ARCH=x86_64 SYSCALL=setresuid AUID="fr18358" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
node=yppcil51s.sys.meshcore.net type=PROCTITLE msg=audit(1503303814.400:20501): proctitle=7375646F002D69
node=yppcil51s.sys.meshcore.net type=USER_TTY msg=audit(1503303848.661:20503): pid=20743 uid=0 auid=3318358 ses=1287 data=6C657373202F6574632F6E737377697463682E636F6E66UID="root" AUID="fr18358"
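Records belong to the same event when the timestamp and serial number inside msg=audit(...) match, which is how the USER_TTY (serial 20503) and TTY (serial 20502) records above can be told apart as separate events. The following is a minimal sketch of extracting that identifier from a raw record. The struct and function names are hypothetical and are not the auparse API; only the raw (uninterpreted) timestamp form is handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Identifier shared by all records of one audit event. */
struct audit_event_id {
    unsigned long sec;     /* seconds since the epoch */
    unsigned int milli;    /* milliseconds */
    unsigned long serial;  /* event serial number */
};

/*
 * Find "msg=audit(SEC.MILLI:SERIAL)" in a raw record and fill in id.
 * Returns 0 on success, -1 if the record has no raw-format identifier
 * (for example, when it was printed with interpreted timestamps).
 */
int parse_event_id(const char *record, struct audit_event_id *id)
{
    const char *p;

    assert(record != NULL && id != NULL);
    p = strstr(record, "msg=audit(");
    if (p == NULL)
        return -1;
    if (sscanf(p, "msg=audit(%lu.%u:%lu)",
               &id->sec, &id->milli, &id->serial) != 3)
        return -1;
    return 0;
}
```

For the raw USER_TTY record above this gives sec=1503303848, milli=661 and serial=20503.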
Additionally, I noticed that ausearch -f /etc/nsswitch.conf doesn't return anything.
It may be working as expected, but I doubt it would be very usable for finding out who fiddled with a file.
Has anyone on the list successfully used the PCI rules in an actual PCI audit?
Philippe
[PATCH] capabilities: add field names for ambient capabilities
by Richard Guy Briggs
Linux kernel capabilities were augmented to include ambient capabilities in
v4.3 commit 58319057b784 ("capabilities: ambient capabilities").
Add interpretation types for cap_pa, old_pa, pa.
The record contains fields "old_pp", "old_pi", "old_pe", "new_pp",
"new_pi", "new_pe" so in keeping with the previous record
normalizations, change the "new_p*" variants to simply drop the "new_"
prefix.
A sample of the replaced BPRM_FCAPS record:
RAW: type=BPRM_FCAPS msg=audit(1491468034.252:237): fver=2 fp=0000000000200000 fi=0000000000000000 fe=1 old_pp=0000000000000000 old_pi=0000000000000000 old_pe=0000000000000000 old_pa=0000000000000000 pp=0000000000200000 pi=0000000000000000 pe=0000000000200000 pa=0000000000000000
INTERPRET: type=BPRM_FCAPS msg=audit(04/06/2017 04:40:34.252:237) : fver=2 fp=sys_admin fi=none fe=chown old_pp=none old_pi=none old_pe=none old_pa=none pp=sys_admin pi=none pe=sys_admin pa=none
A sample of the replaced CAPSET record:
RAW: type=CAPSET msg=audit(1491469502.371:242): pid=833 cap_pi=0000003fffffffff cap_pp=0000003fffffffff cap_pe=0000003fffffffff cap_pa=0000000000000000
INTERPRET: type=CAPSET msg=audit(04/06/2017 05:05:02.371:242) : pid=833 \
cap_pi=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read \
cap_pp=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read \
cap_pe=chown,dac_override,dac_read_search,fowner,fsetid,kill,setgid,setuid,setpcap,linux_immutable,net_bind_service,net_broadcast,net_admin,net_raw,ipc_lock,ipc_owner,sys_module,sys_rawio,sys_chroot,sys_ptrace,sys_pacct,sys_admin,sys_boot,sys_nice,sys_resource,sys_time,sys_tty_config,mknod,lease,audit_write,audit_control,setfcap,mac_override,mac_admin,syslog,wake_alarm,block_suspend,audit_read \
cap_pa=none
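To illustrate what interpreting these fields as a capability bitmap means, here is a minimal sketch that maps set bits to capability names. The name table is copied from the interpreted CAPSET sample above (bit 0 is chown, bit 37 is audit_read); the function name cap_bitmap_to_names() is hypothetical and this is not the auparse implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Capability names indexed by bit number, as listed in the sample. */
static const char *const cap_names[] = {
    "chown", "dac_override", "dac_read_search", "fowner", "fsetid",
    "kill", "setgid", "setuid", "setpcap", "linux_immutable",
    "net_bind_service", "net_broadcast", "net_admin", "net_raw",
    "ipc_lock", "ipc_owner", "sys_module", "sys_rawio", "sys_chroot",
    "sys_ptrace", "sys_pacct", "sys_admin", "sys_boot", "sys_nice",
    "sys_resource", "sys_time", "sys_tty_config", "mknod", "lease",
    "audit_write", "audit_control", "setfcap", "mac_override",
    "mac_admin", "syslog", "wake_alarm", "block_suspend", "audit_read",
};

#define CAP_COUNT (sizeof(cap_names) / sizeof(cap_names[0]))

/*
 * Write a comma-separated list of the capabilities set in bitmap,
 * or "none" if no known bit is set.  Bits beyond the table are
 * ignored in this sketch.  Returns 0 on success, -1 if out is too
 * small.
 */
int cap_bitmap_to_names(uint64_t bitmap, char *out, size_t outlen)
{
    size_t used = 0, i;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';
    for (i = 0; i < CAP_COUNT; i++) {
        int n;

        if (!(bitmap & ((uint64_t)1 << i)))
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     used ? "," : "", cap_names[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;
        used += (size_t)n;
    }
    if (used == 0) {
        if (outlen < sizeof("none"))
            return -1;
        strcpy(out, "none");
    }
    return 0;
}
```

For example, pp=0000000000200000 in the BPRM_FCAPS sample has only bit 21 set and so decodes to sys_admin, and an all-zero pa decodes to none.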
Signed-off-by: Richard Guy Briggs <rgb(a)redhat.com>
---
auparse/typetab.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/auparse/typetab.h b/auparse/typetab.h
index be82796..42f3e82 100644
--- a/auparse/typetab.h
+++ b/auparse/typetab.h
@@ -89,6 +89,7 @@ _S(AUPARSE_TYPE_SESSION, "ses" )
_S(AUPARSE_TYPE_CAP_BITMAP, "cap_pi" )
_S(AUPARSE_TYPE_CAP_BITMAP, "cap_pe" )
_S(AUPARSE_TYPE_CAP_BITMAP, "cap_pp" )
+_S(AUPARSE_TYPE_CAP_BITMAP, "cap_pa" )
_S(AUPARSE_TYPE_CAP_BITMAP, "cap_fi" )
_S(AUPARSE_TYPE_CAP_BITMAP, "cap_fp" )
_S(AUPARSE_TYPE_CAP_BITMAP, "fp" )
@@ -97,9 +98,14 @@ _S(AUPARSE_TYPE_CAP_BITMAP, "fe" )
_S(AUPARSE_TYPE_CAP_BITMAP, "old_pp" )
_S(AUPARSE_TYPE_CAP_BITMAP, "old_pi" )
_S(AUPARSE_TYPE_CAP_BITMAP, "old_pe" )
+_S(AUPARSE_TYPE_CAP_BITMAP, "old_pa" )
_S(AUPARSE_TYPE_CAP_BITMAP, "new_pp" )
_S(AUPARSE_TYPE_CAP_BITMAP, "new_pi" )
_S(AUPARSE_TYPE_CAP_BITMAP, "new_pe" )
+_S(AUPARSE_TYPE_CAP_BITMAP, "pp" )
+_S(AUPARSE_TYPE_CAP_BITMAP, "pi" )
+_S(AUPARSE_TYPE_CAP_BITMAP, "pe" )
+_S(AUPARSE_TYPE_CAP_BITMAP, "pa" )
_S(AUPARSE_TYPE_NFPROTO, "family" )
_S(AUPARSE_TYPE_ICMPTYPE, "icmptype" )
_S(AUPARSE_TYPE_PROTOCOL, "proto" )
--
1.7.1
Re: [PATCH 2/9] Implement containers as kernel objects
by Richard Guy Briggs
On 2017-05-22 17:22, David Howells wrote:
> A container is then a kernel object that contains the following things:
>
> (1) Namespaces.
>
> (2) A root directory.
>
> (3) A set of processes, including one designated as the 'init' process.
>
> A container is created and attached to a file descriptor by:
>
> int cfd = container_create(const char *name, unsigned int flags);
>
> this inherits all the namespaces of the parent container unless otherwise
> the mask calls for new namespaces.
>
> CONTAINER_NEW_FS_NS
> CONTAINER_NEW_EMPTY_FS_NS
> CONTAINER_NEW_CGROUP_NS [root only]
> CONTAINER_NEW_UTS_NS
> CONTAINER_NEW_IPC_NS
> CONTAINER_NEW_USER_NS
> CONTAINER_NEW_PID_NS
> CONTAINER_NEW_NET_NS
>
> Other flags include:
>
> CONTAINER_KILL_ON_CLOSE
> CONTAINER_CLOSE_ON_EXEC
Hi David,
I wanted to respond to this thread with some constructive feedback; better
late than never. I had a look at your fsopen/fsmount() patchset(s) that
support this patchset; they were interesting, but don't directly affect my
work. The primary patch of interest to the audit kernel folks (Paul Moore and
me) is this one; the rest of the patchset is interesting, but not likely
to directly affect us. This patch has most of what we need to solve our
problem.
Paul and I agree that audit is going to have a difficult time identifying
containers or even namespaces without some change to the kernel. The audit
subsystem in the kernel needs at least a basic clue about which container
caused an event to be able to report this at the appropriate level and ignore
it at other levels to avoid a DoS.
We also agree that there will need to be some sort of trigger from userspace to
indicate the creation of a container and its allocated resources and we're not
really picky how that is done, such as a clone flag, a syscall or a sysfs write
(or even a read, I suppose), but there will need to be some permission
restrictions, obviously. (I'd like to see capabilities used for this by adding
a specific container bit to the capabilities bitmask.)
I doubt we will be able to accommodate all definitions or concepts of a
container in a timely fashion. We'll need to start somewhere with a minimum
definition so that we can get traction and actually move forward before another
compelling shared kernel microservice method leaves our entire community
behind. I'd like to declare that a container is a full set of cloned
namespaces, but this is inefficient, overly constricting and unnecessary for
our needs. If we could agree on a minimum definition of a container (which may
have only one specific cloned namespace) then we have something on which to
build. I could even see a container being defined by a trigger sent from
userspace about a process (task) from which all its children are considered to
be within that container, subject to further nesting.
In the simplest usable model for audit, if a container (definition implies and)
starts a PID namespace, then the container ID could simply be the container's
"init" process PID in the initial PID namespace. This assumes that as soon as
that process vanishes, that entire container and all its children are killed
off (which you've done). There may be some container orchestration systems
that don't use a unique PID namespace per container, and imposing this would
cause them challenges.
If containers have at minimum a unique mount namespace then the root path
dentry inode device and inode number could be used, but there are likely better
identifiers. Again, there may be container orchestrators that don't use a
unique mount namespace per container, and imposing this would cause
challenges.
I expect there are similar examples for each of the other namespaces.
If we could pick one namespace type for consensus for which each container has
a unique instance of that namespace, we could use the dev/ino tuple from that
namespace as had originally been suggested by Aristeu Rozanski more than 4
years ago as part of the set of namespace IDs. I had also attempted to
solve this problem by using the namespace's proc inode, then switched over to
generate a unique kernel serial number for each namespace and then went back to
namespace proc dev/ino once Al Viro implemented nsfs:
v1 https://lkml.org/lkml/2014/4/22/662
v2 https://lkml.org/lkml/2014/5/9/637
v3 https://lkml.org/lkml/2014/5/20/287
v4 https://lkml.org/lkml/2014/8/20/844
v5 https://lkml.org/lkml/2014/10/6/25
v6 https://lkml.org/lkml/2015/4/17/48
v7 https://lkml.org/lkml/2015/5/12/773
These patches don't use a container ID, but track all namespaces in use for an
event. This has the benefit of punting this tracking to userspace for some
other tool to analyse and determine to which container an event belongs.
However, this will use a lot of bandwidth in audit log files, whereas a
single container ID that doesn't require nesting information to be complete
would be a much more efficient use of audit log bandwidth.
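As a rough userspace illustration of the dev/ino tuple discussed above, the
sketch below reads the identity of a namespace from its nsfs entry under /proc
using stat(2). The names ns_id, ns_id_from_path and ns_id_equal are made up
for this example and are not part of any patch in this thread.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical helper type: the dev/ino pair identifying one namespace. */
struct ns_id {
	uint64_t dev;
	uint64_t ino;
};

/*
 * Fill *out from the file at path, e.g. "/proc/self/ns/mnt".
 * Returns 0 on success, -1 on failure (errno is set by stat()).
 */
int ns_id_from_path(const char *path, struct ns_id *out)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	out->dev = (uint64_t)st.st_dev;
	out->ino = (uint64_t)st.st_ino;
	return 0;
}

/* Two tasks share a namespace only if both halves of the tuple match. */
bool ns_id_equal(const struct ns_id *a, const struct ns_id *b)
{
	return a->dev == b->dev && a->ino == b->ino;
}
```

Calling ns_id_from_path() on /proc/<pid>/ns/<type> for two tasks and comparing
the results with ns_id_equal() tells whether they share that namespace. Doing
this for every namespace type on every event is the per-record cost that a
single container ID would avoid.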
If we rely only on the setting of arbitrary container names from userspace,
then we must provide a map or tree back to the initial audit domain for that
running kernel to be able to differentiate between potentially identical
container names assigned in a nested container system. If we assign a
container serial number sequentially (atomic64_inc) from the kernel on request
from userspace like the sessionID and log the creation with all nsIDs and the
parent container serial number and/or container name, the nesting is clear
because duplicate names at different nesting levels stay distinct. If a container
serial number is used, the tree of inheritance of nested containers can be
rebuilt from the audit records showing what containers were spawned from what
parent.
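To make the serial number idea concrete, here is a minimal userspace sketch,
assuming each creation record carries the container's own serial, its parent's
serial and a name. It uses C11 atomics as a stand-in for atomic64_inc; the
struct and function names (cont_rec, cont_serial_alloc, cont_find, cont_depth)
are hypothetical and only illustrate how a log parser could rebuild the
nesting.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical creation record, as a log parser might store it. */
struct cont_rec {
	uint64_t serial;	/* assigned once, never reused */
	uint64_t parent;	/* parent's serial, 0 for the initial domain */
	const char *name;	/* chosen by userspace, may be duplicated */
};

static _Atomic uint64_t cont_serial_next = 1;

/* Hand out the next serial number; serial 0 is reserved for "no parent". */
uint64_t cont_serial_alloc(void)
{
	return atomic_fetch_add(&cont_serial_next, 1);
}

/* Linear lookup of a record by serial; NULL if it was never logged. */
const struct cont_rec *cont_find(const struct cont_rec *recs, size_t n,
				 uint64_t serial)
{
	for (size_t i = 0; i < n; i++)
		if (recs[i].serial == serial)
			return &recs[i];
	return NULL;
}

/*
 * Count how many containers enclose the one with the given serial.
 * Returns -1 if a link in the chain is missing or the chain loops.
 */
int cont_depth(const struct cont_rec *recs, size_t n, uint64_t serial)
{
	int depth = 0;
	const struct cont_rec *r = cont_find(recs, n, serial);

	while (r && (size_t)depth <= n) {
		if (r->parent == 0)
			return depth;
		r = cont_find(recs, n, r->parent);
		depth++;
	}
	return -1;
}
```

Two containers may both be named "web", yet their serials differ, so walking
the parent links always gives an unambiguous position in the tree.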
As was suggested in one of the previous threads, if there are any events not
associated with a task (incoming network packets) we log the namespace ID and
then only concern ourselves with its container serial number or container name
once it becomes associated with a task at which point that tracking will be
more important anyways.
I'm not convinced that a userspace or kernel generated UUID is that useful
since they are large, not human readable and may not be globally unique given
the "pets vs cattle" direction we are going with potentially identical
conditions in hosts or containers spawning containers, but I see no need to
restrict them.
How do we deal with setns()? Once it is determined that action is permitted,
given the new combination of namespaces and potential membership in a different
container, record the transition from one container to another including all
namespaces if the latter are a different subset than the target container
initial set.
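A sketch of the check implied here, assuming each task and each container can
be described by one namespace inode number per namespace type. The ns_set
layout and the ns_set_diff() helper are hypothetical; the count of seven types
follows the CONTAINER_NEW_*_NS flags in the quoted patch.

```c
#include <stddef.h>
#include <stdint.h>

#define NS_TYPES 7	/* mnt, cgroup, uts, ipc, user, pid, net */

/* Hypothetical view of a namespace set: one inode number per type. */
struct ns_set {
	uint64_t ino[NS_TYPES];
};

/*
 * Return a bitmask of the namespace types in which the task's set
 * differs from the set the target container was created with.
 * Zero means the task matches the container's initial set exactly.
 */
unsigned int ns_set_diff(const struct ns_set *task,
			 const struct ns_set *initial)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < NS_TYPES; i++)
		if (task->ino[i] != initial->ino[i])
			mask |= 1u << i;
	return mask;
}
```

A nonzero result after setns() is the case where the transition record would
need to list all namespaces rather than just the two container identifiers.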
David, this patch of yours provides most of what we need, but there is a danger
that some compromises (complete freedom of which namespaces to clone) will make
it unusable for our needs unless other mechanisms are added (internal container
serial number).
To answer Andy's inevitable question: We want to be able to attribute audit
events, whether they are generated by userspace or by a kernel event, to a
specific container. Since the kernel has no concept of a container, it needs
at least a rudimentary one to be able to track activity of kernel objects,
similar to what is already done with the loginuid (auid) and sessionid, neither
of which are kernel concepts, but the kernel keeps track of these as a service
to userspace. We are able to track activity by task, but we don't know when
that task or its namespaces (both resources) were allocated to a nebulous
"container". This resource tracking is required for security
certifications.
Thanks.
> Note that I've added a pointer to the current container to task_struct.
> This doesn't make the nsproxy pointer redundant as you can still make new
> namespaces with clone().
>
> I've also added a list_head to task_struct to form a list in the container
> of its member processes. This is convenient, but redundant since the code
> could iterate over all the tasks looking for ones that have a matching
> task->container.
>
>
> ==================
> FUTURE DEVELOPMENT
> ==================
>
> (1) Setting up the container.
>
> It should then be possible for the supervising process to modify the
> new container by:
>
> container_mount(int cfd,
> const char *source,
> const char *target, /* NULL -> root */
> const char *filesystemtype,
> unsigned long mountflags,
> const void *data);
> container_chroot(int cfd, const char *path);
> container_bind_mount_across(int cfd,
> const char *source,
> const char *target); /* NULL -> root */
> mkdirat(int cfd, const char *path, mode_t mode);
> mknodat(int cfd, const char *path, mode_t mode, dev_t dev);
> int fd = openat(int cfd, const char *path,
> unsigned int flags, mode_t mode);
> int fd = container_socket(int cfd, int domain, int type,
> int protocol);
>
> Opening a netlink socket inside the container should allow management
> of the container's network namespace.
>
> (2) Starting the container.
>
> Once all modifications are complete, the container's 'init' process
> can be started by:
>
> fork_into_container(int cfd);
>
> This precludes further external modification of the mount tree within
> the container. Before this point, the container is simply destroyed
> if the container fd is closed.
>
> (3) Waiting for the container to complete.
>
> The container fd can then be polled to wait for init process therein
> to complete and the exit code collected by:
>
> container_wait(int container_fd, int *_wstatus, unsigned int wait,
> struct rusage *rusage);
>
> The container and everything in it can be terminated or killed off:
>
> container_kill(int container_fd, int initonly, int signal);
>
> If 'init' dies, all other processes in the container are preemptively
> SIGKILL'd by the kernel.
>
> By default, if the container is active and its fd is closed, the
> container is left running and will be cleaned up when its 'init' exits.
> The default can be changed with the CONTAINER_KILL_ON_CLOSE flag.
>
> (4) Supervising the container.
>
> Given that we have an fd attached to the container, we could make it
> such that the supervising process could monitor and override EPERM
> returns for mount and other privileged operations within the
> container.
>
> (5) Device restriction.
>
> Containers could come with a list of device IDs that the container is
> allowed to open. Perhaps a list of major numbers, each with a bitmap of
> permitted minor numbers.
>
> (6) Per-container keyring.
>
> Each container could be given a per-container keyring for the holding
> of integrity keys and filesystem keys. This list would be only
> modifiable by the container's 'root' user and the supervisor process:
>
> container_add_key(const char *type, const char *description,
> const void *payload, size_t plen,
> int container_fd);
>
> The keys on the keyring would, however, be accessible/usable by all
> processes within the container.
>
>
> ===============
> EXAMPLE PROGRAM
> ===============
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/wait.h>
>
> #define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
> #define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
> #define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace [priv] */
> #define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
> #define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
> #define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
> #define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
> #define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
> #define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
> #define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
> #define CONTAINER__FLAG_MASK 0x000003ff
>
> static inline int container_create(const char *name, unsigned int mask)
> {
> return syscall(333, name, mask, 0, 0, 0);
> }
>
> static inline int fork_into_container(int containerfd)
> {
> return syscall(334, containerfd);
> }
>
> int main()
> {
> pid_t pid;
> int fd, ws;
>
> fd = container_create("foo-test",
> CONTAINER__FLAG_MASK & ~(
> CONTAINER_NEW_EMPTY_FS_NS |
> CONTAINER_NEW_CGROUP_NS));
> if (fd == -1) {
> perror("container_create");
> exit(1);
> }
>
> system("cat /proc/containers");
>
> switch ((pid = fork_into_container(fd))) {
> case -1:
> perror("fork_into_container");
> exit(1);
> case 0:
> close(fd);
> setenv("PS1", "container>", 1);
> execl("/bin/bash", "bash", NULL);
> perror("execl");
> exit(1);
> default:
> if (waitpid(pid, &ws, 0) < 0) {
> perror("waitpid");
> exit(1);
> }
> }
> close(fd);
> exit(0);
> }
>
> Signed-off-by: David Howells <dhowells(a)redhat.com>
> ---
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 1
> fs/namespace.c | 5
> include/linux/container.h | 85 ++++++
> include/linux/init_task.h | 4
> include/linux/lsm_hooks.h | 21 +
> include/linux/sched.h | 3
> include/linux/security.h | 15 +
> include/linux/syscalls.h | 3
> include/uapi/linux/container.h | 28 ++
> include/uapi/linux/magic.h | 1
> init/Kconfig | 7
> kernel/Makefile | 2
> kernel/container.c | 462 ++++++++++++++++++++++++++++++++
> kernel/exit.c | 1
> kernel/fork.c | 7
> kernel/namespaces.h | 15 +
> kernel/nsproxy.c | 23 +-
> kernel/sys_ni.c | 4
> security/security.c | 13 +
> 20 files changed, 688 insertions(+), 13 deletions(-)
> create mode 100644 include/linux/container.h
> create mode 100644 include/uapi/linux/container.h
> create mode 100644 kernel/container.c
> create mode 100644 kernel/namespaces.h
>
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index abe6ea95e0e6..9ccd0f52f874 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -393,3 +393,4 @@
> 384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
> 385 i386 fsopen sys_fsopen
> 386 i386 fsmount sys_fsmount
> +387 i386 container_create sys_container_create
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 0977c5079831..dab92591511e 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -341,6 +341,7 @@
> 332 common statx sys_statx
> 333 common fsopen sys_fsopen
> 334 common fsmount sys_fsmount
> +335 common container_create sys_container_create
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 4e9ad16db79c..7e2d5fe5728b 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -28,6 +28,7 @@
> #include <linux/file.h>
> #include <linux/sched/task.h>
> #include <linux/sb_config.h>
> +#include <linux/container.h>
>
> #include "pnode.h"
> #include "internal.h"
> @@ -3510,6 +3511,10 @@ static void __init init_mount_tree(void)
>
> set_fs_pwd(current->fs, &root);
> set_fs_root(current->fs, &root);
> +#ifdef CONFIG_CONTAINERS
> + path_get(&root);
> + init_container.root = root;
> +#endif
> }
>
> void __init mnt_init(void)
> diff --git a/include/linux/container.h b/include/linux/container.h
> new file mode 100644
> index 000000000000..084ea9982fe6
> --- /dev/null
> +++ b/include/linux/container.h
> @@ -0,0 +1,85 @@
> +/* Container objects
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells(a)redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _LINUX_CONTAINER_H
> +#define _LINUX_CONTAINER_H
> +
> +#include <uapi/linux/container.h>
> +#include <linux/refcount.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +#include <linux/path.h>
> +#include <linux/seqlock.h>
> +
> +struct fs_struct;
> +struct nsproxy;
> +struct task_struct;
> +
> +/*
> + * The container object.
> + */
> +struct container {
> + char name[24];
> + refcount_t usage;
> + int exit_code; /* The exit code of 'init' */
> + const struct cred *cred; /* Creds for this container, including userns */
> + struct nsproxy *ns; /* This container's namespaces */
> + struct path root; /* The root of the container's fs namespace */
> + struct task_struct *init; /* The 'init' task for this container */
> + struct container *parent; /* Parent of this container. */
> + void *security; /* LSM data */
> + struct list_head members; /* Member processes, guarded with ->lock */
> + struct list_head child_link; /* Link in parent->children */
> + struct list_head children; /* Child containers */
> + wait_queue_head_t waitq; /* Someone waiting for init to exit waits here */
> + unsigned long flags;
> +#define CONTAINER_FLAG_INIT_STARTED 0 /* Init is started - certain ops now prohibited */
> +#define CONTAINER_FLAG_DEAD 1 /* Init has died */
> +#define CONTAINER_FLAG_KILL_ON_CLOSE 2 /* Kill init if container handle closed */
> + spinlock_t lock;
> + seqcount_t seq; /* Track changes in ->root */
> +};
> +
> +extern struct container init_container;
> +
> +#ifdef CONFIG_CONTAINERS
> +extern const struct file_operations containerfs_fops;
> +
> +extern int copy_container(unsigned long flags, struct task_struct *tsk,
> + struct container *container);
> +extern void exit_container(struct task_struct *tsk);
> +extern void put_container(struct container *c);
> +
> +static inline struct container *get_container(struct container *c)
> +{
> + refcount_inc(&c->usage);
> + return c;
> +}
> +
> +static inline bool is_container_file(struct file *file)
> +{
> + return file->f_op == &containerfs_fops;
> +}
> +
> +#else
> +
> +static inline int copy_container(unsigned long flags, struct task_struct *tsk,
> + struct container *container)
> +{ return 0; }
> +static inline void exit_container(struct task_struct *tsk) { }
> +static inline void put_container(struct container *c) {}
> +static inline struct container *get_container(struct container *c) { return NULL; }
> +static inline bool is_container_file(struct file *file) { return false; }
> +
> +#endif /* CONFIG_CONTAINERS */
> +
> +#endif /* _LINUX_CONTAINER_H */
> diff --git a/include/linux/init_task.h b/include/linux/init_task.h
> index e049526bc188..488385ad79db 100644
> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -9,6 +9,7 @@
> #include <linux/ipc.h>
> #include <linux/pid_namespace.h>
> #include <linux/user_namespace.h>
> +#include <linux/container.h>
> #include <linux/securebits.h>
> #include <linux/seqlock.h>
> #include <linux/rbtree.h>
> @@ -273,6 +274,9 @@ extern struct cred init_cred;
> .signal = &init_signals, \
> .sighand = &init_sighand, \
> .nsproxy = &init_nsproxy, \
> + .container = &init_container, \
> + .container_link.next = &init_container.members, \
> + .container_link.prev = &init_container.members, \
> .pending = { \
> .list = LIST_HEAD_INIT(tsk.pending.list), \
> .signal = {{0}}}, \
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index 7064c0c15386..7b0d484a6a25 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -1368,6 +1368,17 @@
> * @inode we wish to get the security context of.
> * @ctx is a pointer in which to place the allocated security context.
> * @ctxlen points to the place to put the length of @ctx.
> + *
> + * Security hooks for containers:
> + *
> + * @container_alloc:
> + * Permit creation of a new container and assign security data.
> + * @container: The new container.
> + *
> + * @container_free:
> + * Free security data attached to a container.
> + * @container: The container.
> + *
> * This is the main security structure.
> */
>
> @@ -1699,6 +1710,12 @@ union security_list_options {
> struct audit_context *actx);
> void (*audit_rule_free)(void *lsmrule);
> #endif /* CONFIG_AUDIT */
> +
> + /* Container management security hooks */
> +#ifdef CONFIG_CONTAINERS
> + int (*container_alloc)(struct container *container, unsigned int flags);
> + void (*container_free)(struct container *container);
> +#endif
> };
>
> struct security_hook_heads {
> @@ -1919,6 +1936,10 @@ struct security_hook_heads {
> struct list_head audit_rule_match;
> struct list_head audit_rule_free;
> #endif /* CONFIG_AUDIT */
> +#ifdef CONFIG_CONTAINERS
> + struct list_head container_alloc;
> + struct list_head container_free;
> +#endif /* CONFIG_CONTAINERS */
> };
>
> /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index eba196521562..d9b92a98f99f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -33,6 +33,7 @@ struct backing_dev_info;
> struct bio_list;
> struct blk_plug;
> struct cfs_rq;
> +struct container;
> struct fs_struct;
> struct futex_pi_state;
> struct io_context;
> @@ -741,6 +742,8 @@ struct task_struct {
>
> /* Namespaces: */
> struct nsproxy *nsproxy;
> + struct container *container;
> + struct list_head container_link;
>
> /* Signal handlers: */
> struct signal_struct *signal;
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 8c06e158c195..01bdf7637ec6 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -68,6 +68,7 @@ struct ctl_table;
> struct audit_krule;
> struct user_namespace;
> struct timezone;
> +struct container;
>
> /* These functions are in security/commoncap.c */
> extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
> @@ -1672,6 +1673,20 @@ static inline void security_audit_rule_free(void *lsmrule)
> #endif /* CONFIG_SECURITY */
> #endif /* CONFIG_AUDIT */
>
> +#ifdef CONFIG_CONTAINERS
> +#ifdef CONFIG_SECURITY
> +int security_container_alloc(struct container *container, unsigned int flags);
> +void security_container_free(struct container *container);
> +#else
> +static inline int security_container_alloc(struct container *container,
> + unsigned int flags)
> +{
> + return 0;
> +}
> +static inline void security_container_free(struct container *container) {}
> +#endif
> +#endif /* CONFIG_CONTAINERS */
> +
> #ifdef CONFIG_SECURITYFS
>
> extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 07e4f775f24d..5a0324dd024c 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -908,5 +908,8 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
> asmlinkage long sys_fsopen(const char *fs_name, int containerfd, unsigned int flags);
> asmlinkage long sys_fsmount(int fsfd, int dfd, const char *path, unsigned int at_flags,
> unsigned int flags);
> +asmlinkage long sys_container_create(const char __user *name, unsigned int flags,
> + unsigned long spare3, unsigned long spare4,
> + unsigned long spare5);
>
> #endif
> diff --git a/include/uapi/linux/container.h b/include/uapi/linux/container.h
> new file mode 100644
> index 000000000000..43748099b28d
> --- /dev/null
> +++ b/include/uapi/linux/container.h
> @@ -0,0 +1,28 @@
> +/* Container UAPI
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells(a)redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#ifndef _UAPI_LINUX_CONTAINER_H
> +#define _UAPI_LINUX_CONTAINER_H
> +
> +
> +#define CONTAINER_NEW_FS_NS 0x00000001 /* Dup current fs namespace */
> +#define CONTAINER_NEW_EMPTY_FS_NS 0x00000002 /* Provide new empty fs namespace */
> +#define CONTAINER_NEW_CGROUP_NS 0x00000004 /* Dup current cgroup namespace */
> +#define CONTAINER_NEW_UTS_NS 0x00000008 /* Dup current uts namespace */
> +#define CONTAINER_NEW_IPC_NS 0x00000010 /* Dup current ipc namespace */
> +#define CONTAINER_NEW_USER_NS 0x00000020 /* Dup current user namespace */
> +#define CONTAINER_NEW_PID_NS 0x00000040 /* Dup current pid namespace */
> +#define CONTAINER_NEW_NET_NS 0x00000080 /* Dup current net namespace */
> +#define CONTAINER_KILL_ON_CLOSE 0x00000100 /* Kill all member processes when fd closed */
> +#define CONTAINER_FD_CLOEXEC 0x00000200 /* Close the fd on exec */
> +#define CONTAINER__FLAG_MASK 0x000003ff
> +
> +#endif /* _UAPI_LINUX_CONTAINER_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 88ae83492f7c..758705412b44 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -85,5 +85,6 @@
> #define BALLOON_KVM_MAGIC 0x13661366
> #define ZSMALLOC_MAGIC 0x58295829
> #define FS_FS_MAGIC 0x66736673
> +#define CONTAINERFS_MAGIC 0x636f6e74
>
> #endif /* __LINUX_MAGIC_H__ */
> diff --git a/init/Kconfig b/init/Kconfig
> index 1d3475fc9496..3a0ee88df6c8 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1288,6 +1288,13 @@ config NET_NS
> Allow user space to create what appear to be multiple instances
> of the network stack.
>
> +config CONTAINERS
> + bool "Container support"
> + default y
> + help
> + Allow userspace to create and manipulate containers as objects that
> + have namespaces and hold a set of processes.
> +
> endif # NAMESPACES
>
> config SCHED_AUTOGROUP
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 72aa080f91f0..117479b05fb1 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
> sysctl.o sysctl_binary.o capability.o ptrace.o user.o \
> signal.o sys.o kmod.o workqueue.o pid.o task_work.o \
> extable.o params.o \
> - kthread.o sys_ni.o nsproxy.o \
> + kthread.o sys_ni.o nsproxy.o container.o \
> notifier.o ksysfs.o cred.o reboot.o \
> async.o range.o smpboot.o ucount.o
>
> diff --git a/kernel/container.c b/kernel/container.c
> new file mode 100644
> index 000000000000..eef1566835eb
> --- /dev/null
> +++ b/kernel/container.c
> @@ -0,0 +1,462 @@
> +/* Implement container objects.
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells(a)redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include <linux/poll.h>
> +#include <linux/wait.h>
> +#include <linux/init_task.h>
> +#include <linux/fs.h>
> +#include <linux/fs_struct.h>
> +#include <linux/mount.h>
> +#include <linux/file.h>
> +#include <linux/container.h>
> +#include <linux/magic.h>
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/security.h>
> +#include "namespaces.h"
> +
> +struct container init_container = {
> + .name = ".init",
> + .usage = REFCOUNT_INIT(2),
> + .cred = &init_cred,
> + .ns = &init_nsproxy,
> + .init = &init_task,
> + .members.next = &init_task.container_link,
> + .members.prev = &init_task.container_link,
> + .children = LIST_HEAD_INIT(init_container.children),
> + .flags = (1 << CONTAINER_FLAG_INIT_STARTED),
> + .lock = __SPIN_LOCK_UNLOCKED(init_container.lock),
> + .seq = SEQCNT_ZERO(init_fs.seq),
> +};
> +
> +#ifdef CONFIG_CONTAINERS
> +
> +static struct vfsmount *containerfs_mnt __read_mostly;
> +
> +/*
> + * Drop a ref on a container and clear it if no longer in use.
> + */
> +void put_container(struct container *c)
> +{
> + struct container *parent;
> +
> + while (c && refcount_dec_and_test(&c->usage)) {
> + BUG_ON(!list_empty(&c->members));
> + if (c->ns)
> + put_nsproxy(c->ns);
> + path_put(&c->root);
> +
> + parent = c->parent;
> + if (parent) {
> + spin_lock(&parent->lock);
> + list_del(&c->child_link);
> + spin_unlock(&parent->lock);
> + }
> +
> + if (c->cred)
> + put_cred(c->cred);
> + security_container_free(c);
> + kfree(c);
> + c = parent;
> + }
> +}
> +
> +/*
> + * Allow the user to poll for the container dying.
> + */
> +static unsigned int containerfs_poll(struct file *file, poll_table *wait)
> +{
> + struct container *container = file->private_data;
> + unsigned int mask = 0;
> +
> + poll_wait(file, &container->waitq, wait);
> +
> + if (test_bit(CONTAINER_FLAG_DEAD, &container->flags))
> + mask |= POLLHUP;
> +
> + return mask;
> +}
> +
> +static int containerfs_release(struct inode *inode, struct file *file)
> +{
> + struct container *container = file->private_data;
> +
> + put_container(container);
> + return 0;
> +}
> +
> +const struct file_operations containerfs_fops = {
> + .poll = containerfs_poll,
> + .release = containerfs_release,
> +};
> +
> +/*
> + * Indicate the name we want to display the container file as.
> + */
> +static char *containerfs_dname(struct dentry *dentry, char *buffer, int buflen)
> +{
> + return dynamic_dname(dentry, buffer, buflen, "container:[%lu]",
> + d_inode(dentry)->i_ino);
> +}
> +
> +static const struct dentry_operations containerfs_dentry_operations = {
> + .d_dname = containerfs_dname,
> +};
> +
> +/*
> + * Allocate a container.
> + */
> +static struct container *alloc_container(const char __user *name)
> +{
> + struct container *c;
> + long len;
> + int ret;
> +
> + c = kzalloc(sizeof(struct container), GFP_KERNEL);
> + if (!c)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&c->members);
> + INIT_LIST_HEAD(&c->children);
> + init_waitqueue_head(&c->waitq);
> + spin_lock_init(&c->lock);
> + refcount_set(&c->usage, 1);
> +
> + ret = -EFAULT;
> + len = strncpy_from_user(c->name, name, sizeof(c->name));
> + if (len < 0)
> + goto err;
> + ret = -ENAMETOOLONG;
> + if (len >= sizeof(c->name))
> + goto err;
> + ret = -EINVAL;
> + if (strchr(c->name, '/'))
> + goto err;
> +
> + c->name[len] = 0;
> + return c;
> +
> +err:
> + kfree(c);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a supervisory file for a new container
> + */
> +static struct file *create_container_file(struct container *c)
> +{
> + struct inode *inode;
> + struct file *f;
> + struct path path;
> + int ret;
> +
> + inode = alloc_anon_inode(containerfs_mnt->mnt_sb);
> + if (!inode)
> + return ERR_PTR(-ENFILE);
> + inode->i_fop = &containerfs_fops;
> +
> + ret = -ENOMEM;
> + path.dentry = d_alloc_pseudo(containerfs_mnt->mnt_sb, &empty_name);
> + if (!path.dentry)
> + goto err_inode;
> + path.mnt = mntget(containerfs_mnt);
> +
> + d_instantiate(path.dentry, inode);
> +
> + f = alloc_file(&path, 0, &containerfs_fops);
> + if (IS_ERR(f)) {
> + ret = PTR_ERR(f);
> + goto err_file;
> + }
> +
> + f->private_data = c;
> + return f;
> +
> +err_file:
> + path_put(&path);
> + return ERR_PTR(ret);
> +
> +err_inode:
> + iput(inode);
> + return ERR_PTR(ret);
> +}
> +
> +static const struct super_operations containerfs_ops = {
> + .drop_inode = generic_delete_inode,
> + .destroy_inode = free_inode_nonrcu,
> + .statfs = simple_statfs,
> +};
> +
> +/*
> + * containerfs should _never_ be mounted by userland - too much of security
> + * hassle, no real gain from having the whole whorehouse mounted. So we don't
> + * need any operations on the root directory. However, we need a non-trivial
> + * d_name - container: will go nicely and kill the special-casing in procfs.
> + */
> +static struct dentry *containerfs_mount(struct file_system_type *fs_type,
> + int flags, const char *dev_name,
> + void *data)
> +{
> + return mount_pseudo(fs_type, "container:", &containerfs_ops,
> + &containerfs_dentry_operations, CONTAINERFS_MAGIC);
> +}
> +
> +static struct file_system_type container_fs_type = {
> + .name = "containerfs",
> + .mount = containerfs_mount,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static int __init init_container_fs(void)
> +{
> + int ret;
> +
> + ret = register_filesystem(&container_fs_type);
> + if (ret < 0)
> + panic("Cannot register containerfs\n");
> +
> + containerfs_mnt = kern_mount(&container_fs_type);
> + if (IS_ERR(containerfs_mnt))
> + panic("Cannot mount containerfs: %ld\n",
> + PTR_ERR(containerfs_mnt));
> +
> + return 0;
> +}
> +
> +fs_initcall(init_container_fs);
> +
> +/*
> + * Handle fork/clone.
> + *
> + * A process inherits its parent's container. The first process into the
> + * container is its 'init' process and the life of everything else in there is
> + * dependent upon that.
> + */
> +int copy_container(unsigned long flags, struct task_struct *tsk,
> + struct container *container)
> +{
> + struct container *c = container ?: tsk->container;
> + int ret = -ECANCELED;
> +
> + spin_lock(&c->lock);
> +
> + if (!test_bit(CONTAINER_FLAG_DEAD, &c->flags)) {
> + list_add_tail(&tsk->container_link, &c->members);
> + get_container(c);
> + tsk->container = c;
> + if (!c->init) {
> + set_bit(CONTAINER_FLAG_INIT_STARTED, &c->flags);
> + c->init = tsk;
> + }
> + ret = 0;
> + }
> +
> + spin_unlock(&c->lock);
> + return ret;
> +}
> +
> +/*
> + * Remove a dead process from a container.
> + *
> + * If the 'init' process in a container dies, we kill off all the other
> + * processes in the container.
> + */
> +void exit_container(struct task_struct *tsk)
> +{
> + struct task_struct *p;
> + struct container *c = tsk->container;
> + struct siginfo si = {
> + .si_signo = SIGKILL,
> + .si_code = SI_KERNEL,
> + };
> +
> + spin_lock(&c->lock);
> +
> + list_del(&tsk->container_link);
> +
> + if (c->init == tsk) {
> + c->init = NULL;
> + c->exit_code = tsk->exit_code;
> + smp_wmb(); /* Order exit_code vs CONTAINER_DEAD. */
> + set_bit(CONTAINER_FLAG_DEAD, &c->flags);
> + wake_up_bit(&c->flags, CONTAINER_FLAG_DEAD);
> +
> + list_for_each_entry(p, &c->members, container_link) {
> + si.si_pid = task_tgid_vnr(p);
> + send_sig_info(SIGKILL, &si, p);
> + }
> + }
> +
> + spin_unlock(&c->lock);
> + put_container(c);
> +}
> +
> +/*
> + * Create some creds for the container. We don't want to pin things we don't
> + * have to, so drop all keyrings from the new cred. The LSM gets to audit the
> + * cred struct when security_container_alloc() is invoked.
> + */
> +static const struct cred *create_container_creds(unsigned int flags)
> +{
> + struct cred *new;
> + int ret;
> +
> + new = prepare_creds();
> + if (!new)
> + return ERR_PTR(-ENOMEM);
> +
> +#ifdef CONFIG_KEYS
> + key_put(new->thread_keyring);
> + new->thread_keyring = NULL;
> + key_put(new->process_keyring);
> + new->process_keyring = NULL;
> + key_put(new->session_keyring);
> + new->session_keyring = NULL;
> + key_put(new->request_key_auth);
> + new->request_key_auth = NULL;
> +#endif
> +
> + if (flags & CONTAINER_NEW_USER_NS) {
> + ret = create_user_ns(new);
> + if (ret < 0)
> + goto err;
> + new->euid = new->user_ns->owner;
> + new->egid = new->user_ns->group;
> + }
> +
> + new->fsuid = new->suid = new->uid = new->euid;
> + new->fsgid = new->sgid = new->gid = new->egid;
> + return new;
> +
> +err:
> + abort_creds(new);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container.
> + */
> +static struct container *create_container(const char *name, unsigned int flags)
> +{
> + struct container *parent, *c;
> + struct fs_struct *fs;
> + struct nsproxy *ns;
> + const struct cred *cred;
> + int ret;
> +
> + c = alloc_container(name);
> + if (IS_ERR(c))
> + return c;
> +
> + if (flags & CONTAINER_KILL_ON_CLOSE)
> + __set_bit(CONTAINER_FLAG_KILL_ON_CLOSE, &c->flags);
> +
> + cred = create_container_creds(flags);
> + if (IS_ERR(cred)) {
> + ret = PTR_ERR(cred);
> + goto err_cont;
> + }
> + c->cred = cred;
> +
> + ret = -ENOMEM;
> + fs = copy_fs_struct(current->fs);
> + if (!fs)
> + goto err_cont;
> +
> + ns = create_new_namespaces(
> + (flags & CONTAINER_NEW_FS_NS ? CLONE_NEWNS : 0) |
> + (flags & CONTAINER_NEW_CGROUP_NS ? CLONE_NEWCGROUP : 0) |
> + (flags & CONTAINER_NEW_UTS_NS ? CLONE_NEWUTS : 0) |
> + (flags & CONTAINER_NEW_IPC_NS ? CLONE_NEWIPC : 0) |
> + (flags & CONTAINER_NEW_PID_NS ? CLONE_NEWPID : 0) |
> + (flags & CONTAINER_NEW_NET_NS ? CLONE_NEWNET : 0),
> + current->nsproxy, cred->user_ns, fs);
> + if (IS_ERR(ns)) {
> + ret = PTR_ERR(ns);
> + goto err_fs;
> + }
> +
> + c->ns = ns;
> + c->root = fs->root;
> + c->seq = fs->seq;
> + fs->root.mnt = NULL;
> + fs->root.dentry = NULL;
> +
> + ret = security_container_alloc(c, flags);
> + if (ret < 0)
> + goto err_fs;
> +
> + parent = current->container;
> + get_container(parent);
> + c->parent = parent;
> + spin_lock(&parent->lock);
> + list_add_tail(&c->child_link, &parent->children);
> + spin_unlock(&parent->lock);
> + return c;
> +
> +err_fs:
> + free_fs_struct(fs);
> +err_cont:
> + put_container(c);
> + return ERR_PTR(ret);
> +}
> +
> +/*
> + * Create a new container object.
> + */
> +SYSCALL_DEFINE5(container_create,
> + const char __user *, name,
> + unsigned int, flags,
> + unsigned long, spare3,
> + unsigned long, spare4,
> + unsigned long, spare5)
> +{
> + struct container *c;
> + struct file *f;
> + int ret, fd;
> +
> + if (!name ||
> + flags & ~CONTAINER__FLAG_MASK ||
> + spare3 != 0 || spare4 != 0 || spare5 != 0)
> + return -EINVAL;
> + if ((flags & (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS)) ==
> + (CONTAINER_NEW_FS_NS | CONTAINER_NEW_EMPTY_FS_NS))
> + return -EINVAL;
> +
> + c = create_container(name, flags);
> + if (IS_ERR(c))
> + return PTR_ERR(c);
> +
> + f = create_container_file(c);
> + if (IS_ERR(f)) {
> + ret = PTR_ERR(f);
> + goto err_cont;
> + }
> +
> + ret = get_unused_fd_flags(flags & CONTAINER_FD_CLOEXEC ? O_CLOEXEC : 0);
> + if (ret < 0)
> + goto err_file;
> +
> + fd = ret;
> + fd_install(fd, f);
> + return fd;
> +
> +err_file:
> + fput(f);
> + return ret;
> +err_cont:
> + put_container(c);
> + return ret;
> +}
> +
> +#endif /* CONFIG_CONTAINERS */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 31b8617aee04..1ff87f7e40a2 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -875,6 +875,7 @@ void __noreturn do_exit(long code)
> if (group_dead)
> disassociate_ctty(1);
> exit_task_namespaces(tsk);
> + exit_container(tsk);
> exit_task_work(tsk);
> exit_thread(tsk);
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index aec6672d3f0e..ff2779426fe9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1728,9 +1728,12 @@ static __latent_entropy struct task_struct *copy_process(
> retval = copy_namespaces(clone_flags, p);
> if (retval)
> goto bad_fork_cleanup_mm;
> - retval = copy_io(clone_flags, p);
> + retval = copy_container(clone_flags, p, NULL);
> if (retval)
> goto bad_fork_cleanup_namespaces;
> + retval = copy_io(clone_flags, p);
> + if (retval)
> + goto bad_fork_cleanup_container;
> retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
> if (retval)
> goto bad_fork_cleanup_io;
> @@ -1918,6 +1921,8 @@ static __latent_entropy struct task_struct *copy_process(
> bad_fork_cleanup_io:
> if (p->io_context)
> exit_io_context(p);
> +bad_fork_cleanup_container:
> + exit_container(p);
> bad_fork_cleanup_namespaces:
> exit_task_namespaces(p);
> bad_fork_cleanup_mm:
> diff --git a/kernel/namespaces.h b/kernel/namespaces.h
> new file mode 100644
> index 000000000000..c44e3cf0e254
> --- /dev/null
> +++ b/kernel/namespaces.h
> @@ -0,0 +1,15 @@
> +/* Local namespaces defs
> + *
> + * Copyright (C) 2017 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells(a)redhat.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public Licence
> + * as published by the Free Software Foundation; either version
> + * 2 of the Licence, or (at your option) any later version.
> + */
> +
> +extern struct nsproxy *create_new_namespaces(unsigned long flags,
> + struct nsproxy *nsproxy,
> + struct user_namespace *user_ns,
> + struct fs_struct *new_fs);
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index f6c5d330059a..4bb5184b3a80 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -27,6 +27,7 @@
> #include <linux/syscalls.h>
> #include <linux/cgroup.h>
> #include <linux/perf_event.h>
> +#include "namespaces.h"
>
> static struct kmem_cache *nsproxy_cachep;
>
> @@ -61,8 +62,8 @@ static inline struct nsproxy *create_nsproxy(void)
> * Return the newly created nsproxy. Do not attach this to the task,
> * leave it to the caller to do proper locking and attach it to task.
> */
> -static struct nsproxy *create_new_namespaces(unsigned long flags,
> - struct task_struct *tsk, struct user_namespace *user_ns,
> +struct nsproxy *create_new_namespaces(unsigned long flags,
> + struct nsproxy *nsproxy, struct user_namespace *user_ns,
> struct fs_struct *new_fs)
> {
> struct nsproxy *new_nsp;
> @@ -72,39 +73,39 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
> if (!new_nsp)
> return ERR_PTR(-ENOMEM);
>
> - new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
> + new_nsp->mnt_ns = copy_mnt_ns(flags, nsproxy->mnt_ns, user_ns, new_fs);
> if (IS_ERR(new_nsp->mnt_ns)) {
> err = PTR_ERR(new_nsp->mnt_ns);
> goto out_ns;
> }
>
> - new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
> + new_nsp->uts_ns = copy_utsname(flags, user_ns, nsproxy->uts_ns);
> if (IS_ERR(new_nsp->uts_ns)) {
> err = PTR_ERR(new_nsp->uts_ns);
> goto out_uts;
> }
>
> - new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
> + new_nsp->ipc_ns = copy_ipcs(flags, user_ns, nsproxy->ipc_ns);
> if (IS_ERR(new_nsp->ipc_ns)) {
> err = PTR_ERR(new_nsp->ipc_ns);
> goto out_ipc;
> }
>
> new_nsp->pid_ns_for_children =
> - copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children);
> + copy_pid_ns(flags, user_ns, nsproxy->pid_ns_for_children);
> if (IS_ERR(new_nsp->pid_ns_for_children)) {
> err = PTR_ERR(new_nsp->pid_ns_for_children);
> goto out_pid;
> }
>
> new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> - tsk->nsproxy->cgroup_ns);
> + nsproxy->cgroup_ns);
> if (IS_ERR(new_nsp->cgroup_ns)) {
> err = PTR_ERR(new_nsp->cgroup_ns);
> goto out_cgroup;
> }
>
> - new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
> + new_nsp->net_ns = copy_net_ns(flags, user_ns, nsproxy->net_ns);
> if (IS_ERR(new_nsp->net_ns)) {
> err = PTR_ERR(new_nsp->net_ns);
> goto out_net;
> @@ -162,7 +163,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> (CLONE_NEWIPC | CLONE_SYSVSEM))
> return -EINVAL;
>
> - new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
> + new_ns = create_new_namespaces(flags, tsk->nsproxy, user_ns, tsk->fs);
> if (IS_ERR(new_ns))
> return PTR_ERR(new_ns);
>
> @@ -203,7 +204,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> - *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
> + *new_nsp = create_new_namespaces(unshare_flags, current->nsproxy, user_ns,
> new_fs ? new_fs : current->fs);
> if (IS_ERR(*new_nsp)) {
> err = PTR_ERR(*new_nsp);
> @@ -251,7 +252,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
> if (nstype && (ns->ops->type != nstype))
> goto out;
>
> - new_nsproxy = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> + new_nsproxy = create_new_namespaces(0, tsk->nsproxy, current_user_ns(), tsk->fs);
> if (IS_ERR(new_nsproxy)) {
> err = PTR_ERR(new_nsproxy);
> goto out;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index a0fe764bd5dd..99b1e1f58d05 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -262,3 +262,7 @@ cond_syscall(sys_pkey_free);
> /* fd-based mount */
> cond_syscall(sys_fsopen);
> cond_syscall(sys_fsmount);
> +
> +/* Containers */
> +cond_syscall(sys_container_create);
> +
> diff --git a/security/security.c b/security/security.c
> index f4136ca5cb1b..b5c5b5ae1266 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -1668,3 +1668,16 @@ int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule,
> actx);
> }
> #endif /* CONFIG_AUDIT */
> +
> +#ifdef CONFIG_CONTAINERS
> +
> +int security_container_alloc(struct container *container, unsigned int flags)
> +{
> + return call_int_hook(container_alloc, 0, container, flags);
> +}
> +
> +void security_container_free(struct container *container)
> +{
> + call_void_hook(container_free, container);
> +}
> +#endif /* CONFIG_CONTAINERS */
- RGB
--
Richard Guy Briggs <rgb(a)redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
7 years, 1 month
[PATCH V3 00/10] capabilities: do not audit log BPRM_FCAPS on set*id
by Richard Guy Briggs
The audit subsystem is adding a BPRM_FCAPS record when auditing setuid
application execution (SYSCALL execve). This is not expected, as the record
was supposed to be limited to cases where the file system actually had
capabilities in an extended attribute. The record lists all capabilities,
which makes the event hard to parse and obscures what is happening. The
PATH record already records the setuid bit and owner correctly. Suppress
the BPRM_FCAPS record on set*id.
See: https://github.com/linux-audit/audit-kernel/issues/16
The first eight patches just massage the logic to make it easier to understand.
Some of them could be squashed together.
The ninth patch is the one that resolves this issue.
It would be possible to address the original issue with a change of
"!uid_eq(new->euid, root_uid) || !uid_eq(new->uid, root_uid)"
to
"!(uid_eq(new->euid, root_uid) || uid_eq(new->uid, root_uid))"
but it took me long enough to understand this logic that I don't think I'd be
doing any favours by leaving it this difficult to understand.
The final patch attempts to address all the conditions that need logging based
on mailing list conversations, recognizing that there is probably some
duplication in the logic.
Richard Guy Briggs (10):
capabilities: factor out cap_bprm_set_creds privileged root
capabilities: intuitive names for cap gain status
capabilities: rename has_cap to has_fcap
capabilities: use root_priveleged inline to clarify logic
capabilities: use intuitive names for id changes
capabilities: move audit log decision to function
capabilities: remove a layer of conditional logic
capabilities: invert logic for clarity
capabilities: fix logic for effective root or real root
capabilities: audit log other surprising conditions
security/commoncap.c | 166 ++++++++++++++++++++++++++++++++------------------
1 files changed, 107 insertions(+), 59 deletions(-)
7 years, 1 month
passwd and USER_CHAUTHTOK
by Maupertuis Philippe
Hi
On a new Red Hat 7.4 system, running passwd -S to check the status of a user generates the following event:
node=xxxxx type=USER_CHAUTHTOK msg=audit(28/08/17 16:34:18.632:54145) : pid=31134 uid=root auid=xxxxx ses=3866 msg='op=password status displayed for user id=ftp exe=/usr/bin/passwd hostname= xxxxx addr=? terminal=pts/1 res=success'
According to https://github.com/linux-audit/audit-documentation/wiki/SPEC-User-Account... USER_CHAUTHTOK means that the user has successfully changed his password.
In this case no change was made, only a query, as the msg field shows.
The text format is even more disturbing:
On xxxxx at 16:34:18 28/08/17 xxxxx, acting as root, successfully changed-password using /usr/bin/passwd
The real action and the target user (ftp) are entirely lost in the text format.
I would say that this message should not have been generated in the first place.
If I really change a user's password with passwd games, I get:
node=xxxxx type=USER_CHAUTHTOK msg=audit(28/08/17 17:04:36.683:54299) : pid=774 uid=root auid=xxxxx ses=3866 msg='op=change password id=games exe=/usr/bin/passwd hostname=xxxxx addr=? terminal=pts/1 res=success'
and in the text format:
On xxxxx at 17:04:36 28/08/17 xxxxx, acting as root, successfully changed-password games using /usr/bin/passwd
On xxxxx at 17:04:36 28/08/17 xxxxx, acting as root, successfully changed-password using /usr/bin/passwd
This time the first line accurately describes what happened, but I find the second one misleading, since it comes from the same command and is not an additional change.
Please let me know if I missed something.
Philippe
7 years, 2 months