[PATCH][RFC] audit: set wait time to zero when audit failed


Li RongQing

Wednesday, 11 September 2019

10:19 p.m.

If audit_log_start() fails because the queue is full, and kauditd is waiting for the receiving queue to empty but there is no receiver, a task is forced to wait 60 seconds for each audited syscall, so it hangs for a very long time.

In this condition, set the wait time to zero to reduce the wait, and restore the wait time when audit works again.

This partially restores the behavior from before commit 3197542482df ("audit: rework audit_log_start()").

Signed-off-by: Li RongQing <lirongqing(a)baidu.com>
Signed-off-by: Liang ZhiCheng <liangzhicheng(a)baidu.com>
---
Reboot is taking a very long time on my machine (CentOS 6u4 + kernel 5.3). TIF_SYSCALL_AUDIT is set by default, and on reboot the userspace process that receives audit messages is killed, so nothing drains the audit queue.

git bisect shows it is caused by 3197542482df ("audit: rework audit_log_start()").

 kernel/audit.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index da8dc0db5bd3..6de23599fd43 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -119,6 +119,7 @@ static u32 audit_rate_limit;
  * When set to zero, this means unlimited. */
 static u32 audit_backlog_limit = 64;
 #define AUDIT_BACKLOG_WAIT_TIME (60 * HZ)
+static u32 audit_backlog_wait_time_master = AUDIT_BACKLOG_WAIT_TIME;
 static u32 audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME;
 
 /* The identity of the user shutting down the audit system. */
@@ -435,7 +436,7 @@ static int audit_set_backlog_limit(u32 limit)
 static int audit_set_backlog_wait_time(u32 timeout)
 {
         return audit_do_config_change("audit_backlog_wait_time",
-                                      &audit_backlog_wait_time, timeout);
+                                      &audit_backlog_wait_time_master, timeout);
 }
 
 static int audit_set_enabled(u32 state)
@@ -1202,7 +1203,7 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
                 s.lost                  = atomic_read(&audit_lost);
                 s.backlog               = skb_queue_len(&audit_queue);
                 s.feature_bitmap        = AUDIT_FEATURE_BITMAP_ALL;
-                s.backlog_wait_time     = audit_backlog_wait_time;
+                s.backlog_wait_time     = audit_backlog_wait_time_master;
                 audit_send_reply(skb, seq, AUDIT_GET, 0, 0, &s, sizeof(s));
                 break;
         }
@@ -1785,11 +1786,15 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
                                         skb_queue_len(&audit_queue),
                                         audit_backlog_limit);
                                 audit_log_lost("backlog limit exceeded");
+                                audit_backlog_wait_time = 0;
                                 return NULL;
                         }
                 }
         }
 
+        if (audit_backlog_wait_time != audit_backlog_wait_time_master)
+                audit_backlog_wait_time = audit_backlog_wait_time_master;
+
         ab = audit_buffer_alloc(ctx, gfp_mask, type);
         if (!ab) {
                 audit_log_lost("out of memory in audit_log_start");
--
2.16.2


Paul Moore

Thursday, 12 September

8:01 a.m.

On Wed, Sep 11, 2019 at 11:19 PM Li RongQing <lirongqing(a)baidu.com> wrote:

...

This is typically solved by increasing the backlog using the "audit_backlog_limit" kernel parameter (link to the docs below). You might also want to investigate what is generating so many audit records prior to starting the audit daemon.

* https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html


-- paul moore www.paul-moore.com

Li,Rongqing

Sunday, 15 September

9:55 p.m.

New subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

> This is typically solved by increasing the backlog using the "audit_backlog_limit" kernel parameter (link to the docs below).

That should avoid my issue, but the default behavior does not work for me. Also, not everyone knows enough about audit; they may spend a lot of effort finding the root cause and estimating how large "audit_backlog_limit" should be.

...

> You might also want to investigate what is generating so many audit records prior to starting the audit daemon.

It is /sbin/readahead-collector; in fact, we stop auditd. We are running a reboot test, which reboots the machine repeatedly to test hardware/software.

It is the same as the following:

  auditctl -a always,exit -S all -F pid='xxx'
  kill -s 19 `pidof auditd`

Then the audited task hangs.

-RongQing

Paul Moore

Monday, 16 September

5:52 p.m.

On Sun, Sep 15, 2019 at 10:55 PM Li,Rongqing <lirongqing(a)baidu.com> wrote:

> > This is typically solved by increasing the backlog using the "audit_backlog_limit" kernel parameter (link to the docs below).
>
> That should avoid my issue, but the default behavior does not work for me. Also, not everyone knows enough about audit; they may spend a lot of effort finding the root cause and estimating how large "audit_backlog_limit" should be.

The pause/sleep behavior is desired behavior and is intended to help kauditd/auditd process the audit backlog on a busy system. If we didn't sleep the current process and give kauditd/auditd a chance to flush the backlog when it was full, a lot of bad things could happen with respect to audit. We generally select the backlog limit so that this is not a problem for most systems, although there will always be edge cases where the default does not work well; it is impossible to pick defaults that work well for every case. If you are not using audit, you can always disable it via the kernel command line, or at runtime (look at what Fedora does).

> > You might also want to investigate what is generating so many audit records prior to starting the audit daemon.
>
> It is /sbin/readahead-collector; in fact, we stop auditd. We are running a reboot test, which reboots the machine repeatedly to test hardware/software.
>
> It is the same as the following:
>
>   auditctl -a always,exit -S all -F pid='xxx'
>   kill -s 19 `pidof auditd`
>
> Then the audited task hangs.

So you are seeing this problem only when you run a test, or did you provide this as a reproducer?

-- paul moore www.paul-moore.com

Li,Rongqing

8:08 p.m.

New subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

...

-----Original Message-----
From: Paul Moore [mailto:paul@paul-moore.com]
Sent: 17 September 2019 6:52
To: Li,Rongqing <lirongqing(a)baidu.com>
Cc: Eric Paris <eparis(a)redhat.com>; linux-audit(a)redhat.com
Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

> The pause/sleep behavior is desired behavior and is intended to help kauditd/auditd process the audit backlog on a busy system. If we didn't sleep the current process and give kauditd/auditd a chance to flush the backlog when it was full, a lot of bad things could happen with respect to audit. We generally select the backlog limit so that this is not a problem for most systems, although there will always be edge cases where the default does not work well; it is impossible to pick defaults that work well for every case.

I just want it to behave as it did before 3197542482df ("audit: rework audit_log_start()"): wait 60 seconds once if auditd/readahead-collector has a problem draining the audit backlog, and once auditd/readahead-collector recovers, restore the wait time to 60 seconds.

...

> If you are not using audit, you can always disable it via the kernel command line, or at runtime (look at what Fedora does).
>
> So you are seeing this problem only when you run a test, or did you provide this as a reproducer?

  auditctl -a always,exit -S all -F ppid=`pidof sshd`
  kill -s 19 `pidof auditd`
  ssh root@127.0.0.1

Then ssh hangs forever.

-Li RongQing

...

-- paul moore www.paul-moore.com

Paul Moore

Tuesday, 17 September

2:16 p.m.

On Mon, Sep 16, 2019 at 9:08 PM Li,Rongqing <lirongqing(a)baidu.com> wrote:

> I just want it to behave as it did before 3197542482df ("audit: rework audit_log_start()"): wait 60 seconds once if auditd/readahead-collector has a problem draining the audit backlog.

The patch you mention fixed what was deemed to be buggy behavior; as mentioned previously in this thread I see no good reason to go back to the old behavior.

>   auditctl -a always,exit -S all -F ppid=`pidof sshd`
>   kill -s 19 `pidof auditd`
>   ssh root@127.0.0.1
>
> Then ssh hangs forever.

That is expected behavior. You are putting a massive audit load on the system by telling the kernel to audit every syscall that sshd makes, then you are intentionally killing the audit daemon and attempting to ssh into the system. The proper fix(es) here would be to 1) set reasonable audit rules and/or 2) use an init system that monitors and restarts auditd when it fails (systemd has this capability, I believe some others do as well).

-- paul moore www.paul-moore.com

Li,Rongqing

8:07 p.m.

New subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

...

-----Original Message-----
From: Paul Moore [mailto:paul@paul-moore.com]
Sent: 18 September 2019 3:17
To: Li,Rongqing <lirongqing(a)baidu.com>
Cc: Eric Paris <eparis(a)redhat.com>; linux-audit(a)redhat.com
Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

> That is expected behavior. You are putting a massive audit load on the system by telling the kernel to audit every syscall that sshd makes, then you are intentionally killing the audit daemon and attempting to ssh into the system. The proper fix(es) here would be to 1) set reasonable audit rules and/or 2) use an init system that monitors and restarts auditd when it fails (systemd has this capability, I believe some others do as well).

Neither works. auditd is not dead; it is in the stopped state (kill -s 19), so systemd/init will not restart it. Even with few audit rules, after multiple accesses the backlog will fill up because there is no receiver.

I think the original behavior may be better:

commit ac4cec443a80bfde829516e7a7db10f7325aa528
Author: David Woodhouse <dwmw2(a)shinybook.infradead.org>
Date:   Sat Jul 2 14:08:48 2005 +0100

    AUDIT: Stop waiting for backlog after audit_panic() happens

    We force a rate-limit on auditable events by making them wait for space
    on the backlog queue. However, if auditd really is AWOL then this could
    potentially bring the entire system to a halt, depending on the audit
    rules in effect.

    Firstly, make sure the wait time is honoured correctly -- it's the
    maximum time the process should wait, rather than the time to wait
    _each_ time round the loop. We were getting re-woken _each_ time a
    packet was dequeued, and the timeout was being restarted each time.

    Secondly, reset the wait time after audit_panic() is called. In general
    this will be reset to zero, to allow progress to be made. If the system
    is configured to _actually_ panic on audit_panic() then that will
    already have happened; otherwise we know that audit records are being
    lost anyway.

    These two tunables can't be exposed via AUDIT_GET and AUDIT_SET because
    those aren't particularly well-designed. It probably should have been
    done by sysctls or sysfs anyway -- one for a later patch.

Thanks

-RongQing
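The first point of the quoted 2005 commit message (the wait time is a maximum for the whole wait, not a fresh timeout on every trip round the loop) can be illustrated with a small userspace sketch. This is not the kernel implementation: `total_wait_restarting`, `total_wait_bounded` and the `wake_after` array are hypothetical names, and each array element is assumed to be the time until the sleeper is next woken by a dequeue while the queue is still full.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two durations. */
static unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/* Timeout restarted on every wake-up: each sleep may last up to `timeout`. */
static unsigned total_wait_restarting(unsigned timeout,
                                      const unsigned *wake_after, size_t n)
{
    assert(wake_after != NULL || n == 0);

    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += min_u(wake_after[i], timeout);
    return total;
}

/* Timeout treated as one overall budget that shrinks with every sleep. */
static unsigned total_wait_bounded(unsigned timeout,
                                   const unsigned *wake_after, size_t n)
{
    assert(wake_after != NULL || n == 0);

    unsigned remaining = timeout;
    for (size_t i = 0; i < n && remaining > 0; i++)
        remaining -= min_u(wake_after[i], remaining);
    return timeout - remaining;
}
```

With a 60-second timeout and three wake-ups 50 seconds apart, the restarting variant blocks for 150 seconds in total, while the bounded variant stops at 60.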

...

-- paul moore www.paul-moore.com

Paul Moore

Wednesday, 18 September

7:23 a.m.

On Tue, Sep 17, 2019 at 9:07 PM Li,Rongqing <lirongqing(a)baidu.com> wrote:

> Neither works. auditd is not dead; it is in the stopped state (kill -s 19), so systemd/init will not restart it. Even with few audit rules, after multiple accesses the backlog will fill up because there is no receiver.

Fair point, however I still stand by my previous comments that there are runtime configuration knobs which can mitigate this problem if it is something you are concerned about. Depending on the situation, you can either increase the backlog to deal with transient problems, or decrease the backlog wait time (possibly to zero) to prevent blocking entirely.

-- paul moore www.paul-moore.com

Li,Rongqing

8:50 p.m.

New subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

...

-----Original Message-----
From: Paul Moore [mailto:paul@paul-moore.com]
Sent: 18 September 2019 20:23
To: Li,Rongqing <lirongqing(a)baidu.com>
Cc: Eric Paris <eparis(a)redhat.com>; linux-audit(a)redhat.com
Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

> Fair point, however I still stand by my previous comments that there are runtime configuration knobs which can mitigate this problem if it is something you are concerned about. Depending on the situation, you can either increase the backlog to deal with transient problems, or decrease the backlog wait time (possibly to zero) to prevent blocking entirely.

No need for knobs; auditctl can change the backlog length and wait time. And changing the backlog length does not help if auditd is hung forever, since a task can hang forever because of disk/filesystem faults, etc.

I am talking about audit's default behavior, which was changed. I really did hit the issue described in the commit below; if we make this change, others can avoid it.

commit ac4cec443a80bfde829516e7a7db10f7325aa528
Author: David Woodhouse <dwmw2(a)shinybook.infradead.org>
Date:   Sat Jul 2 14:08:48 2005 +0100

    AUDIT: Stop waiting for backlog after audit_panic() happens

    We force a rate-limit on auditable events by making them wait for space
    on the backlog queue. However, if auditd really is AWOL then this could
    potentially bring the entire system to a halt, depending on the audit
    rules in effect.

Another way to avoid this issue is to make audit_backlog_wait_time 0 by default:

diff --git a/kernel/audit.c b/kernel/audit.c
index da8dc0db5bd3..0a7f7c290644 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -119,7 +119,7 @@ static u32 audit_rate_limit;
  * When set to zero, this means unlimited. */
 static u32 audit_backlog_limit = 64;
 #define AUDIT_BACKLOG_WAIT_TIME (60 * HZ)
-static u32 audit_backlog_wait_time = AUDIT_BACKLOG_WAIT_TIME;
+static u32 audit_backlog_wait_time = 0;
 
 /* The identity of the user shutting down the audit system. */
 kuid_t audit_sig_uid = INVALID_UID;

-RongQing

...

-- paul moore www.paul-moore.com

Paul Moore

9:30 p.m.

On Wed, Sep 18, 2019 at 9:50 PM Li,Rongqing <lirongqing(a)baidu.com> wrote:

> No need for knobs; auditctl can change the backlog length and wait time.

That is what I meant by "knobs". The term "knobs" is commonly used to reference some method of changing the configuration.

...

> And changing the backlog length does not help if auditd is hung forever, as a task can hang forever due to disk/filesystem abnormalities, etc.

In this case changing the wait time would work (as previously mentioned). It is worth noting that the current code does not suffer from a "hung forever" problem if the audit queue is blocked; it may slow down quite a bit (dependent on the audit_backlog_wait_time variable), but it should still make forward progress.
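As a rough illustration of the "slow but still making progress" point, here is a user-space C sketch. It is not the kernel code: the struct and the functions model_log_record() and total_delay_ticks() are hypothetical names, and the model assumes that while the backlog is over the limit and nothing drains the queue, each record waits at most the configured wait time and is then dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space model; NOT the kernel implementation. */
struct audit_model {
    uint32_t backlog;          /* records currently queued */
    uint32_t backlog_limit;    /* 0 means unlimited */
    uint32_t wait_ticks;       /* stands in for audit_backlog_wait_time */
    bool     daemon_draining;  /* false models a stopped auditd */
};

/*
 * Model one attempt to log a record.
 * Returns true if the record was queued, false if it was dropped.
 * *waited receives the ticks the caller spent blocked.
 */
static bool model_log_record(struct audit_model *m, uint32_t *waited)
{
    assert(m != NULL && waited != NULL);
    *waited = 0;
    if (m->backlog_limit != 0 && m->backlog > m->backlog_limit) {
        if (!m->daemon_draining) {
            /* Assumption: wait the full timeout, then give up. */
            *waited = m->wait_ticks;
            return false;
        }
        /* Assumption: a running daemon empties the queue at once. */
        m->backlog = 0;
    }
    m->backlog++;
    return true;
}

/* Total ticks a task spends blocked while emitting `records` records. */
static uint64_t total_delay_ticks(struct audit_model *m, uint32_t records)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < records; i++) {
        uint32_t waited;
        (void)model_log_record(m, &waited);
        total += waited;
    }
    return total;
}
```

In this model, with a limit of 2, a wait of 60 ticks and a stopped daemon, five records cost 120 ticks: the first three are queued, and the last two each wait 60 ticks and are dropped. With wait_ticks set to 0 the same five records cost nothing, which is the effect of lowering the wait time to zero.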

...

> I am talking about the default audit behavior, which was changed. I really did hit the issue described in the commit below; if we make this change, others can avoid it.

If we were hearing more reports of problems with the current defaults I would be inclined to change them, but to the best of my knowledge you are the only one who has run into this problem, so I would rather you simply update your audit configuration.

--
paul moore
www.paul-moore.com

Steve Grubb

9:33 p.m.

On Thu, 19 Sep 2019 01:50:05 +0000 "Li,Rongqing" <lirongqing(a)baidu.com> wrote:

...

I'd like to offer an opinion because this is a long-term issue that we have faced, and what exists is the result of having to meet certain requirements.

If the machine boots with audit=0, which I think is the default, then the end user has no expectation of audit ever being in use. Audit events may be discarded if the backlog fills up. If, however, the machine boots with audit=1, then the user is expecting that there will eventually be an audit daemon and they want all events. All of them, without fail. So, we have to take all measures to deliver those events, because this is required by Common Criteria as well as other security standards such as PCI-DSS.

So, there are 2 paths. One which does not care about audit and one that does. The original behavior did not meet requirements. If there is any patch that fixes this, it would be to not have an audit backlog wait time if audit has never been enabled. We have to be careful to consider audit never enabled, audit disabled but previously enabled, and audit enabled.

HTH...
-Steve
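The three cases named above can be sketched as a small policy function. This is only a hypothetical user-space illustration, not a proposed kernel patch: the enum audit_history, its values, and effective_backlog_wait() are invented names. The message does not say how the "disabled but previously enabled" case should behave, so the sketch assumes it keeps the configured wait time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical states, one per case listed in the message. */
enum audit_history {
    HIST_NEVER_ENABLED,               /* audit has never been turned on */
    HIST_DISABLED_PREVIOUSLY_ENABLED, /* was on at some point, now off */
    HIST_CURRENTLY_ENABLED,           /* audit is on */
};

/*
 * Return the backlog wait time a task would be subject to.
 * Only the never-enabled case skips the wait, per the suggestion above.
 */
static uint32_t effective_backlog_wait(enum audit_history h,
                                       uint32_t configured_wait)
{
    switch (h) {
    case HIST_NEVER_ENABLED:
        return 0; /* nobody expects the events, so never block */
    case HIST_DISABLED_PREVIOUSLY_ENABLED:
        /* Assumption: stay conservative and keep the configured wait. */
        return configured_wait;
    case HIST_CURRENTLY_ENABLED:
        return configured_wait;
    }
    assert(0 && "unknown audit_history value");
    return configured_wait;
}

/* Small self-check showing the intended results. */
static void example_policy(void)
{
    assert(effective_backlog_wait(HIST_NEVER_ENABLED, 60) == 0);
    assert(effective_backlog_wait(HIST_CURRENTLY_ENABLED, 60) == 60);
}
```

Keeping the decision in one function makes the open question visible: whoever settles the middle case only has to change one branch.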

Li,Rongqing

Thursday, 19 September 2019

2:12 a.m.

New subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

...

-----Original Message-----
From: Steve Grubb [mailto:sgrubb@redhat.com]
Sent: 19 September 2019 10:34
To: Li,Rongqing <lirongqing(a)baidu.com>
Cc: Paul Moore <paul(a)paul-moore.com>; linux-audit(a)redhat.com
Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed

On Thu, 19 Sep 2019 01:50:05 +0000 "Li,Rongqing" <lirongqing(a)baidu.com> wrote:

> No need for knobs; auditctl can change the backlog length and wait time.
> And changing the backlog length does not help if auditd is hung
> forever, as a task can hang forever due to disk/filesystem
> abnormalities, etc.
>
> I am talking about the default audit behavior, which was changed. I
> really did hit the issue described in the commit below; if we make
> this change, others can avoid it.

I'd like to offer an opinion because this is a long-term issue that we have faced, and what exists is the result of having to meet certain requirements.

If the machine boots with audit=0, which I think is the default, then the end user has no expectation of audit ever being in use. Audit events may be discarded if the backlog fills up. If, however, the machine boots with audit=1, then the user is expecting that there will eventually be an audit daemon and they want all events. All of them, without fail. So, we have to take all measures to deliver those events, because this is required by Common Criteria as well as other security standards such as PCI-DSS.

Ok, I see Thanks -RongQing

...

So, there are 2 paths. One which does not care about audit and one that does. The original behavior did not meet requirements. If there is any patch that fixes this, it would be to not have an audit backlog wait time if audit has never been enabled. We have to be careful to consider audit never enabled, audit disabled but previously enabled, and audit enabled. HTH... -Steve

Richard Guy Briggs

9:04 a.m.

On 2019-09-18 22:33, Steve Grubb wrote:

...

On Thu, 19 Sep 2019 01:50:05 +0000 "Li,Rongqing" <lirongqing(a)baidu.com> wrote:

> No need for knobs; auditctl can change the backlog length and wait time.
> And changing the backlog length does not help if auditd is hung
> forever, as a task can hang forever due to disk/filesystem
> abnormalities, etc.
>
> I am talking about the default audit behavior, which was changed. I
> really did hit the issue described in the commit below; if we make
> this change, others can avoid it.

I'd like to offer an opinion because this is a long-term issue that we have faced, and what exists is the result of having to meet certain requirements.

If the machine boots with audit=0, which I think is the default, then the end user has no expectation of audit ever being in use. Audit events may be discarded if the backlog fills up.

In fact, the default is neither explicit audit=0 nor audit=1. The case above is the default where the audit subsystem is inactive until an audit daemon registers with the kernel. In the case of an explicit kernel command line of audit=0, audit is disabled until reboot and a daemon cannot register.
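The distinction drawn here can be written down as a small table in code. This is a user-space sketch under stated assumptions, not the kernel's parameter handling: enum boot_audit_mode, parse_audit_param(), daemon_can_register() and collects_before_daemon() are hypothetical names, only the values "0" and "1" discussed in this thread are recognized, and anything else is treated like a missing parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the three boot-time cases described above. */
enum boot_audit_mode {
    BOOT_AUDIT_UNSET, /* no audit= given: inactive until a daemon registers */
    BOOT_AUDIT_OFF,   /* audit=0: disabled until reboot */
    BOOT_AUDIT_ON,    /* audit=1: events are wanted from boot */
};

/* Map the value of an audit= parameter (NULL if absent) to a mode. */
static enum boot_audit_mode parse_audit_param(const char *value)
{
    if (value == NULL)
        return BOOT_AUDIT_UNSET;
    if (strcmp(value, "0") == 0)
        return BOOT_AUDIT_OFF;
    if (strcmp(value, "1") == 0)
        return BOOT_AUDIT_ON;
    /* Assumption: unrecognized values behave like an absent parameter. */
    return BOOT_AUDIT_UNSET;
}

/* With an explicit audit=0, a daemon cannot register until reboot. */
static bool daemon_can_register(enum boot_audit_mode mode)
{
    return mode != BOOT_AUDIT_OFF;
}

/* Only audit=1 asks for events before any daemon has registered. */
static bool collects_before_daemon(enum boot_audit_mode mode)
{
    return mode == BOOT_AUDIT_ON;
}

/* Small self-check of the default (no parameter) case. */
static void example_boot_modes(void)
{
    enum boot_audit_mode def = parse_audit_param(NULL);
    assert(daemon_can_register(def));
    assert(!collects_before_daemon(def));
}
```

The default case is the one where a daemon may still register but nothing is collected beforehand, which is why it differs from both explicit settings.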

...

If, however, the machine boots with audit=1, then the user is expecting that there will eventually be an audit daemon and they want all events. All of them, without fail. So, we have to take all measures to deliver those events, because this is required by Common Criteria as well as other security standards such as PCI-DSS.

So, there are 2 paths. One which does not care about audit and one that does. The original behavior did not meet requirements. If there is any patch that fixes this, it would be to not have an audit backlog wait time if audit has never been enabled. We have to be careful to consider audit never enabled, audit disabled but previously enabled, and audit enabled.

HTH...
-Steve

- RGB

--
Richard Guy Briggs <rgb(a)redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


linux-audit@lists.linux-audit.osci.io


participants (5)

Li RongQing
Li,Rongqing
Paul Moore
Richard Guy Briggs
Steve Grubb
