-----Original Message-----
From: Paul Moore [mailto:paul@paul-moore.com]
Sent: September 18, 2019 3:17
To: Li,Rongqing <lirongqing(a)baidu.com>
Cc: Eric Paris <eparis(a)redhat.com>; linux-audit(a)redhat.com
Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed
On Mon, Sep 16, 2019 at 9:08 PM Li,Rongqing <lirongqing(a)baidu.com> wrote:
> > -----Original Message-----
> > From: Paul Moore [mailto:paul@paul-moore.com]
> > Sent: September 17, 2019 6:52
> > To: Li,Rongqing <lirongqing(a)baidu.com>
> > Cc: Eric Paris <eparis(a)redhat.com>; linux-audit(a)redhat.com
> > Subject: Re: [PATCH][RFC] audit: set wait time to zero when audit failed
> >
> > On Sun, Sep 15, 2019 at 10:55 PM Li,Rongqing <lirongqing(a)baidu.com> wrote:
> > > > > if audit_log_start failed because the queue is full, kauditd is
> > > > > waiting for the receiving queue to empty, but with no receiver a
> > > > > task will be forced to wait 60 seconds for each audited syscall,
> > > > > and it will hang for a very long time
> > > > >
> > > > > so in this condition, set the wait time to zero to reduce the
> > > > > wait, and restore the wait time when audit works again
> > > > >
> > > > > it partially reverts commit 3197542482df ("audit: rework
> > > > > audit_log_start()")
> > > > >
> > > > > Signed-off-by: Li RongQing <lirongqing(a)baidu.com>
> > > > > Signed-off-by: Liang ZhiCheng <liangzhicheng(a)baidu.com>
> > > > > ---
> > > > > reboot is taking a very long time on my machine (CentOS 6u4 +
> > > > > kernel 5.3) since TIF_SYSCALL_AUDIT is set by default, and on
> > > > > reboot the userspace process which receives audit messages is
> > > > > killed, so nothing drains the audit queue
> > > > >
> > > > > git bisect shows it is caused by 3197542482df ("audit: rework
> > > > > audit_log_start()")
> > > > >
> > > > > kernel/audit.c | 9 +++++++--
> > > > > 1 file changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > This is typically solved by increasing the backlog using the
> > > > "audit_backlog_limit" kernel parameter (link to the docs below).
> > >
> > > It should be able to avoid my issue, but the default behavior does
> > > not work for me. And not everyone has enough knowledge about audit;
> > > they may spend a lot of effort finding the root cause and estimating
> > > how large "audit_backlog_limit" should be.
> >
> > The pause/sleep behavior is desired behavior and is intended to help
> > kauditd/auditd process the audit backlog on a busy system. If we
> > didn't sleep the current process and give kauditd/auditd a chance to
> > flush the backlog when it was full, a lot of bad things could happen
> > with respect to audit. We generally select the backlog limit so
> > that this is not a problem for most systems, although there will
> > always be edge cases where the default does not work well; it is
> > impossible to pick defaults that work well for every case.
> >
>
> I just want it to behave as before 3197542482df ("audit: rework
> audit_log_start()"): wait 60 seconds once if
> auditd/readahead-collector has some problem draining the audit backlog.
The patch you mention fixed what was deemed to be buggy behavior; as
mentioned previously in this thread I see no good reason to go back to the old
behavior.
> > If you are not using audit, you can always disable it via the kernel
> > command line, or at runtime (look at what Fedora does).
> >
> > > > You might also want to investigate what is generating so many
> > > > audit records prior to starting the audit daemon.
> > >
> > > It is /sbin/readahead-collector; in fact, we stop auditd. We are
> > > doing a reboot test, which reboots the machine continuously to test
> > > hardware/software.
> > >
> > > it is same as below:
> > > auditctl -a always,exit -S all -F pid='xxx'
> > > kill -s 19 `pidof auditd`
> > >
> > > then the audited task will be hung
> >
> > So you are seeing this problem only when you run a test, or did you
> > provide this as a reproducer?
>
> auditctl -a always,exit -S all -F ppid=`pidof sshd`
> kill -s 19 `pidof auditd`
> ssh root(a)127.0.0.1
>
> then ssh will be hung forever
That is expected behavior. You are putting a massive audit load on the system
by telling the kernel to audit every syscall that sshd makes, then you are
intentionally killing the audit daemon and attempting to ssh into the system.
The proper fix(es) here would be to 1) set reasonable audit rules and/or 2) use
an init system that monitors and restarts auditd when it fails (systemd has this
capability, I believe some others do as well).
Neither works.
auditd is not dead; it is in the stopped state (kill -s 19), so systemd/init
will not restart it.
Even with only a few audit rules, after multiple accesses the backlog will
fill up because there is no receiver.
Either way, I think the original behavior may be better:
commit ac4cec443a80bfde829516e7a7db10f7325aa528
Author: David Woodhouse <dwmw2(a)shinybook.infradead.org>
Date:   Sat Jul 2 14:08:48 2005 +0100

    AUDIT: Stop waiting for backlog after audit_panic() happens

    We force a rate-limit on auditable events by making them wait for space
    on the backlog queue. However, if auditd really is AWOL then this could
    potentially bring the entire system to a halt, depending on the audit
    rules in effect.

    Firstly, make sure the wait time is honoured correctly -- it's the
    maximum time the process should wait, rather than the time to wait
    _each_ time round the loop. We were getting re-woken _each_ time a
    packet was dequeued, and the timeout was being restarted each time.

    Secondly, reset the wait time after audit_panic() is called. In general
    this will be reset to zero, to allow progress to be made. If the system
    is configured to _actually_ panic on audit_panic() then that will
    already have happened; otherwise we know that audit records are being
    lost anyway.

    These two tunables can't be exposed via AUDIT_GET and AUDIT_SET because
    those aren't particularly well-designed. It probably should have been
    done by sysctls or sysfs anyway -- one for a later patch.
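
To make the two points in that commit message concrete, here is a rough
userspace model in Python. It is only a sketch and not the kernel code: the
names wait_restarting, wait_with_deadline, Backlog, log and drain are made up
for illustration, time is a plain number of seconds, and the queue is assumed
to stay full for the whole wait.

```python
def wait_restarting(wakeups, timeout):
    """Give-up time when the timeout is re-armed on every wakeup.

    `wakeups` are the times (seconds from the start of the wait) at which
    the waiter is woken while the queue is still full.
    """
    now = 0
    for t in sorted(wakeups):
        if t >= now + timeout:
            break          # the current timeout expires before this wakeup
        now = t            # woken early: sleep again for a full timeout
    return now + timeout


def wait_with_deadline(wakeups, timeout):
    """Give-up time when `timeout` caps the total wait."""
    deadline = timeout     # fixed once, at the start of the wait
    now = 0
    for t in sorted(wakeups):
        if t >= deadline:
            break
        now = t            # woken early: only the remaining time is left
    return now + (deadline - now)


class Backlog:
    """Toy backlog: after one failed wait, later callers stop waiting."""

    def __init__(self, limit, wait_time):
        self.limit = limit
        self.default_wait = wait_time
        self.wait_time = wait_time
        self.queued = 0
        self.lost = 0

    def log(self):
        """Try to queue one record; return how long the caller waited."""
        if self.queued < self.limit:
            self.queued += 1
            return 0
        waited = self.wait_time   # no receiver, so the wait is in vain
        self.lost += 1            # the record is dropped
        self.wait_time = 0        # do not stall the next caller as well
        return waited

    def drain(self):
        """A receiver empties the queue; waiting is enabled again."""
        self.queued = 0
        self.wait_time = self.default_wait


if __name__ == "__main__":
    wakeups = range(10, 80, 10)   # woken every 10 s, queue never drains
    print(wait_restarting(wakeups, 60), wait_with_deadline(wakeups, 60))
    backlog = Backlog(limit=2, wait_time=60)
    print([backlog.log() for _ in range(5)])
```

In this model a waiter that keeps being woken gives up later than the
configured wait time when the timeout restarts, but exactly at the wait time
when it is treated as a deadline. With a stopped receiver, only the first
caller that hits the full backlog pays the 60 seconds.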
Thanks
-RongQing