Greetings audit mailing list,
I have an AWS instance running an Amazon Linux kernel
(4.9.77-31.58.amzn1.x86_64) with CentOS 6 as the base OS. The instance
already had one program holding the audit handle (the Lacework agent,
proprietary closed-source software). I then installed and started osquery,
which attempted to acquire the audit handle. I don't know whether osquery
succeeded, because I was unable to SSH into the server to investigate and
ended up having to restart the instance. Almost immediately after osquery
started, sshd got stuck in D state.
syslog has a hung-task warning and a backtrace that provide some information:
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.724101] INFO: task sshd:1840 blocked for more than 10 seconds.
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.727856] Tainted: G            E   4.9.77-31.58.amzn1.x86_64 #1
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.731856] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736283] sshd            D    0  1840   1839 0x00000080
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736287] 0000000000000000 ffff8802025c6540 ffff88003684d940 ffff880205f3bb80
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736290] ffff8802072582c0 ffffc900049bfc60 ffffffff81556e62 0000000000000001
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736292] 004200ca00000001 ffff8802072582c0 0000000000000000 ffffffff81a65140
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736295] Call Trace:
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736302] [<ffffffff81556e62>] ? __schedule+0x242/0x700
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736304] [<ffffffff8155734c>] schedule+0x2c/0x80
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736306] [<ffffffff815575ee>] schedule_preempt_disabled+0xe/0x10
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736308] [<ffffffff81558f05>] __mutex_lock_slowpath+0x95/0x110
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736311] [<ffffffff8147a6f8>] ? __alloc_skb+0x78/0x1e0
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736313] [<ffffffff81558f97>] mutex_lock+0x17/0x30
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736316] [<ffffffff811178bd>] audit_receive+0x1d/0x90
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736319] [<ffffffff814c4976>] netlink_unicast+0x176/0x220
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736320] [<ffffffff814c4cf6>] netlink_sendmsg+0x2d6/0x390
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736323] [<ffffffff814719fe>] sock_sendmsg+0x3e/0x50
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736324] [<ffffffff81471ead>] SYSC_sendto+0x11d/0x150
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736327] [<ffffffff8111c68b>] ? __audit_syscall_entry+0xbb/0x100
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736330] [<ffffffff81003478>] ? syscall_trace_enter+0x1c8/0x2c0
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736332] [<ffffffff814728ee>] SyS_sendto+0xe/0x10
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736333] [<ffffffff81003b09>] do_syscall_64+0x59/0xc0
Jun 14 20:54:53 packer_default-10-180-21-138 kernel: [12191989.736335] [<ffffffff8155bd70>] entry_SYSCALL64_slow_path+0x25/0x25
I've been doing some reading (e.g.
https://www.redhat.com/archives/linux-audit/2016-February/msg00025.html )
and my understanding is that osquery should have been able to acquire the
audit handle, trampling Lacework's ("last one wins"), but I don't have
access to the Lacework code to know how it might handle that situation (I'm
engaging their support separately). I also noticed that the patch set for
4.17 seemed to include some changes around the code path in the backtrace.
I'm trying to understand this behavior and determine whether it's a bug,
whether that bug has already been fixed by a patch between 4.9.77 and 4.17,
or whether the issue is simply a gap in my understanding of how the audit
system behaves. It is surprising to me that an audit-related issue would
leave sshd stuck in D state. Several other processes on the system
continued running without incident, but processes launched from cron also
got stuck in D state. I would be grateful for some expert insight. If this
isn't a bug and is a misunderstanding on my part, is there any way to
configure the audit system so that an issue like this won't leave processes
stuck in D state?
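For reference, here is my mental model of the "last one wins" registration,
sketched in Python. This is only my reading of linux/audit.h and
linux/netlink.h, not authoritative: a would-be audit daemon sends an
AUDIT_SET message whose status mask has the AUDIT_STATUS_PID bit set, and
the kernel simply records the new pid as the audit daemon, which is why the
last writer wins. The struct audit_status layout below is the original
eight __u32 fields; newer kernels append more fields to it.

```python
import os
import struct

# Constants from my reading of linux/netlink.h and linux/audit.h
# (not authoritative).
NLMSG_HDRLEN = 16
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04
AUDIT_SET = 1001           # message type: set audit status
AUDIT_STATUS_PID = 0x0004  # mask bit: the 'pid' field below is valid

def build_audit_set_pid(pid, seq=1):
    """Build the netlink message a daemon would send to register itself
    as the audit daemon.  The kernel just records the new pid, so
    whichever process sent this message last holds the audit handle."""
    # struct audit_status (original layout): mask, enabled, failure,
    # pid, rate_limit, backlog_limit, lost, backlog -- all __u32.
    payload = struct.pack("8I", AUDIT_STATUS_PID, 0, 0, pid, 0, 0, 0, 0)
    # struct nlmsghdr: len, type, flags, seq, pid
    header = struct.pack("IHHII", NLMSG_HDRLEN + len(payload),
                         AUDIT_SET, NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    return header + payload

# Construct (but do not send) the registration message for this process.
msg = build_audit_set_pid(os.getpid())
```

A real daemon would send this over an AF_NETLINK/NETLINK_AUDIT socket,
which requires CAP_AUDIT_CONTROL; the message is only constructed here,
not sent.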
Thanks,
Preston Bennes