Moving this discussion to kernel mailing lists.
Problem description:
Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
bpf_probe_read_compat call within a sched_switch tracepoint. The
problem is reproducible with the reg_alloc3 testcase from SystemTap's
BPF backend testsuite on x86_64 as well as the runqlat,runqslower
tools from bcc on ppc64le. Example stack trace from [1]:
[ 730.868702] stack backtrace:
[ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted
5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
[ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.13.0-2.fc32 04/01/2014
[ 730.873278] Call Trace:
[ 730.873770] dump_stack+0x7f/0xa1
[ 730.874433] check_noncircular+0xdf/0x100
[ 730.875232] __lock_acquire+0x1202/0x1e10
[ 730.876031] ? __lock_acquire+0xfc0/0x1e10
[ 730.876844] lock_acquire+0xc2/0x3a0
[ 730.877551] ? __wake_up_common_lock+0x52/0x90
[ 730.878434] ? lock_acquire+0xc2/0x3a0
[ 730.879186] ? lock_is_held_type+0xa7/0x120
[ 730.880044] ? skb_queue_tail+0x1b/0x50
[ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90
[ 730.881656] ? __wake_up_common_lock+0x52/0x90
[ 730.882532] __wake_up_common_lock+0x52/0x90
[ 730.883375] audit_log_end+0x5b/0x100
[ 730.884104] slow_avc_audit+0x69/0x90
[ 730.884836] avc_has_perm+0x8b/0xb0
[ 730.885532] selinux_lockdown+0xa5/0xd0
[ 730.886297] security_locked_down+0x20/0x40
[ 730.887133] bpf_probe_read_compat+0x66/0xd0
[ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820
[ 730.888917] trace_call_bpf+0xe9/0x240
[ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0
[ 730.890579] perf_trace_sched_switch+0x142/0x180
[ 730.891485] ? __schedule+0x6d8/0xb20
[ 730.892209] __schedule+0x6d8/0xb20
[ 730.892899] schedule+0x5b/0xc0
[ 730.893522] exit_to_user_mode_prepare+0x11d/0x240
[ 730.894457] syscall_exit_to_user_mode+0x27/0x70
[ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae
Jiri Olsa also reports seeing a similar deadlock at v5.10. I'm in the
middle of double-checking my bisection which ended up at a
seemingly-unrelated commit [2]
[1]
https://bugzilla.redhat.com/show_bug.cgi?id=1938312
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit...
Reasonable amount of context below:
On Thu, Mar 18, 2021 at 7:48 AM Jiri Olsa <jolsa(a)redhat.com> wrote:
> In that case the issue is in the selinux / audit department, not
on bpf side.
>
> To be honest, I'm actually puzzled that from bpf_probe_read_*() we end up
sending
> audit messages, this seems highly questionable given those BPF helpers are used in
> performance critical code, and they can be called from any contexts. So going and
> allocating an skb for audit is just completely wrong. It should probably be at min
> avc_has_perm_noaudit() if anything ...
I just noticed this discussion is not on the list ;-)
let's bring it there and include some audit folks
Yes, my apologies. This started as a quick note from me to Daniel to
glance at the RHBZ and the cc:s gradually snowballed from there.
- Serhei
jirka
>
> > ----
> > [ 56.866377] =============================
> > [ 56.866397] [ BUG: Invalid wait context ]
> > [ 56.866407] 5.11.0 #4 Tainted: G E
> > [ 56.866420] -----------------------------
> > [ 56.866438] swapper/69/0 is trying to lock:
> > [ 56.866458] c000000002120038 (notif_lock){....}-{3:3}, at:
avc_compute_av.isra.0+0x14c/0x430
> > [ 56.866508] other info that might help us debug this:
> > [ 56.866528] context-{2:2}
> > [ 56.866545] 3 locks held by swapper/69/0:
> > [ 56.866566] #0: c000001fff1f7a98 (&rq->lock){-.-.}-{2:2}, at:
sched_ttwu_pending+0x5c/0x1e0
> > [ 56.866613] #1: c00000000208b9d8 (rcu_read_lock){....}-{1:3}, at:
bpf_trace_run1+0x8/0x240
> > [ 56.866659] #2: c00000000208b9d8 (rcu_read_lock){....}-{1:3}, at:
avc_compute_av.isra.0+0x7c/0x430
> > [ 56.866704] stack backtrace:
> > [ 56.866724] CPU: 69 PID: 0 Comm: swapper/69 Tainted: G E
5.11.0 #4
> > [ 56.866761] Call Trace:
> > [ 56.866778] [c0000000109fb310] [c000000000a42784] dump_stack+0xe8/0x144
(unreliable)
> > [ 56.866817] [c0000000109fb360] [c0000000001f02a0]
__lock_acquire+0xaa0/0x2800
> > [ 56.866857] [c0000000109fb490] [c0000000001f2b40]
lock_acquire.part.0+0xc0/0x390
> > [ 56.866885] [c0000000109fb570] [c00000000118af0c]
_raw_spin_lock_irqsave+0x6c/0xc0
> > [ 56.866923] [c0000000109fb5b0] [c00000000089cc4c]
avc_compute_av.isra.0+0x14c/0x430
> > [ 56.866961] [c0000000109fb670] [c00000000089e0a0] avc_has_perm+0x2c0/0x300
> > [ 56.866997] [c0000000109fb780] [c0000000008a7d34]
selinux_lockdown+0xd4/0x100
> > [ 56.867034] [c0000000109fb810] [c000000000891140]
security_locked_down+0x50/0xb0
> > [ 56.867086] [c0000000109fb840] [c000000000346b7c]
bpf_probe_read_compat+0xbc/0x130
> > [ 56.867125] [c0000000109fb880] [c00800000e63bd38]
bpf_prog_3de2db9929262fab_raw_tracepoint__sched_wakeup+0x5c/0x4324
> > [ 56.867167] [c0000000109fb8f0] [c000000000349784] bpf_trace_run1+0xe4/0x240
> > [ 56.867204] [c0000000109fb940] [c00000000018f238]
__bpf_trace_sched_wakeup_template+0x18/0x30
> > [ 56.867243] [c0000000109fb960] [c000000000190834]
trace_sched_wakeup+0xe4/0x200
> > [ 56.867281] [c0000000109fb9a0] [c0000000001983bc] ttwu_do_wakeup+0x4c/0x1f0
> > [ 56.867317] [c0000000109fba20] [c00000000019c190]
sched_ttwu_pending+0x120/0x1e0
> > [ 56.867355] [c0000000109fbac0] [c00000000026cd6c]
flush_smp_call_function_queue+0x1bc/0x3c0
> > [ 56.867397] [c0000000109fbb50] [c000000000059fd4]
smp_ipi_demux_relaxed+0xf4/0x100
> > [ 56.867436] [c0000000109fbb90] [c0000000000537fc]
doorbell_exception+0xbc/0x370
> > [ 56.867474] [c0000000109fbbd0] [c0000000000168d4]
replay_soft_interrupts+0x1f4/0x2d0
> > [ 56.867512] [c0000000109fbdb0] [c000000000016a20]
arch_local_irq_restore+0x70/0xe0
> > [ 56.867550] [c0000000109fbde0] [c000000000df9d34]
cpuidle_enter_state+0x124/0x500
> > [ 56.867587] [c0000000109fbe40] [c000000000dfa1ac] cpuidle_enter+0x4c/0x70
> > [ 56.867613] [c0000000109fbe80] [c0000000001a5dc8] do_idle+0x338/0x450
> > [ 56.867649] [c0000000109fbf10] [c0000000001a62bc]
cpu_startup_entry+0x3c/0x40
> > [ 56.867686] [c0000000109fbf40] [c00000000005ac34]
start_secondary+0x2a4/0x2b0
> > [ 56.867727] [c0000000109fbf90] [c00000000000c054]
start_secondary_prolog+0x10/0x14
> >
>