This series attempts to reduce syscall-exit latency in the audit path,
especially for cases where there are a lot of audit exit rules.
From profiling, audit_filter_syscall() takes the largest chunk of
time,
specifically in this list-walk while processing the AUDIT_FILTER_EXIT
list:
	list = &audit_filter_list[AUDIT_FILTER_EXIT];
	list_for_each_entry_rcu(e, list, list) {
		if (audit_in_mask(&e->rule, ctx->major) &&
		    audit_filter_rules(tsk, &e->rule, ctx, NULL,
				       &state, false, x)) {
			rcu_read_unlock();
			ctx->current_state = state;
			return state;
		}
	}
(Note that almost identical logic exists in audit_filter_uring() and
audit_filter_inode_name().)
To compare baseline performance with audit=0 and audit=1, I use a
user-space getpid() loop that executes 10^7 times. For audit=1, the
test uses an audit-rule set where the audit_filter_syscall() loop
iterates over 42 AUDIT_FILTER_EXIT rules and, for getpid(), calls
audit_filter_rules() for 5 of them (we use this rule set in
production.)
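The benchmark itself is not included in the series; a minimal
reconstruction of such a getpid() loop might look like the sketch
below (the harness structure and the getpid_loop() name are my
assumptions, not the actual bin/getpid source):

```c
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Issue the getpid syscall n times and return the number of calls
 * made. syscall(SYS_getpid) is used so every iteration actually
 * enters the kernel, rather than relying on the libc wrapper.
 */
static long getpid_loop(long n)
{
	long i;

	for (i = 0; i < n; i++)
		syscall(SYS_getpid);
	return n;
}
```

Wrapped in a main() that calls getpid_loop(10000000), this is the
kind of binary the perf stat runs below would measure.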
Test system:
Server: ORACLE SERVER X8-2L
CPU: Intel Skylakex (6:85:6)
Microcode: 0x400320a
# v6.0.0-rc5.baseline, audit=0

 Performance counter stats for 'bin/getpid' (3 runs):

            734.10 msec task-clock                #    0.999 CPUs utilized          ( +-  0.03% )
                 1      context-switches          #    1.361 /sec                   ( +- 66.67% )
                 0      cpu-migrations            #    0.000 /sec
                53      page-faults               #   72.152 /sec                   ( +-  0.63% )
     2,838,869,156      cycles                    #    3.865 GHz                    ( +-  0.01% )  (62.40%)
     4,174,224,305      instructions              #    1.47  insn per cycle         ( +-  0.01% )  (74.93%)
       890,798,133      branches                  #    1.213 G/sec                  ( +-  0.01% )  (74.93%)
         5,015,118      branch-misses             #    0.56% of all branches        ( +-  0.05% )  (74.93%)
     1,231,150,558      L1-dcache-loads           #    1.676 G/sec                  ( +-  0.01% )  (74.94%)
           418,297      L1-dcache-load-misses     #    0.03% of all L1-dcache accesses  ( +-  0.68% )  (75.07%)
             3,937      LLC-loads                 #    5.360 K/sec                  ( +-  3.76% )  (50.13%)
               510      LLC-load-misses           #   13.39% of all LL-cache accesses   ( +- 79.89% )  (50.00%)

          0.735018 +- 0.000275 seconds time elapsed  ( +-  0.04% )
# v6.0.0-rc5.baseline, audit=1

 Performance counter stats for 'bin/getpid' (3 runs):

          2,158.81 msec task-clock                #    0.998 CPUs utilized          ( +-  0.13% )
                 2      context-switches          #    0.925 /sec                   ( +- 28.87% )
                 0      cpu-migrations            #    0.000 /sec
                52      page-faults               #   24.056 /sec                   ( +-  0.64% )
     8,364,119,898      cycles                    #    3.869 GHz                    ( +-  0.11% )  (62.48%)
    19,996,521,678      instructions              #    2.39  insn per cycle         ( +-  0.01% )  (74.98%)
     4,302,068,252      branches                  #    1.990 G/sec                  ( +-  0.00% )  (74.98%)
        15,135,694      branch-misses             #    0.35% of all branches        ( +-  0.16% )  (74.99%)
     4,714,694,841      L1-dcache-loads           #    2.181 G/sec                  ( +-  0.01% )  (74.99%)
        66,407,857      L1-dcache-load-misses     #    1.41% of all L1-dcache accesses  ( +-  1.50% )  (75.01%)
             6,785      LLC-loads                 #    3.139 K/sec                  ( +- 12.49% )  (50.03%)
             3,235      LLC-load-misses           #   41.29% of all LL-cache accesses   ( +-  6.08% )  (50.01%)

           2.16213 +- 0.00288 seconds time elapsed  ( +-  0.13% )
perf stat numbers for each getpid() iteration:

                      baseline     baseline
                       audit=0      audit=1

   cycles                  283          836
   instructions            417         1999
   IPC                    1.47         2.39
   branches                 89          430
   branch-misses          0.50         1.51
   L1-loads                123          471
   L1-load-misses           ~0         ~6.6 *

   * the L1-load-misses are largely stable for runs across a single
     boot, but vary across boots due to vagaries of the SLAB allocator.
Baseline audit=1 spends a significant amount of time executing in audit
code and incurs a new branch-miss and a few new L1-load-misses per
iteration. Also note that the computed audit-only IPC, (1999 - 417)
instructions / (836 - 283) cycles = 2.86, is high, so the baseline is
not ill-performant code to begin with.
Patches
==
Patch 1, "audit: cache ctx->major in audit_filter_syscall()", caches
ctx->major in a local variable. This gets rid of a persistent
L1-dcache entry (audit_context::major) that had similar alignment
constraints as a potentially busy cache-set (audit_entry::list), and
allows some of the error checking in audit_in_mask() to be hoisted out
of the loop.
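The shape of that change can be sketched in userspace terms: read the
match key once, do the range check and word/bit split outside the
walk, and keep the per-rule test trivial. All names below
(BITMASK_SIZE, struct rule, filter_syscall()) are illustrative
stand-ins, not the kernel's:

```c
#define BITMASK_SIZE 64	/* stand-in for AUDIT_BITMASK_SIZE */

struct rule {
	unsigned int mask[BITMASK_SIZE];	/* one bit per syscall number */
};

/* per-rule test once the loop-invariant pieces are precomputed */
static int rule_in_mask(const struct rule *r, int word, unsigned int bit)
{
	return (r->mask[word] & bit) != 0;
}

/* return the index of the first matching rule, or -1 */
static int filter_syscall(const struct rule *rules, int nrules,
			  unsigned long major)
{
	int i, word;
	unsigned int bit;

	/*
	 * Hoisted out of the walk: the range check and word/bit split
	 * that an audit_in_mask()-style helper would otherwise redo
	 * for every rule on the list.
	 */
	if (major >= BITMASK_SIZE * 32UL)
		return -1;
	word = major / 32;
	bit = 1u << (major % 32);

	for (i = 0; i < nrules; i++)
		if (rule_in_mask(&rules[i], word, bit))
			return i;
	return -1;
}
```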
Patch 2, "audit: annotate branch direction for audit_in_mask()",
annotates the common no-match case so the compiler can pessimize the
call to audit_filter_rules().
Patch 3, "audit: unify audit_filter_{uring(),inode_name(),syscall()}"
centralizes this logic in a common function.
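The unification can be sketched as a single list-walk helper that the
three call sites share instead of three near-identical loops; the
helper name, parameters, and types below are illustrative, not
necessarily what the patch uses:

```c
#include <stddef.h>

struct entry {
	unsigned int mask;	/* stand-in for the per-rule match state */
	int state;		/* stand-in for the resulting audit state */
	struct entry *next;
};

/*
 * Common walk shared by the syscall/uring/inode-name paths: each
 * caller passes its list and match key, and gets back the first
 * matching entry's state, or the default if nothing matches.
 */
static int filter_list(const struct entry *list, unsigned long key,
		       int default_state)
{
	const struct entry *e;

	for (e = list; e; e = e->next)
		if (key < 32 && ((e->mask >> key) & 1))
			return e->state;
	return default_state;
}
```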
With these changes:
 Performance counter stats for 'bin/getpid' (3 runs):

          1,750.21 msec task-clock                #    0.994 CPUs utilized          ( +-  0.45% )
                 3      context-switches          #    1.705 /sec                   ( +- 11.11% )
                 0      cpu-migrations            #    0.000 /sec
                52      page-faults               #   29.548 /sec                   ( +-  0.64% )
     6,770,976,590      cycles                    #    3.848 GHz                    ( +-  0.40% )  (27.74%)
    16,588,372,024      instructions              #    2.44  insn per cycle         ( +-  0.03% )  (33.34%)
     4,322,555,829      branches                  #    2.456 G/sec                  ( +-  0.02% )  (33.40%)
         2,803,286      branch-misses             #    0.06% of all branches        ( +- 26.45% )  (33.46%)
     4,449,426,336      L1-dcache-loads           #    2.528 G/sec                  ( +-  0.01% )  (27.71%)
        63,612,108      L1-dcache-load-misses     #    1.43% of all L1-dcache accesses  ( +-  0.50% )  (27.71%)
             6,123      LLC-loads                 #    3.479 K/sec                  ( +-  8.68% )  (27.71%)
             1,890      LLC-load-misses           #   26.69% of all LL-cache accesses   ( +- 46.99% )  (27.71%)

           1.76033 +- 0.00791 seconds time elapsed  ( +-  0.45% )
And, overall getpid() latency numbers (aggregated over 12 boots for each):

               Min       Mean     Median        Max      pstdev
              (ns)       (ns)       (ns)       (ns)

 baseline   201.30     216.14     216.22     228.46    (+- 1.45%)
 patch1     196.63     207.86     206.60     230.98    (+- 3.92%)
 patch1-2   173.11     182.51     179.65     202.09    (+- 4.34%)
 patch1-3   162.11     175.26     173.71     190.56    (+- 4.33%)
This gives a reasonable speedup. My testing was on Intel Skylake, but
I suspect this should translate to other architectures as well
(especially less wide ones.)
Please review.
Thanks
Ankur
Ankur Arora (3):
audit: cache ctx->major in audit_filter_syscall()
audit: annotate branch direction for audit_in_mask()
audit: unify audit_filter_{uring(),inode_name(),syscall()}
kernel/auditsc.c | 86 +++++++++++++++++++++++++-----------------------
1 file changed, 45 insertions(+), 41 deletions(-)
--
2.31.1