Steve Grubb <sgrubb(a)redhat.com> writes:
Hello,
Thanks for the detailed notes on this investigation. It really is a lot of
good information backing this up. However, there will come a day when someone
sees this "major = ctx->major" and they will send a patch to "fix"
this
unnecessary assignment. If you are sending a V2 of this set, I would suggest
adding some comment in the code that this is for a performance improvement
and to see the commit message for additional info.
Thanks for the comment. Just sent out v2 of the last patch which
addresses this tangentially -- it adds a separate function parameter
for ctx->major/uring_op, so this shouldn't be an issue anymore.
Thanks
Ankur
Thanks,
-Steve
On Tuesday, September 27, 2022 6:59:42 PM EDT Ankur Arora wrote:
> ctx->major contains the current syscall number. This is, of course, a
> constant for the duration of the syscall. Unfortunately, GCC's alias
> analysis cannot prove that it is not modified via a pointer in the
> audit_filter_syscall() loop, and so always loads it from memory.
>
> In and of itself the load isn't very expensive (ops dependent on the
> ctx->major load are only used to determine the direction of control flow
> and have short dependence chains and, in any case the related branches
> get predicted perfectly in the fastpath) but still cache ctx->major
> in a local for two reasons:
>
> * ctx->major is in the first cacheline of struct audit_context and has
> similar alignment as audit_entry::list audit_entry. For cases
> with a lot of audit rules, doing this reduces one source of contention
> from a potentially busy cache-set.
>
> * audit_in_mask() (called in the hot loop in audit_filter_syscall())
> does cast manipulation and error checking on ctx->major:
>
> audit_in_mask(const struct audit_krule *rule, unsigned long val):
> if (val > 0xffffffff)
> return false;
>
> word = AUDIT_WORD(val);
> if (word >= AUDIT_BITMASK_SIZE)
> return false;
>
> bit = AUDIT_BIT(val);
>
> return rule->mask[word] & bit;
>
> The clauses related to the rule need to be evaluated in the loop, but
> the rest is unnecessarily re-evaluated for every loop iteration.
> (Note, however, that most of these are cheap ALU ops and the branches
> are perfectly predicted. However, see discussion on cycles
> improvement below for more on why it is still worth hoisting.)
>
> On a Skylakex system change in getpid() latency (aggregated over
> 12 boot cycles):
>
> Min Mean Median Max pstdev
> (ns) (ns) (ns) (ns)
>
> - 201.30 216.14 216.22 228.46 (+- 1.45%)
> + 196.63 207.86 206.60 230.98 (+- 3.92%)
>
> Performance counter stats for 'bin/getpid' (3 runs) go from:
> cycles 836.89 ( +- .80% )
> instructions 2000.19 ( +- .03% )
> IPC 2.39 ( +- .83% )
> branches 430.14 ( +- .03% )
> branch-misses 1.48 ( +- 3.37% )
> L1-dcache-loads 471.11 ( +- .05% )
> L1-dcache-load-misses 7.62 ( +- 46.98% )
>
> to:
> cycles 805.58 ( +- 4.11% )
> instructions 1654.11 ( +- .05% )
> IPC 2.06 ( +- 3.39% )
> branches 430.02 ( +- .05% )
> branch-misses 1.55 ( +- 7.09% )
> L1-dcache-loads 440.01 ( +- .09% )
> L1-dcache-load-misses 9.05 ( +- 74.03% )
>
> (Both aggregated over 12 boot cycles.)
>
> instructions: we reduce around 8 instructions/iteration because some of
> the computation is now hoisted out of the loop (branch count does not
> change because GCC, for reasons unclear, only hoists the computations
> while keeping the basic-blocks.)
>
> cycles: improve by about 5% (in aggregate and looking at individual run
> numbers.) This is likely because we now waste fewer pipeline resources
> on unnecessary instructions which allows the control flow to
> speculatively execute further ahead shortening the execution of the loop
> a little. The final gating factor on the performance of this loop
> remains the long dependence chain due to the linked-list load.
>
> Signed-off-by: Ankur Arora <ankur.a.arora(a)oracle.com>
> ---
> kernel/auditsc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index 79a5da1bc5bb..533b087c3c02 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -843,13 +843,14 @@ static void audit_filter_syscall(struct task_struct
> *tsk, {
> struct audit_entry *e;
> enum audit_state state;
> + unsigned long major = ctx->major;
>
> if (auditd_test_task(tsk))
> return;
>
> rcu_read_lock();
> list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_EXIT],
list) {
> - if (audit_in_mask(&e->rule, ctx->major) &&
> + if (audit_in_mask(&e->rule, major) &&
> audit_filter_rules(tsk, &e->rule, ctx, NULL,
> &state, false)) {
> rcu_read_unlock();
--
ankur