On Wed, May 26, 2021 at 3:44 PM Jens Axboe <axboe(a)kernel.dk> wrote:
On 5/26/21 12:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <axboe(a)kernel.dk> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe
<axboe(a)kernel.dk> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build
time and
>>>>>>>> runtime options to help make life easier. Beyond simply
disabling
>>>>>>>> audit at compile time a number of Linux distributions
effectively
>>>>>>>> shortcut audit at runtime by adding a "never"
rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>> % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that
distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world
case outside
>>>>>>> of people doing custom kernels. My question would then be
how much
>>>>>>> overhead the above will add, considering it's an
entry/exit call per op.
>>>>>>> If auditctl is turned off, what is the expectation in turns
of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll
try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the
req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left
with just
>>>>>> the bare audit calls in the io_uring code. As the _entry and
_exit
>>>>>> functions are small, I've copied them and their supporting
functions
>>>>>> below and I'll try to explain what would happen in
CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> + static inline struct audit_context *audit_context(void)
>>>>>> + {
>>>>>> + return current->audit_context;
>>>>>> + }
>>>>>>
>>>>>> + static inline bool audit_dummy_context(void)
>>>>>> + {
>>>>>> + void *p = audit_context();
>>>>>> + return !p || *(int *)p;
>>>>>> + }
>>>>>>
>>>>>> + static inline void audit_uring_entry(u8 op)
>>>>>> + {
>>>>>> + if (unlikely(audit_enabled && audit_context()))
>>>>>> + __audit_uring_entry(op);
>>>>>> + }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions. The first (audit_enabled) is simply a
check to
>>>>>> see if anyone has "turned on" auditing at runtime;
historically this
>>>>>> worked rather well, and still does in a number of places, but
ever
>>>>>> since systemd has taken to forcing audit on regardless of the
admin's
>>>>>> audit configuration it is less useful. The second
(audit_context())
>>>>>> is a check to see if an audit_context has been allocated for
the
>>>>>> current task. In the case of "task,never"
current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry()
slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context. Depending on which you think is more
likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is
more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way
now).
>>>>>>
>>>>>> + static inline void audit_uring_exit(int success, long code)
>>>>>> + {
>>>>>> + if (unlikely(!audit_dummy_context()))
>>>>>> + __audit_uring_exit(success, code);
>>>>>> + }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check
to be performed
>>>>>> is the current->audit_context check which we know to be NULL.
The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+,
and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the
baseline
>>>>> test and the test with this series applied. I used your git branch
as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not
threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the
other
>>>>> is using SQPOLL to do the IO for us. The device is capable than
more
>>>>> IOPS than a single core can deliver, so we're CPU limited in
this test.
>>>>> Hence it's a good test case as it does actual work, and shows
software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5%
difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel SQPOLL IOPS Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13 0 3029872 0.0%
>>>>> 5.13 1 3031056 0.0%
>>>>> 5.13 + audit 0 2894160 -4.5%
>>>>> 5.13 + audit 1 2886168 -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a
quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in
the profile)
>>>>> shows this:
>>>>>
>>>>> + 2.17% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>>>>> + 0.71% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>>>>> 0.07% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>>>>> 0.02% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel SQPOLL IOPS Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13 0 31.05M 0.0%
>>>> 5.13 + audit 0 25.31M -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> + 5.19% io_uring [kernel.vmlinux] [k] __audit_uring_entry
>>>> + 4.31% io_uring [kernel.vmlinux] [k] __audit_uring_exit
>>>> 0.26% io_uring [kernel.vmlinux] [k] __audit_syscall_entry
>>>> 0.08% io_uring [kernel.vmlinux] [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed it, looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers is without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>> want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>> hook, turning it into just one. That effectively cuts the number of
>> checks and calls in half.
>
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance. Also, as has been mention several
> times now, for audit to work we need both the _entry and _exit call.
Pulled in your new stuff, and ran a quick test again (same as before,
the rand reads). No change, we're still down ~5% with auditctl -a
task,never having been run. If I stub out your two audit entry/exit
calls by adding an if (0) before them, then perf is back to normal.
That's fun.
Considering that we are down to basically a single if-conditional in
the audit code and your other tests with another single conditional
yielded no significant change it looks like we really only have one
option left before we "agree to disagree" and have to mark io_uring
and audit as mutually exclusive in Kconfig. If we moved the _entry
and _exit calls into the individual operation case blocks (quick
openat example below) so that only certain operations were able to be
audited would that be acceptable assuming the high frequency ops were
untouched? My initial gut feeling was that this would involve >50% of
the ops, but Steve Grubb seems to think it would be less; it may be
time to look at that a bit more seriously, but if it gets a NACK
regardless it isn't worth the time - thoughts?
case IORING_OP_OPENAT:
audit_uring_entry(req->opcode);
ret = io_openat(req, issue_flags);
audit_uring_exit(!ret, ret);
break;
--
paul moore
www.paul-moore.com