On Mon, 2006-12-18 at 21:41 -0800, Paul Jackson wrote:
Matt wrote:
> - Task watchers can actually improve kernel performance slightly (up to
> 2% in extremely fork-heavy workloads for instance).
Nice.
Could you explain why?
After the last round of patches I set out to improve instruction and
data cache hits.
Previous iterations of task watchers would prevent the code in these
paths from being inlined. Furthermore, the code certainly wouldn't be
placed near the table of function pointers (which was in an entirely
different ELF section). By placing them adjacent to each other in the
same ELF section we can improve the likelihood of cache hits in
fork-heavy workloads (which were the ones that showed a performance
decrease in the previous iteration of these patches).
Suppose we have two functions to invoke during fork -- A and B. Here's
what the memory layout looked like in the previous iteration of task
watchers:
+--------------+<----+
| insns of B | |
. |
. |
. |
| | |
+--------------+ |
. |
. |
. |
+--------------+<-+ |
| insns of A | | |
. | |
. | |
. | |
| | | |
+--------------+ | |
. | |
. | |
. | | .text
==================|==|========= ELF Section Boundary
+--------------+ | | .task
| pointer to A----+ |
+--------------+ |
| pointer to B-------+
+--------------+
The notify_task_watchers() function would first load the pointer to A from the .task
section. Then it would immediately jump into the .text section and force the
instructions from A to be loaded. When A was finished, it would return to
notify_task_watchers() only to jump into B by the same steps.
As you can see things can be rather spread out. Unless the compiler inlined the
functions called from copy_process() things are very similar in a mainline
kernel -- copy_process() could be jumping to rather distant portions of the kernel
text and the pointer table would be rather distant from the instructions to be loaded.
Here's what the new patches look like:
===============================
+--------------+ .task
| pointer to A----+
+--------------+ |
| pointer to B-------+
+--------------+ | |
. | |
. | |
+--------------+<-+ |
| insns of A | |
. |
. |
. |
| | |
+--------------+<----+
| insns of B |
.
.
.
| |
+--------------+
===============================
Which is clearly more compact and also follows the order of calls (A
then B). The instructions are all in the same section. When A finishes
executing we soon jump into B which could be in the same instruction
cache line as the function we just left. Furthermore, since the sequence
always goes from A to B I expect some anticipatory loads could be done.
For fork-heavy workloads I'd expect this to explain the performance
difference. For workloads that aren't fork-heavy I suspect we're just as
likely to experience instruction cache misses -- whether the functions
are inlined, adjacent, or not -- since the fork happens relatively
infrequently and other instructions are likely to push fork-related
instructions out.
Cheers,
-Matt Helsley