[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: shared pagetable benchmarking



Linus Torvalds wrote:
> 
> > ...
> The mmap() case should
> _not_ use that system call path at all, but should instead just call the
> populate function directly. Something like the appended patch.

Seems to do the right thing, but alas, it's slower:

without:
pushpatch 99  8.20s user 10.00s system 99% cpu 18.341 total
poppatch 99  5.76s user 6.65s system 99% cpu 12.521 total
c0114c64 kmap_atomic_to_page                          84   0.9438
c01308ec handle_mm_fault                              92   0.4340
c01c4b58 __copy_from_user                             94   0.8393
c012f330 clear_page_tables                           113   0.5650
c01305b0 do_anonymous_page                           123   0.3844
c011a9c0 do_softirq                                  145   0.8239
c0113d9c pte_alloc_one                               146   1.1406
c012f534 copy_page_range                             174   0.3595
c01c4af0 __copy_to_user                              188   1.8077
c01306f0 do_no_page                                  241   0.4744
c012f718 zap_pte_range                               265   0.6370
c0113ec0 do_page_fault                               321   0.2956
c0133a8c page_add_rmap                               322   1.1838
c0114be4 kmap_atomic                                 326   3.0185
c0133b9c page_remove_rmap                            360   0.9574
c012ff54 do_wp_page                                 1245   1.9095
00000000 total                                      6812   0.0042

(374019 pagefaults)

with:
pushpatch 99  8.16s user 11.76s system 99% cpu 20.072 total
poppatch 99  5.68s user 7.93s system 99% cpu 13.656 total
c012f330 clear_page_tables                           111   0.5550
c0114c64 kmap_atomic_to_page                         121   1.3596
c0113d9c pte_alloc_one                               140   1.0938
c011a9c0 do_softirq                                  150   0.8523
c01305b0 do_anonymous_page                           157   0.4906
c01c4af0 __copy_to_user                              157   1.5096
c012e590 install_page                                202   0.6012
c0113ec0 do_page_fault                               209   0.1924
c012f534 copy_page_range                             215   0.4442
c01306f0 do_no_page                                  224   0.4409
c0114be4 kmap_atomic                                 392   3.6296
c012f718 zap_pte_range                               417   1.0024
c0133a8c page_add_rmap                               563   2.0699
c0133b9c page_remove_rmap                            653   1.7367
c012ff54 do_wp_page                                 1318   2.0215
00000000 total                                      8072   0.0050

(240622 pagefaults)

That's uniprocessor, highpte.  Presumably there are lots of cached
libc pages which these scripts don't actually need.

It needs more analysis/instrumentation/work, but it's not promising.

Cache misses against the pte_chains is what is hurting here. Something
which may help on P4 is to keep the pte_chains at 32 bytes, so that
virtually-adjacent pages' pte_chains will probably share cachelines.  I
have a pseudo-4way HT box sitting here awaiting commissioning...
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/