[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: vm lock contention reduction



Andrea Arcangeli wrote:
> 
> On Thu, Jul 04, 2002 at 11:33:45PM -0700, Andrew Morton wrote:
> > Well.  First locks first.  kmap_lock is a bad one on x86.
> 
> Actually I thought about kmap_lock and the per-process kmaps a bit more
> with Martin (cc'ed) during OLS and there is an easy process-scalable
> solution to drop:

Martin is being bitten by the global invalidate more than by the lock.
He increased the size of the kmap pool just to reduce the invalidate
frequency and saw 40% speedups of some stuff.

Those invalidates don't show up nicely on profiles.

>         the kmap_lock
>         in turn the global pool
>         in turn the global tlb flush
> 
> The only problem is that it's not anymore both atomic *and* persistent,
> it's only persistent. It's also atomic if the mm_count == 1, but the
> kernel cannot rely on it, it has to assume it's a blocking operation
> always (you find it out if it's blocking only at runtime).

I was discussing this with sct a few days back.  iiuc, the proposal
was to create a small per-cpu pool (say, 4-8 pages) which is a
"front-end" to regular old kmap().

Any time you have one of these pages in use, the process gets
pinned onto the current CPU. If we run out of per-cpu kmaps,
just fall back to traditional kmap().

It does mean that this variant of kmap() couldn't just return
a `struct page *' - it would have to return something richer
than that.

> In short the same design of the per-process kmaps will work just fine if
> we add a semaphore to the mm_struct. then before starting using the kmap
> entry we must acquire the semaphore. This way all the global locking and
> global tlb flush goes away completely for normal tasks, but still
> remains the contention of that per-mm semaphore with threads doing
> simutaneous pte manipulation or simultaneous pagecache I/O though.
> Furthmore this I/O will be serialized, threaded benchmark like dbench
> may perform poorly that way I suspect, or we should add a pool of
> userspace pages so more than 1 thread is allowed to go ahead, but still
> we may cacheline-bounce in the synchronization of the pool across
> threads (similar to what we do now in the global pool).
> 
> Then there's the problem the pagecache/FS API should be changed to pass
> the vaddr through the stack because page->virtual would go away, the
> virtual address would be per-process protected by the mm->kmap_sem so we
> couldn't store it in a global, all tasks can kmap the same page at the
> same time at virtual vaddr. This as well will break some common code.
> 
> Last but not the least, I hope in 2.6 production I won't be running
> benchmarks and profiling using a 32bit cpu anymore anyways.
> 
> So I'm not very motivated anymore in doing that change after the comment
> from Linus about the issue with threads.

I believe that IBM have 32gig, 8- or 16-CPU ia32 machines just
coming into production now.  Presumably, they're not the only
ones.  We're stuck with this mess for another few years.

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/