Re: Question on swapping
Hello Raghav,
Thanks for your reply.
> But I have my own doubts on swapping which I would like to get
> cleared. I do not understand why a shared page gets transferred
> to disk at all while it is still in use.
>
> Why won't the following steps work:
> (i) The first time try_to_swap_out() is called on a shared page:
> the page is added to the swap cache and __free_page() is called;
> the page count is not zero, so do not transfer the page to disk.
> (ii) When the last process that shared the page calls
> try_to_swap_out(): the page count hits 0.
It would only drop to 1, since the swap cache also has a reference on that page(?)
> Then transfer the page to disk.
> This way only one disk transfer (which is expensive) gets done
> for shared pages.
I was wondering about that as well. However, I could not find that problem in 2.4.2. See my commented try_to_swap_out() code from 2.2.18 and 2.4.2 below; my comments
are marked with ### M. Maletinsky.
See especially the comments at the end of each of the code excerpts.
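Regarding the page count in step (ii): to make my remark about the swap cache reference concrete, here is a minimal user-space sketch. It is a toy model and not kernel code; the rc_* names are made up for illustration, and it simply assumes one reference per mapping process plus one for the swap cache.

```c
#include <stdbool.h>

/* Hypothetical toy model of a page frame's reference count (not kernel code). */
struct rc_page {
	int count;          /* references held on the page frame */
	bool in_swap_cache; /* true once the swap cache holds its own reference */
};

/* A process maps the page: assumed to cost one reference per mapping. */
void rc_map(struct rc_page *p)
{
	p->count++;
}

/* The swap cache takes its own reference the first time the page is added. */
void rc_add_to_swap_cache(struct rc_page *p)
{
	if (!p->in_swap_cache) {
		p->in_swap_cache = true;
		p->count++;
	}
}

/* A process drops its mapping; returns the remaining count. */
int rc_unmap(struct rc_page *p)
{
	return --p->count;
}

/* True when only the swap cache still references the page. */
bool rc_last_user_gone(const struct rc_page *p)
{
	return p->in_swap_cache && p->count == 1;
}

/* Two sharers, one swap cache reference, both sharers unmap. */
int rc_demo(void)
{
	struct rc_page p = { 0, false };

	rc_map(&p);
	rc_map(&p);
	rc_add_to_swap_cache(&p);
	rc_unmap(&p);
	rc_unmap(&p);
	return p.count;
}
```

In this model the count ends at 1 after the last sharer has unmapped, so a test for "last reference dropped" would have to compare against 1 rather than 0.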
with best regards
Martin Maletinsky
---------------------------------
2.2.18:
static int try_to_swap_out(struct task_struct * tsk, struct vm_area_struct* vma,
	unsigned long address, pte_t * page_table, int gfp_mask)
{
	pte_t pte;
	unsigned long entry;
	unsigned long page;
	struct page * page_map;

	pte = *page_table;
	if (!pte_present(pte))
		return 0;
	page = pte_page(pte);
	if (MAP_NR(page) >= max_mapnr)
		return 0;

	page_map = mem_map + MAP_NR(page);
	if (pte_young(pte)) {
		/*
		 * Transfer the "accessed" bit from the page
		 * tables to the global page map.
		 */
		set_pte(page_table, pte_mkold(pte));
		flush_tlb_page(vma, address);
		set_bit(PG_referenced, &page_map->flags);
		return 0;
	}

	if (PageReserved(page_map)
	    || PageLocked(page_map)
	    || ((gfp_mask & __GFP_DMA) && !PageDMA(page_map)))
		return 0;

	/*
	 * Is the page already in the swap cache? If so, then
	 * we can just drop our reference to it without doing
	 * any IO - it's already up-to-date on disk.
### M. Maletinsky:
Why is that? The page may have been dirtied by the process from which it is
currently being unmapped (i.e. tsk). In that case the in-memory image differs
from the on-disk image, while the page descriptor does not have its PG_dirty
bit set. Moreover, *pte (whose dirty bit was set by the MMU when the process
wrote to the page) is discarded by the subsequent lines of code, so the
information that the page was written to is lost.
	 *
	 * Return 0, as we didn't actually free any real
	 * memory, and we should just continue our scan.
	 */
	if (PageSwapCache(page_map)) {
		entry = page_map->offset;
		swap_duplicate(entry);
		set_pte(page_table, __pte(entry));
drop_pte:
		vma->vm_mm->rss--;
		flush_tlb_page(vma, address);
### M. Maletinsky:
This is the latest point at which I would expect the page to be marked dirty (in the 2.4.2 code the page is in fact marked dirty at more or less this point - see below).
		__free_page(page_map);
		return 0;
	}

	/*
	 * Is it a clean page? Then it must be recoverable
	 * by just paging it in again, and we can just drop
	 * it..
	 *
	 * However, this won't actually free any real
	 * memory, as the page will just be in the page cache
	 * somewhere, and as such we should just continue
	 * our scan.
	 *
	 * Basically, this just makes it possible for us to do
	 * some real work in the future in "shrink_mmap()".
	 */
	if (!pte_dirty(pte)) {
		flush_cache_page(vma, address);
		pte_clear(page_table);
		goto drop_pte;
	}

	/*
	 * Don't go down into the swap-out stuff if
	 * we cannot do I/O! Avoid recursing on FS
	 * locks etc.
	 */
	if (!(gfp_mask & __GFP_IO))
		return 0;

	/*
	 * Ok, it's really dirty. That means that
	 * we should either create a new swap cache
	 * entry for it, or we should write it back
	 * to its own backing store.
	 *
	 * Note that in neither case do we actually
	 * know that we make a page available, but
	 * as we potentially sleep we can no longer
	 * continue scanning, so we migth as well
	 * assume we free'd something.
	 *
	 * NOTE NOTE NOTE! This should just set a
	 * dirty bit in page_map, and just drop the
	 * pte. All the hard work would be done by
	 * shrink_mmap().
	 *
	 * That would get rid of a lot of problems.
	 */
	flush_cache_page(vma, address);
	if (vma->vm_ops && vma->vm_ops->swapout) {
		pid_t pid = tsk->pid;
		pte_clear(page_table);
		flush_tlb_page(vma, address);
		vma->vm_mm->rss--;
		if (vma->vm_ops->swapout(vma, page_map))
			kill_proc(pid, SIGBUS, 1);
		__free_page(page_map);
		return 1;
	}

	/*
	 * This is a dirty, swappable page. First of all,
	 * get a suitable swap entry for it, and make sure
	 * we have the swap cache set up to associate the
	 * page with that swap entry.
	 */
	entry = get_swap_page();
	if (!entry)
		return 0; /* No swap space left */

	vma->vm_mm->rss--;
	tsk->nswap++;
	set_pte(page_table, __pte(entry));
	flush_tlb_page(vma, address);
	swap_duplicate(entry);	/* One for the process, one for the swap cache */
	add_to_swap_cache(page_map, entry);
	/* We checked we were unlocked way up above, and we
	   have been careful not to stall until here */
	set_bit(PG_locked, &page_map->flags);
### M. Maletinsky:
This, I think, is the point you (Raghav) mentioned in your mail. Why write a page to disk while it may still be referenced by another process?
Wouldn't it make more sense to write it to disk only once, when the last reference is dropped?
	/* OK, do a physical asynchronous write to swap. */
	rw_swap_page(WRITE, entry, (char *) page, 0);

	__free_page(page_map);
	return 1;
}
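
To illustrate the concern from my first comment in the excerpt above, here is a small user-space sketch. It is a toy model with made-up db_* names, not kernel code, and it only reflects my reading of the swap cache branch; it is not a claim that 2.2.18 actually loses data, since I may be missing a reason why a swap cache page cannot have a dirty pte.

```c
#include <stdbool.h>

/* Hypothetical toy types: a pte with a hardware dirty bit, and a page
 * descriptor with a software dirty flag. */
struct db_pte  { bool present; bool dirty; };
struct db_page { bool in_swap_cache; bool dirty; };

/* Modeled on my reading of the 2.2.18 swap cache branch: the pte is
 * overwritten with the swap entry, so its dirty bit is discarded. */
void db_unmap_2_2(struct db_pte *pte, struct db_page *page)
{
	(void)page; /* the page descriptor is not touched */
	pte->present = false;
	pte->dirty = false;
}

/* Modeled on the 2.4.2 swap cache branch: the pte is read and cleared,
 * and a set dirty bit is transferred to the page descriptor. */
void db_unmap_2_4(struct db_pte *pte, struct db_page *page)
{
	struct db_pte old = *pte;

	pte->present = false;
	pte->dirty = false;
	if (old.dirty)
		page->dirty = true;
}

/* Returns true if the information "this page was written to" is lost
 * when a dirtied page that is already in the swap cache gets unmapped. */
bool db_write_lost(void (*unmap)(struct db_pte *, struct db_page *))
{
	struct db_pte pte = { true, true };    /* the process wrote to the page */
	struct db_page page = { true, false }; /* in swap cache, not marked dirty */

	unmap(&pte, &page);
	return !page.dirty;
}
```

With these assumptions db_write_lost(db_unmap_2_2) is true and db_write_lost(db_unmap_2_4) is false, which is exactly the difference between the two excerpts that puzzles me.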
---------------------------------
2.4.2
static void try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, struct page *page)
{
	pte_t pte;
	swp_entry_t entry;

	/* Don't look at this pte if it's been accessed recently. */
	if (ptep_test_and_clear_young(page_table)) {
		page->age += PAGE_AGE_ADV;
		if (page->age > PAGE_AGE_MAX)
			page->age = PAGE_AGE_MAX;
		return;
	}

	if (TryLockPage(page))
		return;

	/* From this point on, the odds are that we're going to
	 * nuke this pte, so read and clear the pte. This hook
	 * is needed on CPUs which update the accessed and dirty
	 * bits in hardware.
	 */
	pte = ptep_get_and_clear(page_table);
	flush_tlb_page(vma, address);

	/*
	 * Is the page already in the swap cache? If so, then
	 * we can just drop our reference to it without doing
	 * any IO - it's already up-to-date on disk.
	 */
	if (PageSwapCache(page)) {
		entry.val = page->index;
### M. Maletinsky:
This seems to fix the problem I mentioned above. However, does that mean that the code in 2.2.18 did not work correctly?
		if (pte_dirty(pte))
			set_page_dirty(page);
set_swap_pte:
		swap_duplicate(entry);
		set_pte(page_table, swp_entry_to_pte(entry));
drop_pte:
		mm->rss--;
		if (!page->age)
			deactivate_page(page);
		UnlockPage(page);
		page_cache_release(page);
		return;
	}

	/*
	 * Is it a clean page? Then it must be recoverable
	 * by just paging it in again, and we can just drop
	 * it..
	 *
	 * However, this won't actually free any real
	 * memory, as the page will just be in the page cache
	 * somewhere, and as such we should just continue
	 * our scan.
	 *
	 * Basically, this just makes it possible for us to do
	 * some real work in the future in "refill_inactive()".
	 */
	flush_cache_page(vma, address);
	if (!pte_dirty(pte))
		goto drop_pte;

	/*
	 * Ok, it's really dirty. That means that
	 * we should either create a new swap cache
	 * entry for it, or we should write it back
	 * to its own backing store.
	 */
	if (page->mapping) {
		set_page_dirty(page);
		goto drop_pte;
	}

	/*
	 * This is a dirty, swappable page. First of all,
	 * get a suitable swap entry for it, and make sure
	 * we have the swap cache set up to associate the
	 * page with that swap entry.
	 */
	entry = get_swap_page();
	if (!entry.val)
		goto out_unlock_restore; /* No swap space left */

	/* Add it to the swap cache and mark it dirty */
	add_to_swap_cache(page, entry);
	set_page_dirty(page);
	goto set_swap_pte;
### M. Maletinsky:
From what I can see, the page is *NOT* written to disk here (it is only added to the swap cache and marked dirty), in contradiction to what you (Raghav) wrote in your mail.
out_unlock_restore:
	set_pte(page_table, pte);
	UnlockPage(page);
	return;
}
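
To make my last comment concrete, the following user-space sketch models how I read the 2.4.2 excerpt: unmapping only marks the page dirty and performs no I/O. The dw_* names are made up, and dw_launder() is purely an assumption standing in for whatever later pass writes dirty pages out; I have not traced that code here.

```c
#include <stdbool.h>

/* Hypothetical toy model of an anonymous page shared by several processes. */
struct dw_page {
	int mappers;        /* processes that still map the page */
	bool in_swap_cache;
	bool dirty;
	int disk_writes;    /* number of writes to swap performed so far */
};

/* Modeled on the 2.4.2 excerpt: a dirty pte causes the page to be added
 * to the swap cache and marked dirty; nothing is written to disk here. */
void dw_try_to_swap_out(struct dw_page *p, bool pte_dirty)
{
	if (pte_dirty) {
		p->in_swap_cache = true;
		p->dirty = true;
	}
	p->mappers--;
}

/* Assumed later cleaning pass (hypothetical): write a dirty page once
 * and clear the dirty flag. Returns true if a write was issued. */
bool dw_launder(struct dw_page *p)
{
	if (!p->dirty)
		return false;
	p->disk_writes++;
	p->dirty = false;
	return true;
}
```

In this model, unmapping the page from two processes leaves disk_writes at 0, and a single dw_launder() call afterwards raises it to 1, which would give the "one disk transfer for a shared page" behaviour that Raghav asked about.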
--
Supercomputing System AG email: maletinsky@scs.ch
Martin Maletinsky phone: +41 (0)1 445 16 05
Technoparkstrasse 1 fax: +41 (0)1 445 16 10
CH-8005 Zurich
--
Kernelnewbies: Help each other learn about the Linux kernel.
Archive: http://mail.nl.linux.org/kernelnewbies/
FAQ: http://kernelnewbies.org/faq/