
Conversation

markjdb
Contributor

@markjdb markjdb commented Oct 13, 2024

Unconditionally hold the rangelock in zfs_getpages().

Motivation and Context

This is reportedly required for direct I/O support on FreeBSD.

Description

This change modifies zfs_getpages() to unconditionally acquire the rangelock. To avoid a deadlock, we may need to drop the page busy lock and allocate a new page later on.

How Has This Been Tested?

Smoke testing on FreeBSD.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 14, 2024
@behlendorf
Contributor

I'm not familiar with the FreeBSD VM layer, but it looks like the mmap_mixed test case managed to trip an assertion.

From the "vm2: serial console" log, https://github.com/openzfs/zfs/actions/runs/11315921698/job/31476777580?pr=16643

  ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/mmap/mmap_mixed
  panic: VERIFY(vm_page_none_valid(m)) failed
  
  cpuid = 1
  time = 1728878112
  KDB: stack backtrace:
  #0 0xffffffff80b9002d at kdb_backtrace+0x5d
  #1 0xffffffff80b43132 at vpanic+0x132
  #2 0xffffffff82841b6a at spl_panic+0x3a
  #3 0xffffffff8284845b at dmu_read_pages+0x7fb
  #4 0xffffffff82865527 at zfs_freebsd_getpages+0x2e7
  #5 0xffffffff80eddf81 at vnode_pager_getpages+0x41
  #6 0xffffffff80ed45c2 at vm_pager_get_pages+0x22
  #7 0xffffffff80eb18df at vm_fault+0x5ef
  #8 0xffffffff80eb11db at vm_fault_trap+0x6b
  #9 0xffffffff8100ca39 at trap_pfault+0x1d9
  #10 0xffffffff8100bfb2 at trap+0x442
  #11 0xffffffff80fe3828 at calltrap+0x8

@markjdb
Contributor Author

markjdb commented Oct 14, 2024

I'm not familiar with the FreeBSD VM layer, but it looks like the mmap_mixed test case managed to trip an assertion.

From the "vm2: serial console" log, https://github.com/openzfs/zfs/actions/runs/11315921698/job/31476777580?pr=16643

  ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/mmap/mmap_mixed
  panic: VERIFY(vm_page_none_valid(m)) failed
  
  cpuid = 1
  time = 1728878112
  KDB: stack backtrace:
  #0 0xffffffff80b9002d at kdb_backtrace+0x5d
  #1 0xffffffff80b43132 at vpanic+0x132
  #2 0xffffffff82841b6a at spl_panic+0x3a
  #3 0xffffffff8284845b at dmu_read_pages+0x7fb
  #4 0xffffffff82865527 at zfs_freebsd_getpages+0x2e7
  #5 0xffffffff80eddf81 at vnode_pager_getpages+0x41
  #6 0xffffffff80ed45c2 at vm_pager_get_pages+0x22
  #7 0xffffffff80eb18df at vm_fault+0x5ef
  #8 0xffffffff80eb11db at vm_fault_trap+0x6b
  #9 0xffffffff8100ca39 at trap_pfault+0x1d9
  #10 0xffffffff8100bfb2 at trap+0x442
  #11 0xffffffff80fe3828 at calltrap+0x8

Thanks, I think I see what's going on there. I didn't see that panic when I ran the ZTS locally, but if I understand the problem correctly, it's due to a race.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Oct 20, 2024
mappedread_sf() may allocate pages; if it fails to populate a page and
can't free it, it needs to ensure that it's placed into a page queue;
otherwise it can't be reclaimed until the vnode is destroyed.

I think this is quite unlikely to happen in practice; it was noticed by
code inspection.

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
@markjdb
Contributor Author

markjdb commented Nov 8, 2024

I'm not familiar with the FreeBSD VM layer, but it looks like the mmap_mixed test case managed to trip an assertion.
From the "vm2: serial console" log, https://github.com/openzfs/zfs/actions/runs/11315921698/job/31476777580?pr=16643

  ZTS run /usr/local/share/zfs/zfs-tests/tests/functional/mmap/mmap_mixed
  panic: VERIFY(vm_page_none_valid(m)) failed
  
  cpuid = 1
  time = 1728878112
  KDB: stack backtrace:
  #0 0xffffffff80b9002d at kdb_backtrace+0x5d
  #1 0xffffffff80b43132 at vpanic+0x132
  #2 0xffffffff82841b6a at spl_panic+0x3a
  #3 0xffffffff8284845b at dmu_read_pages+0x7fb
  #4 0xffffffff82865527 at zfs_freebsd_getpages+0x2e7
  #5 0xffffffff80eddf81 at vnode_pager_getpages+0x41
  #6 0xffffffff80ed45c2 at vm_pager_get_pages+0x22
  #7 0xffffffff80eb18df at vm_fault+0x5ef
  #8 0xffffffff80eb11db at vm_fault_trap+0x6b
  #9 0xffffffff8100ca39 at trap_pfault+0x1d9
  #10 0xffffffff8100bfb2 at trap+0x442
  #11 0xffffffff80fe3828 at calltrap+0x8

Thanks, I think I see what's going on there. I didn't see that panic when I ran the ZTS locally, but if I understand the problem correctly, it's due to a race.

I believe the latest version addresses this, and I didn't observe any panics while running the ZTS locally.

As a deadlock avoidance measure, zfs_getpages() would only try to
acquire a rangelock, falling back to a single-page read if this was not
possible.  However, this is incompatible with direct I/O.

Instead, release the busy lock before trying to acquire the rangelock in
blocking mode.  This means that it's possible for the page to be
replaced, so we have to re-lookup.

Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Contributor

@bwatkinson bwatkinson left a comment


Seems reasonable to me. I am not an expert in FreeBSD, but acquiring the rangelock unconditionally in zfs_getpages() was the goal. It is also now passing the FreeBSD CI tests for the dio_mmap test case.

@bwatkinson bwatkinson requested a review from amotin November 13, 2024 00:50
@amotin amotin removed the Status: Work in Progress Not yet ready for general review label Nov 13, 2024
Contributor

@behlendorf behlendorf left a comment


@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 13, 2024
behlendorf pushed a commit that referenced this pull request Nov 13, 2024
As a deadlock avoidance measure, zfs_getpages() would only try to
acquire a rangelock, falling back to a single-page read if this was not
possible.  However, this is incompatible with direct I/O.

Instead, release the busy lock before trying to acquire the rangelock in
blocking mode.  This means that it's possible for the page to be
replaced, so we have to re-lookup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes #16643
Comment on lines +4020 to +4025
count1 = j - i;
last_size = j == count ?
MIN(end, obj_size) - (end - PAGE_SIZE) : PAGE_SIZE;
error = dmu_read_pages(zfsvfs->z_os, zp->z_id, &ma[i], count1,
i == 0 ? &pgsin_b : NULL, j == count ? &pgsin_a : NULL,
last_size);
Member

@amotin amotin Nov 13, 2024


I wonder: once we read several (count1) pages at once, shouldn't we also increment i by the same number? I guess it may be fine via the vm_page_any_valid() continue above, and it seems to be cheap, but do we need those iterations at all?

But more seriously: dmu_read_pages() does not seem to expect NULL for its rbehind and rahead arguments; it requires pointers to zero values instead. @markjdb I think this may end in a NULL pointer dereference in whatever scenario this page-skipping algorithm handles.

Contributor


Looking over this again, I am seeing the same thing. I overlooked this originally.

Contributor Author


Somehow I convinced myself that a null pointer was OK, probably because zfs_getpages() itself permits null read-ahead/read-behind pointers.

I submitted a PR to fix this, thanks for taking another look: #16758

behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Nov 15, 2024
mappedread_sf() may allocate pages; if it fails to populate a page and
can't free it, it needs to ensure that it's placed into a page queue;
otherwise it can't be reclaimed until the vnode is destroyed.

I think this is quite unlikely to happen in practice; it was noticed by
code inspection.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes openzfs#16643
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Nov 15, 2024
As a deadlock avoidance measure, zfs_getpages() would only try to
acquire a rangelock, falling back to a single-page read if this was not
possible.  However, this is incompatible with direct I/O.

Instead, release the busy lock before trying to acquire the rangelock in
blocking mode.  This means that it's possible for the page to be
replaced, so we have to re-lookup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes openzfs#16643
ixhamza pushed a commit to truenas/zfs that referenced this pull request Dec 2, 2024
mappedread_sf() may allocate pages; if it fails to populate a page and
can't free it, it needs to ensure that it's placed into a page queue;
otherwise it can't be reclaimed until the vnode is destroyed.

I think this is quite unlikely to happen in practice; it was noticed by
code inspection.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes openzfs#16643
ixhamza pushed a commit to truenas/zfs that referenced this pull request Dec 2, 2024
As a deadlock avoidance measure, zfs_getpages() would only try to
acquire a rangelock, falling back to a single-page read if this was not
possible.  However, this is incompatible with direct I/O.

Instead, release the busy lock before trying to acquire the rangelock in
blocking mode.  This means that it's possible for the page to be
replaced, so we have to re-lookup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes openzfs#16643
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Jan 26, 2025
mappedread_sf() may allocate pages; if it fails to populate a page and
can't free it, it needs to ensure that it's placed into a page queue;
otherwise it can't be reclaimed until the vnode is destroyed.

I think this is quite unlikely to happen in practice; it was noticed by
code inspection.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes openzfs#16643
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Jan 26, 2025
As a deadlock avoidance measure, zfs_getpages() would only try to
acquire a rangelock, falling back to a single-page read if this was not
possible.  However, this is incompatible with direct I/O.

Instead, release the busy lock before trying to acquire the rangelock in
blocking mode.  This means that it's possible for the page to be
replaced, so we have to re-lookup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Mark Johnston <markj@FreeBSD.org>
Closes openzfs#16643