measure disk size of directories if available #7120
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (5)
lib/common/common/src/disk.rs (3)
23-26: Clarify docstring with semantics and caveats

Since this function estimates allocated bytes (vs. logical size), add clarifications to set expectations and aid future maintainers.

Apply this diff:

```diff
 /// How many bytes a directory takes on disk.
 /// Notes:
-/// - on non-unix systems, this is the same as `dir_size`
+/// - On Unix, uses `st_blocks * 512` (POSIX block units), which approximates allocated space and accounts for sparsity.
+/// - On non-Unix, falls back to logical length (`metadata.len()`), i.e. same behavior as `dir_size`.
+/// - Directory entry blocks are not counted; only file entries are summed.
+/// - Symlinks are currently followed (via `metadata()`), which can double-count targets and can recurse into linked directories.
```
29-43: Avoid following symlinks to prevent double-counting and cycles

The current implementation follows symlinks (via `metadata()`), which can double-count targets or even recurse indefinitely with cyclic links. Treating symlinks as links (size of the link itself) is safer and matches “allocated-bytes-in-this-tree” intuition.

Apply this diff to switch to `file.file_type()` and `symlink_metadata()`:

```diff
-        let metadata = file.metadata()?;
-        let size = if metadata.is_dir() {
-            dir_disk_size(std::fs::read_dir(file.path())?)?
-        } else {
-            #[cfg(unix)]
-            {
-                use std::os::unix::fs::MetadataExt;
-                metadata.blocks() * BLOCK_SIZE
-            }
-            #[cfg(not(unix))]
-            {
-                metadata.len()
-            }
-        };
+        let file_type = file.file_type()?;
+        let size = if file_type.is_dir() {
+            // Do not follow symlinked directories
+            dir_disk_size(std::fs::read_dir(file.path())?)?
+        } else {
+            // For symlinks, use symlink metadata to avoid following the target
+            let md = if file_type.is_symlink() {
+                std::fs::symlink_metadata(file.path())?
+            } else {
+                file.metadata()?
+            };
+            #[cfg(unix)]
+            {
+                use std::os::unix::fs::MetadataExt;
+                md.blocks() * BLOCK_SIZE
+            }
+            #[cfg(not(unix))]
+            {
+                md.len()
+            }
+        };
```

If you keep the current “follow symlinks” behavior for parity with `dir_size`, please update the docstring accordingly and consider detecting cycles separately.
27-49: Optional: de-duplicate hard links to improve accuracy on Unix

Hard-linked files can be counted multiple times. If this becomes a problem, track visited `(dev, ino)` pairs and only sum the first occurrence. I can draft a Unix-only implementation using a `HashSet<(u64, u64)>` and `MetadataExt::{dev, ino}`. Want me to provide a patch?

lib/collection/src/collection_manager/optimizers/segment_optimizer.rs (2)
133-133: Use checked_mul to avoid silent u64 overflow in “2× needed space”

While unlikely, extremely large collections could overflow `2 * x`. Falling back to “unknown” and proceeding with the existing warn path is safer.

Apply this diff:

```diff
-        let space_needed = space_occupied.map(|x| 2 * x);
+        let space_needed = space_occupied.and_then(|x| x.checked_mul(2));
```
160-166: Humanize disk-space logs and include temp path for quicker diagnostics

Raw byte counts are hard to read in logs. Add human-readable formatting and mention the path whose space was measured.

Apply this diff:

```diff
-        log::debug!(
-            "Available space: {space_available}, needed for optimization: {space_needed}",
-        );
+        log::debug!(
+            "Available in {}: {}, needed for optimization: {}",
+            self.temp_path().display(),
+            human_bytes(space_available),
+            human_bytes(space_needed),
+        );
```

And this diff for the error (still within this block):

```diff
-            return Err(CollectionError::service_error(format!(
-                "Not enough space available for optimization, needed: {space_needed}, available: {space_available}"
-            )));
+            return Err(CollectionError::service_error(format!(
+                "Not enough space for optimization: need {}, available {} in {}",
+                human_bytes(space_needed),
+                human_bytes(space_available),
+                self.temp_path().display()
+            )));
```

Add this small helper near the top of the file (outside the shown hunks):

```rust
fn human_bytes(mut v: u64) -> String {
    const UNITS: [&str; 6] = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"];
    let mut unit = 0;
    while v >= 1024 && unit < UNITS.len() - 1 {
        v /= 1024;
        unit += 1;
    }
    format!("{v} {}", UNITS[unit])
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- lib/collection/src/collection_manager/optimizers/segment_optimizer.rs (3 hunks)
- lib/common/common/src/disk.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/collection/src/collection_manager/optimizers/segment_optimizer.rs (2)
- lib/common/common/src/disk.rs (2): dir_disk_size (26-49), dir_disk_size (27-46)
- lib/collection/src/operations/types.rs (1): service_error (1035-1040)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: storage-compat-test
🔇 Additional comments (3)
lib/common/common/src/disk.rs (1)
21-21: Add `#[cfg(unix)]` to BLOCK_SIZE to prevent dead_code warnings on non-Unix targets

Applying a target-guard to this constant avoids unused-code warnings (or CI failures if you later enable warnings-as-errors) on Windows and other non-Unix platforms. Please confirm that your CI doesn’t already deny warnings globally—if it does, this change is critical; otherwise it’s still recommended to keep the code warning-free.

Suggested diff (in lib/common/common/src/disk.rs):

```diff
-const BLOCK_SIZE: u64 = 512; // 512 bytes per block, aka DEV_BSIZE
+#[cfg(unix)]
+const BLOCK_SIZE: u64 = 512; // st_blocks are in 512-byte units (POSIX)
```

• Please manually verify whether your CI or build scripts ever pass `-D warnings` (e.g., via `deny(warnings)` in code, `RUSTFLAGS`, or workflow files). If you do enforce warnings as errors, this guard is mandatory; otherwise, it’s an optional refactor to keep builds clean.
• If you need help locating where warnings might be elevated to errors, you can re-run a repository-wide search for denial of warnings, for example:

```shell
rg -n 'deny\s*\(\s*warnings\s*\)' -t rs -S
rg -n '-D\s*warnings' -S
rg -n 'RUSTFLAGS.*-D warnings' -g '.github/**' -S
```

lib/collection/src/collection_manager/optimizers/segment_optimizer.rs (2)

9-9: LGTM: using dir_disk_size aligns the estimate with real on-disk usage

This directly addresses sparse-file overestimation and should reduce false “not enough disk” errors on Unix filesystems.

119-131: Graceful degradation on size-read errors looks good

The `Option` accumulator with debug logging keeps behavior predictable when some segments can’t be sized. No change requested. Confirm segments never contain symlinks into other trees; if they do, pairing this with the `dir_disk_size` symlink-handling change will avoid double-counting.
lib/common/common/src/disk.rs (Outdated)

```diff
@@ -18,6 +18,36 @@ pub fn dir_size(path: impl Into<PathBuf>) -> std::io::Result<u64> {
     dir_size(std::fs::read_dir(path.into())?)
 }

+const BLOCK_SIZE: u64 = 512; // 512 bytes per block, aka DEV_BSIZE
```
Should be under `#[cfg(unix)]` (or moved into the same code block as the call to `metadata.blocks()`), otherwise it produces a warning on Windows (and our CI doesn't catch it):

https://github.com/qdrant/qdrant/actions/runs/17166548660/job/48708117198?pr=7120#step:9:42

```text
warning: constant `BLOCK_SIZE` is never used
  --> lib\common\common\src\disk.rs:21:7
   |
21 | const BLOCK_SIZE: u64 = 512; // 512 bytes per block, aka DEV_BSIZE
   |       ^^^^^^^^^^
   |
   = note: `#[warn(dead_code)]` on by default
```
```rust
/// How many bytes a directory takes on disk.
/// Notes:
/// - on non-unix systems, this is the same as `dir_size`
pub fn dir_disk_size(path: impl Into<PathBuf>) -> std::io::Result<u64> {
```
I propose to drop the old fn `dir_size()`. I don't see a use-case for the size of the directory in apparent bytes rather than disk usage.
Actionable comments posted: 1
🧹 Nitpick comments (4)
lib/common/common/src/disk.rs (4)
9-9: Avoid shadowing: rename the inner helper for readability.

The nested helper has the same name as the public function. That’s easy to misread and invites bugs during future edits.

Minimal rename:

```diff
-fn dir_disk_size(mut dir: std::fs::ReadDir) -> std::io::Result<u64> {
+fn dir_disk_size_inner(mut dir: std::fs::ReadDir) -> std::io::Result<u64> {
@@
-    dir_disk_size(std::fs::read_dir(path.into())?)
+    dir_disk_size_inner(std::fs::read_dir(path.into())?)
```

Also applies to: 31-31
12-16: Optional: include directory entries’ own on‑disk size to better match `du -s`.

Right now, directory sizes (their own metadata blocks) are not counted; only their contents are. If you want closer parity with `du`, add the directory’s `st_blocks * 512` when descending on Unix.

Illustration (to combine with the symlink fix if you adopt it):

```diff
-        let size = if ft.is_dir() {
-            // Do not follow symlinks to directories; only traverse real dirs.
-            dir_disk_size(std::fs::read_dir(path)?)?
+        let size = if ft.is_dir() {
+            #[cfg(unix)]
+            {
+                use std::os::unix::fs::MetadataExt;
+                let self_blocks = std::fs::metadata(&path)?.blocks() * 512;
+                self_blocks + dir_disk_size(std::fs::read_dir(path)?)?
+            }
+            #[cfg(not(unix))]
+            {
+                dir_disk_size(std::fs::read_dir(path)?)?
+            }
```
22-25: Windows accuracy caveat (non‑blocking): consider a future `windows` feature to use real allocated size.

On Windows, `metadata.len()` reports apparent size; sparse/compressed files will still be overestimated. If/when acceptable, a Windows‑only path could use `GetCompressedFileSizeW` to approximate allocated bytes. Not needed for this PR’s goal but worth tracking. Happy to sketch a gated implementation behind `#[cfg(windows)]` and a small `winapi`/`windows` dependency if you want a follow‑up PR.
5-8: Refine `dir_disk_size` documentation and signature

- lib/common/common/src/disk.rs
  - Update the doc comment to remove the legacy `dir_size` reference and make it fully self-contained.
  - Change the public signature from `impl Into<PathBuf>` to `impl AsRef<Path>` and adjust the recursive call site from `path.into()` to `path.as_ref()`.
  - Remove the unused `use walkdir::WalkDir;` import.
- lib/collection/src/collection_manager/optimizers/segment_optimizer.rs
  - No changes required here—`locked_segment.data_path()` still implements `AsRef<Path>`, so the call site at line 120 continues to work as-is.

```diff
--- a/lib/common/common/src/disk.rs
+++ b/lib/common/common/src/disk.rs
@@ 1,9c1,14
-use std::path::{Path, PathBuf};
-
-use walkdir::WalkDir;
-
-/// How many bytes a directory takes on disk.
-/// Notes:
-/// - on non-unix systems, this is the same as `dir_size`
-pub fn dir_disk_size(path: impl Into<PathBuf>) -> std::io::Result<u64> {
+use std::path::{Path, PathBuf};
+
+// TODO: Remove this import if WalkDir is no longer used elsewhere
+// use walkdir::WalkDir;
+
+/// Returns an estimate of how many bytes a directory occupies on disk.
+///
+/// - On Unix, uses `st_blocks` (512-byte units) to account for sparse/compressed extents.
+/// - On non-Unix targets, falls back to the apparent size via `metadata.len()`.
+/// - Symlinks are not followed; the size of the link itself is counted.
+pub fn dir_disk_size(path: impl AsRef<Path>) -> std::io::Result<u64> {
@@ 29,7c34,7
 }
-    dir_disk_size(std::fs::read_dir(path.into())?)
+    dir_disk_size(std::fs::read_dir(path.as_ref())?)
 }
```
📒 Files selected for processing (1)
- lib/common/common/src/disk.rs (1 hunks)
🔇 Additional comments (1)
lib/common/common/src/disk.rs (1)
18-21: Scoped BLOCK_SIZE under cfg(unix) — good fix; this resolves the Windows dead_code warning.

Keeping the 512‑byte unit constant inside the Unix branch avoids the previous Windows warning noted in CI.
Actionable comments posted: 0
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/common/common/src/disk.rs (1)
47-47: Don’t follow symlinks here; it contradicts the doc and risks loops/duplication.

`follow_links(true)` will traverse into symlinked directories, potentially causing cycles and returning files multiple times. Since the function’s notes say “symlinks are considered to be files,” we should not descend into them.

```diff
-    for entry in WalkDir::new(dir).min_depth(1).follow_links(true) {
+    for entry in WalkDir::new(dir).min_depth(1) {
```
♻️ Duplicate comments (1)
lib/common/common/src/disk.rs (1)
13-27: Do not follow symlinks; prevent cycles and double-counting; also avoid shadowing the outer fn and include directory entry size.

- `file.metadata()` follows symlinks; a symlink to a directory will be treated as a directory and recursed into, which can create cycles (e.g., link → parent) and double‑count sizes.
- The inner helper shadows the public `dir_disk_size`; this is confusing and error‑prone.
- Current logic doesn’t include the directory entry’s own allocated size (Unix `st_blocks`) — underestimates disk usage, especially with many directories.

Refactor to branch on `file_type()`, use `symlink_metadata()` for links, avoid following symlinked dirs, include directory entry allocation on Unix, and rename the inner helper:

```diff
-fn dir_disk_size(mut dir: std::fs::ReadDir) -> std::io::Result<u64> {
+fn dir_disk_size_inner(mut dir: std::fs::ReadDir) -> std::io::Result<u64> {
     dir.try_fold(0, |acc, file| {
         let file = file?;
-        let metadata = file.metadata()?;
-        let size = if metadata.is_dir() {
-            dir_disk_size(std::fs::read_dir(file.path())?)?
-        } else {
-            #[cfg(unix)]
-            {
-                const BLOCK_SIZE: u64 = 512; // aka DEV_BSIZE
-                use std::os::unix::fs::MetadataExt;
-                metadata.blocks() * BLOCK_SIZE
-            }
-            #[cfg(not(unix))]
-            {
-                metadata.len()
-            }
-        };
+        let ft = file.file_type()?;
+        let path = file.path();
+        let size = if ft.is_dir() {
+            // Count the directory entry itself; do NOT follow symlinked dirs.
+            let meta = std::fs::symlink_metadata(&path)?;
+            #[cfg(unix)]
+            {
+                use std::os::unix::fs::MetadataExt;
+                let dir_self = meta.blocks() * 512;
+                dir_self + dir_disk_size_inner(std::fs::read_dir(&path)?)?
+            }
+            #[cfg(not(unix))]
+            {
+                let dir_self = meta.len();
+                dir_self + dir_disk_size_inner(std::fs::read_dir(&path)?)?
+            }
+        } else if ft.is_symlink() {
+            // Count the link itself to avoid cycles/double-counting.
+            let meta = std::fs::symlink_metadata(&path)?;
+            #[cfg(unix)]
+            {
+                use std::os::unix::fs::MetadataExt;
+                meta.blocks() * 512
+            }
+            #[cfg(not(unix))]
+            {
+                meta.len()
+            }
+        } else {
+            let meta = std::fs::metadata(&path)?;
+            #[cfg(unix)]
+            {
+                use std::os::unix::fs::MetadataExt;
+                meta.blocks() * 512
+            }
+            #[cfg(not(unix))]
+            {
+                meta.len()
+            }
+        };
         Ok(acc + size)
     })
 }
@@
-    dir_disk_size(std::fs::read_dir(path.into())?)
+    dir_disk_size_inner(std::fs::read_dir(path.into())?)
```

Also applies to: 9-16, 32-32
🧹 Nitpick comments (5)
lib/common/common/src/disk.rs (5)
5-9: Tighten docs: state symlink and directory-entry semantics explicitly.

Make it clear that symlinks are not followed and that (on Unix) directory entries’ own allocations are included.

```diff
 /// How many bytes a directory takes on disk.
 ///
 /// Note: on non-unix systems, this function returns the apparent/logical
 /// directory size rather than actual disk usage.
+///
+/// Symlink semantics:
+/// - Symlinks are not followed during traversal; the link itself is counted.
+/// - On Unix, sizes include directory entries (st_blocks) and file allocations.
```
37-44
: Docs vs behavior mismatch for non-directories and symlinked dirs.
- Docs say “a directory must be given,” but non-dirs return an empty list. Either error or document the behavior. Given current code, document it.
- Clarify that symlinked directories are not traversed (paired with the change above).
```diff
 /// Notes:
-/// - a directory must be given
-/// - symlinks are considered to be files
+/// - If the path is not a directory, returns an empty list.
+/// - Symlinks are considered to be files and are returned as entries; symlinked directories are not traversed.
```
9-9: API ergonomics: prefer AsRef over Into.

Using `AsRef<Path>` avoids forcing a heap allocation and matches `list_files`. Since this is a new API, now is the cheapest time to adjust.

```diff
-pub fn dir_disk_size(path: impl Into<PathBuf>) -> std::io::Result<u64> {
+pub fn dir_disk_size(path: impl AsRef<Path>) -> std::io::Result<u64> {
@@
-    dir_disk_size_inner(std::fs::read_dir(path.into())?)
+    dir_disk_size_inner(std::fs::read_dir(path.as_ref())?)
```

If you adopt this, you can also drop `PathBuf` from the `use` list at Line 1.

Also applies to: 32-32
11-29: Optional: consider de-duplicating hard links on Unix.

In trees with many hard links, summing `st_blocks` for each entry will overcount. Tracking a `HashSet<(dev, ino)>` prevents double-counting, aligning with typical `du` behavior. If you want, I can draft a Unix-only version that tracks seen inodes with `std::os::unix::fs::MetadataExt::{ino, dev}` and falls back to the current behavior elsewhere.
1-56: Add targeted tests for symlinks, sparse files, and directory entries (Unix).

- Sparse file: verify that reported size uses allocated blocks (Unix) and differs from `len()`.
- Symlinked file and symlinked dir: ensure links are counted once and not traversed.
- Deep tree with many small directories: verify that directory entry sizes are included on Unix.

I can add integration tests under `#[cfg(unix)]` that create a temp tree with `std::os::unix::fs::symlink` and sparse files via `seek + write`.
📒 Files selected for processing (1)
- lib/common/common/src/disk.rs (1 hunks)
* measure disk size of directories if available
* review suggestions
* Update comment

---------

Co-authored-by: xzfc <xzfcpw@gmail.com>
If we merge a large number of empty segments, the logical disk size might be much higher than the real disk size due to the sparse files we use there.
In some corner cases this might cause a not-enough-disk error, which is not actually true.
This PR tries to use block-based file size estimation, which is supposed to be closer to the real size.