
Conversation

Contributor

@kemkemG0 kemkemG0 commented May 3, 2024

fix: #4108
/claim #4108

Description

Concern

TODO ?

I wasn't sure if I should fix collection create, vector create, and snapshot create in this PR, but I'm willing to fix them if needed :)

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Comment on lines +72 to +75
for points_batch in generate_points(points_amount):
    insert_points(qdrant_host, collection_name, points_batch)
    search_point(qdrant_host, collection_name)

Contributor Author


Ensure that search works after insertion

Comment on lines +38 to +47
)
EXPECTED_ERROR_MESSAGE = "No space left on device"
if resp.status_code != 200:
    if resp.status_code == 500 and EXPECTED_ERROR_MESSAGE in resp.text:
        requests.put(f"{qdrant_host}/collections/{collection_name}/points?wait=true", json=batch_json)
    else:
        error_response = resp.json()
        print(f"Points insertions failed with response body:\n{error_response}")
        exit(-2)

Contributor Author


Either status_code: 200, or status_code: 500 with EXPECTED_ERROR_MESSAGE, is expected

@kemkemG0 kemkemG0 marked this pull request as ready for review May 4, 2024 00:17
@kemkemG0 kemkemG0 changed the title Fix/handle out of disk Handle Out-Of-Disk gracefully May 4, 2024
@kemkemG0
Contributor Author

kemkemG0 commented May 4, 2024

@generall @timvisee
Hi, I think I'm ready for review ! :)

@generall
Member

generall commented May 5, 2024

Hey @kemkemG0, thanks for the PR!

Before proceeding with the merge, I have a couple of follow-up questions regarding the proposed solution:

  • It seems like fs2 depends on libc: https://github.com/danburkert/fs2-rs?tab=readme-ov-file#platforms. Does that mean our MUSL builds won't work? If so, is it possible to make this dependency, and the overall disk check, optional?

  • I tested the latency of the check: on my system (local SSD, ext4) it takes about 10µs on average, which is fine. Is there an estimate of what we should expect on different systems, especially network-mounted disks? If it might have an impact on overall performance, is there a way to make those checks less frequent?
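The cost of one such free-space query can be measured directly. A minimal Python sketch (an analogy only; the PR measures the Rust `fs2` call, and `/tmp` is an arbitrary path chosen here for illustration):

```python
import os
import time

def time_free_space_check(path="/tmp", iterations=10_000):
    """Average the cost of one statvfs-backed free-space query."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.statvfs(path)  # the syscall behind fs2::available_space on Unix
    return (time.perf_counter() - start) / iterations  # seconds per call

per_call = time_free_space_check()
print(f"{per_call * 1e6:.2f} us per statvfs call")
```

On a local filesystem this typically lands in the microsecond range; on network-mounted disks the same loop would expose the extra round trip being discussed.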

/// - The disk space retrieval fails, detailing the failure reason.
/// - The available space is less than the configured WAL buffer size, specifying both the available and required space.
async fn ensure_sufficient_disk_space(&self) -> CollectionResult<()> {
    let disk_free_space_bytes = fs2::available_space(self.path.as_path()).map_err(|err| {
Member


Calling the sync available_space in an async context might not be so good performance-wise.

@generall
Member

generall commented May 5, 2024

I would take a look at https://github.com/al8n/fs4-rs

@kemkemG0
Contributor Author

kemkemG0 commented May 5, 2024

Hi @generall !

I took a look at fs4 and it seems great; it uses rustix and should be able to build with MUSL as well. I'll proceed with switching from fs2 to fs4. : )

While I'm not an expert on file systems or the statvfs syscall, I believe the increase in response time can be considered the sum of one RTT (round-trip time) plus one disk lookup, and I assume these statistics are typically cached in memory.

Also, quoting from a blog post by Tokio maintainer ryhl:

"A good rule of thumb is no more than 10 to 100 microseconds between each .await."

Your benchmark shows about 10µs, and it can be expected that responses on other systems will also come back within 100µs. Therefore, I believe that keeping it synchronous might be fine, but wrapping it with tokio::task::spawn_blocking might also be a good option.

P.S. Updated!
Switched fs2 -> fs4 and offloaded the sync IO with tokio::task::spawn_blocking
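The offloading pattern reads the same in any async runtime. A sketch with Python's asyncio as a stand-in for `tokio::task::spawn_blocking` (here `shutil.disk_usage` plays the role of `fs4::available_space`; names are illustrative, not the PR's API):

```python
import asyncio
import shutil

async def available_space(path: str) -> int:
    """Run the blocking free-space query on a worker thread so the
    event loop (cf. the tokio runtime) is never stalled by the syscall."""
    loop = asyncio.get_running_loop()
    usage = await loop.run_in_executor(None, shutil.disk_usage, path)
    return usage.free  # bytes available on the filesystem

free = asyncio.run(available_space("/"))
print(f"{free / 1e9:.1f} GB free")
```

The trade-off mirrors the discussion above: the thread handoff adds overhead of its own, so for a sub-100µs syscall either choice can be defensible.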


As a means to reduce these checks, I thought about changing the WAL flush function so that checks are only performed when the WAL is flushing. This would definitely reduce the number of calls, but since the WAL is used as an external crate, it cannot be modified directly. Changing it would involve forking and modifying the flush function to include the checks, but I'm worried this might be too complex for what we want to achieve.

Please let me know your thoughts or if there are any other aspects we should consider.

@generall
Member

generall commented May 6, 2024

As a means to reduce these checks, I thought about changing the WAL flush function so that checks are only performed when wal's flushing.

Do you assume that the main source of the failures is the WAL creating new segments?

@timvisee
Member

timvisee commented May 6, 2024

Do you assume that the main source of the failures is the WAL creating new segments?

IIRC I've also seen OOD failures in RocksDB. But checking for at least WAL capacity probably prevents that from happening too.

Comment on lines 231 to 263
async fn ensure_sufficient_disk_space(&self) -> CollectionResult<()> {
    // Offload the synchronous I/O operation to a blocking thread
    let path = self.path.clone();
    let disk_free_space_bytes: u64 =
        tokio::task::spawn_blocking(move || fs4::available_space(path.as_path()))
            .await
            .map_err(|e| {
                CollectionError::service_error(format!("Failed to join async task: {}", e))
            })?
            .map_err(|err| {
                CollectionError::service_error(format!(
                    "Failed to get free space for path: {} due to: {}",
                    self.path.as_path().display(),
                    err
                ))
            })?;
    let disk_buffer_bytes = self
        .collection_config
        .read()
        .await
        .wal_config
        .wal_capacity_mb
        * 1024
        * 1024;

    if disk_free_space_bytes < disk_buffer_bytes.try_into().unwrap_or_default() {
        return Err(CollectionError::service_error(
            "No space left on device: WAL buffer size exceeds available disk space".to_string(),
        ));
    }

    Ok(())
}
Member


Requesting the available space may fail. In my opinion, we shouldn't fail but return early with Ok(()) if that happens. That would prevent making Qdrant unusable if some platform or environment doesn't support this call properly.

@generall
Member

generall commented May 6, 2024

Here are some benchmarks on a network-attached disk (7500 IOPS).

I am uploading 500k points in batches of 10 points with the following:

docker run --rm -it qdrant/bfb:dev ./bfb -n 500000 -b 10 -d 128 -p 2 --uri 'http://10.0.0.2:6334' --skip-wait-index

Result across 3 runs with low dispersion in results:

dev takes: 27 seconds to upload
branch takes: 31 seconds for the same

which gives almost a 15% slowdown with the check.

I am afraid that would be too much, and we would need to find a better way to implement this.

@kemkemG0
Contributor Author

kemkemG0 commented May 7, 2024

@generall @timvisee

I've been considering two possible scenarios:

  1. The write to the database itself crashes.
  2. The disk space is nearly exhausted, and then the OS issues a system call that uses the disk and crashes. In fact, when debugging with println!, a function unrelated to the write continues to work until just before it is called, but nothing prints right after the call, suggesting it's not a panic but a crash.

which gives almost a 15% slowdown with the check.

I am afraid that would be too much, and we would need to find a better way to implement this.

Additionally, I mentioned

I thought about changing the WAL flush function so that checks are only performed when the WAL is flushing.

This is because in my current implementation a statvfs(3) call is made with each update request; by issuing the statvfs syscall only when the WAL is flushing, I believe it could be sped up by a factor of wal_size/request_size.

@timvisee
Member

timvisee commented May 7, 2024

Looking at the above performance test, maybe we can make it less intensive:

  • periodically query free disk space in a background job
  • or query on something like 1% of the update operations

I would probably prefer the first, to prevent a big hit on tail latencies. It could be part of our current flush loop, which might actually make sense because writes are likely to happen at that time.

I imagine this as a semi-global atomic in which we store the last read amount of free disk space. The current logic in this PR will just read that, versus doing a syscall in all cases.

Once we hit the condition we can query on 100% of the update requests to immediately recover once disk space is available.

@generall Thoughts?

@generall
Member

generall commented May 7, 2024

One alternative: make this check less frequent by using a timeout and a count of operations.

So, for example, check the disk space only if more than one second has passed, or more than N operations have been performed, since the last check.
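The alternative above can be sketched as a small throttle around the real query. A Python sketch with illustrative names (the actual implementation in the PR is Rust; `interval_s`, `max_ops`, and the class name are assumptions for this example):

```python
import shutil
import time

class ThrottledDiskCheck:
    """Perform the real free-space query only if more than `interval_s`
    seconds elapsed, or `max_ops` operations were counted, since the
    last check; otherwise return the cached value."""

    def __init__(self, query, interval_s=1.0, max_ops=1000):
        self._query = query           # e.g. lambda: shutil.disk_usage("/").free
        self._interval_s = interval_s
        self._max_ops = max_ops
        self._ops_since = max_ops     # force a real query on the first call
        self._last_t = 0.0
        self._cached = None

    def free_bytes(self):
        self._ops_since += 1
        now = time.monotonic()
        if self._ops_since >= self._max_ops or now - self._last_t >= self._interval_s:
            self._cached = self._query()   # the expensive statvfs-backed path
            self._ops_since = 0
            self._last_t = now
        return self._cached                # the cheap path, taken almost always

# Usage: nearly every call returns the cached value without a syscall.
check = ThrottledDiskCheck(lambda: shutil.disk_usage("/").free)
free_now = check.free_bytes()
```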

@xzfc
Contributor

xzfc commented May 7, 2024

but since the WAL is used as an external crate, it cannot be modified.

WAL is a Qdrant crate; you could send a PR to it: https://www.github.com/qdrant/wal

@kemkemG0
Contributor Author

@generall @timvisee
To summarize:

  1. Integrate a disk usage check as part of the flush loop to regularly monitor disk usage.
  2. Also perform the disk usage check every N iterations.
  3. Save the checked usage by mutating a shared variable.
  4. When updating or inserting into the database, refer to this variable.

I will try proceeding with the implementation based on this plan :)
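The four steps above can be sketched as a background refresher plus a shared cached value that the update path reads cheaply. A Python sketch with illustrative names (the real `DiskUsageWatcher` is Rust, hooks into the flush loop rather than its own thread, and uses an atomic rather than a lock):

```python
import shutil
import threading

class DiskUsageWatcher:
    """Steps 1-4: refresh free space in a background loop (a stand-in
    for the flush loop), store it in a shared variable, and let
    updates/inserts read the cached value instead of issuing a syscall."""

    def __init__(self, path, refresh_interval_s=1.0):
        self._path = path
        self._free_bytes = shutil.disk_usage(path).free  # initial sample
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._loop, args=(refresh_interval_s,), daemon=True
        )
        self._thread.start()

    def _loop(self, interval):
        # Steps 1-3: periodically re-query and mutate the shared variable.
        while not self._stop.wait(interval):
            free = shutil.disk_usage(self._path).free
            with self._lock:
                self._free_bytes = free

    def cached_free_bytes(self):
        # Step 4: what every update/insert consults; no syscall here.
        with self._lock:
            return self._free_bytes

    def stop(self):
        self._stop.set()
        self._thread.join()

watcher = DiskUsageWatcher("/", refresh_interval_s=0.5)
free = watcher.cached_free_bytes()
watcher.stop()
```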

@generall
However, I feel that 1 might be sufficient; why is 2 necessary?

@kemkemG0 kemkemG0 changed the title Handle Out-Of-Disk gracefully [WIP] Handle Out-Of-Disk gracefully May 12, 2024
@kemkemG0
Contributor Author

kemkemG0 commented May 12, 2024

@generall @timvisee

I implemented DiskUsageWatcher; could you check it, please?


Benchmark: 125 MBps, 1100 IOPS
docker run --rm -it --network host qdrant/bfb:dev ./bfb -n 500000 -b 10 -d 128 -p 2 --uri 'http://127.0.0.1:6334' --skip-wait-index

  • this branch:
    • 65, 71, 64, 60, 64, 63 sec
  • dev:
    • 64, 65, 63, 62, 65, 68 sec

There is some variation, but they appear to be within the same range.

@kemkemG0 kemkemG0 changed the title [WIP] Handle Out-Of-Disk gracefully Handle Out-Of-Disk gracefully May 12, 2024
@timvisee
Member

However, I feel that 1 might be sufficient; why is 2 necessary?

If the upsertion rate in terms of bytes is greater than the WAL size per flush interval, then it might fall through this check.

What are the above numbers? Do they represent the number of seconds for consecutive runs?

@kemkemG0
Contributor Author

@timvisee
I see; then I wonder what N should be, since every request can have a different insertion size, so a fixed N can't be decided :(

What are the above numbers? Do they represent the number of seconds for consecutive runs?

Yeah, they're the benchmarks on my end! I'd like you to double-check the benchmark as well, to confirm this branch isn't that slow.

@kemkemG0
Contributor Author

kemkemG0 commented May 13, 2024

Another idea, realizable by extending this implementation, is to vary the frequency of the free-space checks based on the amount of free disk space.

For example, as far as I remember the limit per request is 30 MB, so how about an implementation where, if there is more than 50 MB remaining (maybe 100 MB is safer?), the check runs only during the flush loop, and if it's below 50 MB, the check runs on every update?

It may sound heuristic, but the implementation shouldn't be too complex, and the accuracy is better.
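The threshold idea can be sketched in a few lines. A Python sketch, assuming an illustrative 50 MB watermark and hypothetical names (`on_update`, `on_flush`); the real code would sit on the Rust watcher:

```python
import shutil

LOW_WATERMARK = 50 * 1024 * 1024  # the illustrative 50 MB threshold above

class AdaptiveDiskCheck:
    """Above the watermark, refresh the cached value only when the
    flush loop runs; below it, refresh on every update so exhaustion
    is caught (and recovery noticed) immediately."""

    def __init__(self, path):
        self._path = path
        self.free_bytes = shutil.disk_usage(path).free

    def _refresh(self):
        self.free_bytes = shutil.disk_usage(self._path).free

    def on_update(self):
        if self.free_bytes < LOW_WATERMARK:
            self._refresh()  # low on space: pay the syscall every request
        return self.free_bytes

    def on_flush(self):
        self._refresh()      # normal cadence: re-query at flush time
        return self.free_bytes

checker = AdaptiveDiskCheck("/")
free = checker.on_flush()
```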

@timvisee
Member

Good question. The request size can be measured to determine this, but that may be over-engineering it.

Maybe we can pick N=1000 for now, refreshing every 1000 operations. A thousand vectors with dimensionality 1500 are 6 megabytes. I'd suggest implementing that first; we can fine-tune it later.
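The arithmetic behind that 6 MB figure, assuming f32 components (4 bytes each):

```python
n_vectors = 1000      # refresh the check every N = 1000 operations
dim = 1500            # vector dimensionality from the example above
bytes_per_float = 4   # f32 components

batch_bytes = n_vectors * dim * bytes_per_float
print(batch_bytes / 1_000_000)  # → 6.0 megabytes between refreshes
```

So at most ~6 MB of vector data can land between two consecutive refreshes at this N, which bounds how far the cached free-space value can lag.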

@kemkemG0
Contributor Author

@timvisee

It sounds nice; updated with N = 1000 :)

I commented almost simultaneously with you, so it probably wasn't visible to you. I'd also like you to check the previous idea, thanks!

@generall
Member

Hey @kemkemG0, we are going to merge it as-is now. Will also apply some minor updates later.

I like your idea about changing the check interval based on free disk space; I think we are going to follow it.

@generall generall merged commit a02dd92 into qdrant:dev May 13, 2024
@generall generall mentioned this pull request May 16, 2024
generall added a commit that referenced this pull request May 26, 2024
* tests: Add test on low disk

* Remove redundant assertion

* keep container after failure and print latest logs in console

* Add fs2 crate as a dependency and ensure sufficient disk space in LocalShard operations

* small fix

* Update test

* small fix

* use available_space

* Use `fs2` -> `fs4` and offload sync IO with `tokio::task::spawn_blocking`

* create DiskUsageWathcer

* chore: Remove unnecessary println statement in update_handler.rs

* chore: Fix typo in DiskUsageWatcher struct name

* chore: Refactor DiskUsageWatcher to improve disk usage tracking and update logic

---------

Co-authored-by: tellet-q <elena.dubrovina@qdrant.com>
Co-authored-by: generall <andrey@vasnetsov.com>
timonv referenced this pull request in bosun-ai/swiftide Jun 28, 2024

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [qdrant/qdrant](https://qdrant.com/)
([source](https://togithub.com/qdrant/qdrant)) | service | patch |
`v1.9.2` -> `v1.9.7` |

---

### Release Notes

<details>
<summary>qdrant/qdrant (qdrant/qdrant)</summary>

### [`v1.9.7`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.7)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.6...v1.9.7)

### Change log

#### Improvements

-
[https://github.com/qdrant/qdrant/pull/4517](https://togithub.com/qdrant/qdrant/pull/4517)
- Do not allow embedding the web UI in an iframe
-
[https://github.com/qdrant/qdrant/pull/4556](https://togithub.com/qdrant/qdrant/pull/4556)
- Include HNSW configuration in snapshots to fix some edge cases

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4555](https://togithub.com/qdrant/qdrant/pull/4555)
- Fix panic on start with sparse index from versions 1.9.3 to 1.9.6
-
[https://github.com/qdrant/qdrant/pull/4551](https://togithub.com/qdrant/qdrant/pull/4551)
- Fix positive/negative points IDs being excluded when using
recommendation search with `lookup_from`

### [`v1.9.6`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.6)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.5...v1.9.6)

### Change log

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4472](https://togithub.com/qdrant/qdrant/pull/4472)
- fix potential panic on recovery sparse vectors from crash
-
[https://github.com/qdrant/qdrant/pull/4426](https://togithub.com/qdrant/qdrant/pull/4426)
- improve error message on missing payload index
-
[https://github.com/qdrant/qdrant/pull/4375](https://togithub.com/qdrant/qdrant/pull/4375)
- fix in-place updates for sparse index
-
[https://github.com/qdrant/qdrant/pull/4523](https://togithub.com/qdrant/qdrant/pull/4523)
- fix missing payload index issue, introduced in v1.9.5

### [`v1.9.5`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.5)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.4...v1.9.5)

### Change log

#### Features

-
[https://github.com/qdrant/qdrant/pull/4254](https://togithub.com/qdrant/qdrant/pull/4254)
- Add pyroscope integration for continuous profiling on demand

#### Improvements

-
[https://github.com/qdrant/qdrant/pull/4309](https://togithub.com/qdrant/qdrant/pull/4309)
- Allow to configure default number of shards per node
-
[https://github.com/qdrant/qdrant/pull/4317](https://togithub.com/qdrant/qdrant/pull/4317)
- Allow to overwrite optimizer settings via config
-
[https://github.com/qdrant/qdrant/pull/4312](https://togithub.com/qdrant/qdrant/pull/4312),
[https://github.com/qdrant/qdrant/pull/4369](https://togithub.com/qdrant/qdrant/pull/4369)
- Improve vector size estimations, making index thresholds more reliable
-
[https://github.com/qdrant/qdrant/pull/4428](https://togithub.com/qdrant/qdrant/pull/4428)
- Improve default maximum segment size, base it on number of CPUs used
for indexing
-
[https://github.com/qdrant/qdrant/pull/4370](https://togithub.com/qdrant/qdrant/pull/4370)
- Use consistent RocksDB settings for both put and remove
-
[https://github.com/qdrant/qdrant/pull/4376](https://togithub.com/qdrant/qdrant/pull/4376)
- Improve ordering of insertions and deletions in RocksDB
-
[https://github.com/qdrant/qdrant/pull/4371](https://togithub.com/qdrant/qdrant/pull/4371)
- Log error if segment flushing failed on drop
-
[https://github.com/qdrant/qdrant/pull/4352](https://togithub.com/qdrant/qdrant/pull/4352)
- Promote REST request processing problems from warning to error
-
[https://github.com/qdrant/qdrant/pull/4368](https://togithub.com/qdrant/qdrant/pull/4368)
- Improve error messages in cases of missing vectors
-
[https://github.com/qdrant/qdrant/pull/4391](https://togithub.com/qdrant/qdrant/pull/4391)
- Improve shard state log message, not strictly related to snapshot
recovery
-
[https://github.com/qdrant/qdrant/pull/4414](https://togithub.com/qdrant/qdrant/pull/4414)
- Improve Dockerfile, don't invalidate caches each commit and allow
debug settings

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4402](https://togithub.com/qdrant/qdrant/pull/4402)
- Fix deadlock caused by concurrent snapshot and optimization
-
[https://github.com/qdrant/qdrant/pull/4411](https://togithub.com/qdrant/qdrant/pull/4411)
- Fix potentially losing vectors on crash by enabling RocksDB WAL
-
[https://github.com/qdrant/qdrant/pull/4416](https://togithub.com/qdrant/qdrant/pull/4416),
[https://github.com/qdrant/qdrant/pull/4440](https://togithub.com/qdrant/qdrant/pull/4440)
- Respect `max_segment_size` on data ingestion with optimizers disabled,
create segments as needed
-
[https://github.com/qdrant/qdrant/pull/4442](https://togithub.com/qdrant/qdrant/pull/4442)
- Fix potentially having bad HNSW links on multithreaded systems

### [`v1.9.4`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.4)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.3...v1.9.4)

### Change log

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4332](https://togithub.com/qdrant/qdrant/pull/4332)
- Fix potentially losing a segment when creating a snapshot with ongoing
updates
-
[https://github.com/qdrant/qdrant/pull/4342](https://togithub.com/qdrant/qdrant/pull/4342)
- Fix potential panic on start if there is no appendable segment
-
[https://github.com/qdrant/qdrant/pull/4328](https://togithub.com/qdrant/qdrant/pull/4328)
- Prevent panic when searching with huge limit

### [`v1.9.3`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.3)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.2...v1.9.3)

### Change log

#### Improvements

-
[https://github.com/qdrant/qdrant/pull/4165](https://togithub.com/qdrant/qdrant/pull/4165)
- Handle Out-Of-Disk on insertions gracefully
-
[https://github.com/qdrant/qdrant/pull/3964](https://togithub.com/qdrant/qdrant/pull/3964)
- Faster consensus convergence with batched updates
-
[https://github.com/qdrant/qdrant/pull/4301](https://togithub.com/qdrant/qdrant/pull/4301)
- Deduplicate points by ID for custom sharding

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4307](https://togithub.com/qdrant/qdrant/pull/4307)
- Fix overflow panic if scroll limit is usize::MAX
-
[https://github.com/qdrant/qdrant/pull/4322](https://togithub.com/qdrant/qdrant/pull/4322)
- Fix panic with missing sparse vectors after recovery of corrupted
storage

#### Web UI

-
[https://github.com/qdrant/qdrant-web-ui/pull/183](https://togithub.com/qdrant/qdrant-web-ui/pull/183)
- Notification for miss-configured collections

Full change log:
https://github.com/qdrant/qdrant-web-ui/releases/tag/v0.1.26

</details>


Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>