
Conversation

Contributor

@kemkemG0 kemkemG0 commented May 3, 2024

fix: #4108
/claim #4108

Description

Concern

TODO ?

I wasn't sure if I should fix collection create, vector create, and snapshot create in this PR, but I'm willing to fix them if needed :)

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Comment on lines +72 to +75
for points_batch in generate_points(points_amount):
    insert_points(qdrant_host, collection_name, points_batch)
    search_point(qdrant_host, collection_name)

Contributor Author


Ensure that search works after insertion

Comment on lines +38 to +47
)
EXPECTED_ERROR_MESSAGE = "No space left on device"
if resp.status_code != 200:
    if resp.status_code == 500 and EXPECTED_ERROR_MESSAGE in resp.text:
        requests.put(f"{qdrant_host}/collections/{collection_name}/points?wait=true", json=batch_json)
    else:
        error_response = resp.json()
        print(f"Points insertions failed with response body:\n{error_response}")
        exit(-2)

Contributor Author


Either status_code: 200, or status_code: 500 with EXPECTED_ERROR_MESSAGE, is expected

@kemkemG0 kemkemG0 marked this pull request as ready for review May 4, 2024 00:17
@kemkemG0 kemkemG0 changed the title Fix/handle out of disk Handle Out-Of-Disk gracefully May 4, 2024
@kemkemG0
Contributor Author

kemkemG0 commented May 4, 2024

@generall @timvisee
Hi, I think I'm ready for review ! :)

@generall
Member

generall commented May 5, 2024

Hey @kemkemG0, thanks for the PR!

Before proceeding with the merge, I have a couple of follow-up questions regarding the proposed solution:

  • It seems like fs2 depends on libc: https://github.com/danburkert/fs2-rs?tab=readme-ov-file#platforms. Does that mean our MUSL builds won't work? If so, is it possible to make this dependency, and the overall disk check, optional?

  • I tested the latency of the check: on my system (local SSD, ext4) it takes about 10µs on average, which is fine. Is there an estimate of what we should expect on different systems, especially network-mounted disks? If it might have an impact on overall performance, is there a way to make those checks less frequent?
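The cost of one such free-space query can be measured directly. A minimal Python sketch (an analogy only; the PR measures the Rust `fs2` call, and `/tmp` is an arbitrary path chosen here for illustration):

```python
import os
import time

def time_free_space_check(path="/tmp", iterations=10_000):
    """Average the cost of one statvfs-backed free-space query."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.statvfs(path)  # the syscall behind fs2::available_space on Unix
    return (time.perf_counter() - start) / iterations  # seconds per call

per_call = time_free_space_check()
print(f"{per_call * 1e6:.2f} us per statvfs call")
```

On a local filesystem this typically lands in the microsecond range; on network-mounted disks the same loop would expose the extra round trip being discussed.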

/// - The disk space retrieval fails, detailing the failure reason.
/// - The available space is less than the configured WAL buffer size, specifying both the available and required space.
async fn ensure_sufficient_disk_space(&self) -> CollectionResult<()> {
    let disk_free_space_bytes = fs2::available_space(self.path.as_path()).map_err(|err| {
Member


Calling the sync available_space in an async context might not be so good performance-wise.

@generall
Member

generall commented May 5, 2024

I would take a look at https://github.com/al8n/fs4-rs

@kemkemG0
Contributor Author

kemkemG0 commented May 5, 2024

Hi @generall !

I took a look at fs4 and it seems great; it uses rustix and should be able to build with MUSL as well. I'll proceed with switching from fs2 to fs4. : )

While I'm not an expert on file systems or the statvfs syscall, I believe the increase in response time can be considered the sum of one RTT (round-trip time) plus one disk lookup, and I assume these statistics are typically cached in memory.

Also, quoting from a blog post by Tokio maintainer ryhl:

"A good rule of thumb is no more than 10 to 100 microseconds between each .await."

Your benchmark shows about 10µs, and it can be expected that responses on other systems will also come back within 100µs. Therefore, I believe that keeping it synchronous might be fine, but wrapping it with tokio::task::spawn_blocking might also be a good option.

P.S. Updated!
Switched fs2 -> fs4 and offloaded the sync IO with tokio::task::spawn_blocking
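The offloading pattern reads the same in any async runtime. A sketch with Python's asyncio as a stand-in for `tokio::task::spawn_blocking` (here `shutil.disk_usage` plays the role of `fs4::available_space`; names are illustrative, not the PR's API):

```python
import asyncio
import shutil

async def available_space(path: str) -> int:
    """Run the blocking free-space query on a worker thread so the
    event loop (cf. the tokio runtime) is never stalled by the syscall."""
    loop = asyncio.get_running_loop()
    usage = await loop.run_in_executor(None, shutil.disk_usage, path)
    return usage.free  # bytes available on the filesystem

free = asyncio.run(available_space("/"))
print(f"{free / 1e9:.1f} GB free")
```

The trade-off mirrors the discussion above: the thread handoff adds overhead of its own, so for a sub-100µs syscall either choice can be defensible.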


As a means to reduce these checks, I thought about changing the WAL flush function so that checks are only performed when the WAL is flushing. This would definitely reduce the number of calls, but since the WAL is used as an external crate, it cannot be modified directly. Changing it would involve forking and modifying the flush function to include the checks, but I'm worried this might be too complex for what we want to achieve.

Please let me know your thoughts or if there are any other aspects we should consider.

@generall
Member

generall commented May 6, 2024

As a means to reduce these checks, I thought about changing the WAL flush function so that checks are only performed when wal's flushing.

Do you assume that the main source of the failures is the WAL creating new segments?

@timvisee
Member

timvisee commented May 6, 2024

Do you assume that the main source of the failures is the WAL creating new segments?

IIRC I've also seen OOD failures in RocksDB. But checking for at least WAL capacity probably prevents that from happening too.

Comment on lines 231 to 263
async fn ensure_sufficient_disk_space(&self) -> CollectionResult<()> {
    // Offload the synchronous I/O operation to a blocking thread
    let path = self.path.clone();
    let disk_free_space_bytes: u64 =
        tokio::task::spawn_blocking(move || fs4::available_space(path.as_path()))
            .await
            .map_err(|e| {
                CollectionError::service_error(format!("Failed to join async task: {}", e))
            })?
            .map_err(|err| {
                CollectionError::service_error(format!(
                    "Failed to get free space for path: {} due to: {}",
                    self.path.as_path().display(),
                    err
                ))
            })?;
    let disk_buffer_bytes = self
        .collection_config
        .read()
        .await
        .wal_config
        .wal_capacity_mb
        * 1024
        * 1024;

    if disk_free_space_bytes < disk_buffer_bytes.try_into().unwrap_or_default() {
        return Err(CollectionError::service_error(
            "No space left on device: WAL buffer size exceeds available disk space".to_string(),
        ));
    }

    Ok(())
}
Member


Requesting the available space may fail. In my opinion, we shouldn't fail but return early with Ok(()) if that happens. That would prevent making Qdrant unusable if some platform or environment doesn't support this call properly.

@generall
Member

generall commented May 6, 2024

Here are some benchmarks on a network-attached disk (7500 IOPS).

I am uploading 500k points in batches of 10 points with the following:

docker run --rm -it qdrant/bfb:dev ./bfb -n 500000 -b 10 -d 128 -p 2 --uri 'http://10.0.0.2:6334' --skip-wait-index

Result across 3 runs with low dispersion in results:

dev takes: 27 seconds to upload
branch takes: 31 seconds for the same

which gives almost a 15% slowdown with the check.

I am afraid that would be too much, and we would need to find a better way to implement this.

@kemkemG0
Contributor Author

kemkemG0 commented May 7, 2024

@generall @timvisee

I've been considering two possible scenarios:

  1. The write to the database itself crashes.
  2. The disk space is nearly exhausted, and then the OS issues a system call that uses the disk and crashes. In fact, when debugging with println!, a function unrelated to the write continues to work until just before it is called, but nothing prints right after the call, suggesting it's not a panic but a crash.

which gives almost a 15% slowdown with the check.

I am afraid that would be too much, and we would need to find a better way to implement this.

Additionally, I mentioned

I thought about changing the WAL flush function so that checks are only performed when the WAL is flushing.

This is because in my current implementation a statvfs(3) call is made with each update request; by issuing the statvfs syscall only when the WAL is flushing, I believe it could be sped up by a factor of wal_size/request_size.

@timvisee
Member

timvisee commented May 7, 2024

Looking at the above performance test, maybe we can make it less intensive:

  • periodically query free disk space in a background job
  • or query on something like 1% of the update operations

I would probably prefer the first, to prevent a big hit on tail latencies. It could be part of our current flush loop, which might actually make sense because writes are likely to happen at that time.

I imagine this as a semi-global atomic in which we store the last read amount of free disk space. The current logic in this PR will just read that, versus doing a syscall in all cases.

Once we hit the condition we can query on 100% of the update requests to immediately recover once disk space is available.

@generall Thoughts?

@generall
Member

generall commented May 7, 2024

One alternative: make this check less frequent by using a timeout and a count of operations.

So, for example, check the disk space only if more than one second has passed, or more than N operations have been performed, since the last check.
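The alternative above can be sketched as a small throttle around the real query. A Python sketch with illustrative names (the actual implementation in the PR is Rust; `interval_s`, `max_ops`, and the class name are assumptions for this example):

```python
import shutil
import time

class ThrottledDiskCheck:
    """Perform the real free-space query only if more than `interval_s`
    seconds elapsed, or `max_ops` operations were counted, since the
    last check; otherwise return the cached value."""

    def __init__(self, query, interval_s=1.0, max_ops=1000):
        self._query = query           # e.g. lambda: shutil.disk_usage("/").free
        self._interval_s = interval_s
        self._max_ops = max_ops
        self._ops_since = max_ops     # force a real query on the first call
        self._last_t = 0.0
        self._cached = None

    def free_bytes(self):
        self._ops_since += 1
        now = time.monotonic()
        if self._ops_since >= self._max_ops or now - self._last_t >= self._interval_s:
            self._cached = self._query()   # the expensive statvfs-backed path
            self._ops_since = 0
            self._last_t = now
        return self._cached                # the cheap path, taken almost always

# Usage: nearly every call returns the cached value without a syscall.
check = ThrottledDiskCheck(lambda: shutil.disk_usage("/").free)
free_now = check.free_bytes()
```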

@xzfc
Contributor

xzfc commented May 7, 2024

but since the WAL is used as an external crate, it cannot be modified.

WAL is a Qdrant crate; you could send a PR to it: https://www.github.com/qdrant/wal

@kemkemG0
Contributor Author

@generall @timvisee
To summarize:

  1. Integrate a disk usage check as part of the flush loop to regularly monitor disk usage.
  2. Also perform the disk usage check every N iterations.
  3. Save the checked usage by mutating a shared variable.
  4. When updating or inserting into the database, refer to this variable.

I will try proceeding with the implementation based on this plan :)
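The four steps above can be sketched as a background refresher plus a shared cached value that the update path reads cheaply. A Python sketch with illustrative names (the real `DiskUsageWatcher` is Rust, hooks into the flush loop rather than its own thread, and uses an atomic rather than a lock):

```python
import shutil
import threading

class DiskUsageWatcher:
    """Steps 1-4: refresh free space in a background loop (a stand-in
    for the flush loop), store it in a shared variable, and let
    updates/inserts read the cached value instead of issuing a syscall."""

    def __init__(self, path, refresh_interval_s=1.0):
        self._path = path
        self._free_bytes = shutil.disk_usage(path).free  # initial sample
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._loop, args=(refresh_interval_s,), daemon=True
        )
        self._thread.start()

    def _loop(self, interval):
        # Steps 1-3: periodically re-query and mutate the shared variable.
        while not self._stop.wait(interval):
            free = shutil.disk_usage(self._path).free
            with self._lock:
                self._free_bytes = free

    def cached_free_bytes(self):
        # Step 4: what every update/insert consults; no syscall here.
        with self._lock:
            return self._free_bytes

    def stop(self):
        self._stop.set()
        self._thread.join()

watcher = DiskUsageWatcher("/", refresh_interval_s=0.5)
free = watcher.cached_free_bytes()
watcher.stop()
```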

@generall
However, I feel that 1 might be sufficient; why is 2 necessary?

@kemkemG0 kemkemG0 changed the title Handle Out-Of-Disk gracefully [WIP] Handle Out-Of-Disk gracefully May 12, 2024
@kemkemG0
Contributor Author

kemkemG0 commented May 12, 2024

@generall @timvisee

I implemented DiskUsageWatcher; could you check it, please?


Benchmark: 125 MBps, 1100 IOPS
docker run --rm -it --network host qdrant/bfb:dev ./bfb -n 500000 -b 10 -d 128 -p 2 --uri 'http://127.0.0.1:6334' --skip-wait-index

  • this branch:
    • 65, 71, 64, 60, 64, 63 sec
  • dev:
    • 64, 65, 63, 62, 65, 68 sec

There is some variation, but they appear to be within the same range.

@kemkemG0 kemkemG0 changed the title [WIP] Handle Out-Of-Disk gracefully Handle Out-Of-Disk gracefully May 12, 2024
@timvisee
Member

However, I feel that 1 might be sufficient; why is 2 necessary?

If the upsertion rate in terms of bytes is greater than the WAL size per flush interval, then it might fall through this check.

What are the above numbers? Do they represent the number of seconds for consecutive runs?

@kemkemG0
Contributor Author

@timvisee
I see; then I wonder what N should be, since every request can have a different insertion size, so a fixed N can't be decided :(

What are the above numbers? Do they represent the number of seconds for consecutive runs?

Yeah, they're the benchmarks on my end! I'd like you to double-check the benchmark as well, to confirm this branch isn't that slow.

@kemkemG0
Contributor Author

kemkemG0 commented May 13, 2024

Another idea, realizable by extending this implementation, is to vary the frequency of the free-space checks based on the amount of free disk space.

For example, as far as I remember the limit per request is 30 MB, so how about an implementation where, if there is more than 50 MB remaining (maybe 100 MB is safer?), the check runs only during the flush loop, and if it's below 50 MB, the check runs on every update?

It may sound heuristic, but the implementation shouldn't be too complex, and the accuracy is better.
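The threshold idea can be sketched in a few lines. A Python sketch, assuming an illustrative 50 MB watermark and hypothetical names (`on_update`, `on_flush`); the real code would sit on the Rust watcher:

```python
import shutil

LOW_WATERMARK = 50 * 1024 * 1024  # the illustrative 50 MB threshold above

class AdaptiveDiskCheck:
    """Above the watermark, refresh the cached value only when the
    flush loop runs; below it, refresh on every update so exhaustion
    is caught (and recovery noticed) immediately."""

    def __init__(self, path):
        self._path = path
        self.free_bytes = shutil.disk_usage(path).free

    def _refresh(self):
        self.free_bytes = shutil.disk_usage(self._path).free

    def on_update(self):
        if self.free_bytes < LOW_WATERMARK:
            self._refresh()  # low on space: pay the syscall every request
        return self.free_bytes

    def on_flush(self):
        self._refresh()      # normal cadence: re-query at flush time
        return self.free_bytes

checker = AdaptiveDiskCheck("/")
free = checker.on_flush()
```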

@timvisee
Member

Good question. The request size can be measured to determine this, but that may be over-engineering it.

Maybe we can pick N=1000 for now, refreshing every 1000 operations. A thousand vectors with dimensionality 1500 are 6 megabytes. I'd suggest implementing that first; we can fine-tune it later.
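The arithmetic behind that 6 MB figure, assuming f32 components (4 bytes each):

```python
n_vectors = 1000      # refresh the check every N = 1000 operations
dim = 1500            # vector dimensionality from the example above
bytes_per_float = 4   # f32 components

batch_bytes = n_vectors * dim * bytes_per_float
print(batch_bytes / 1_000_000)  # → 6.0 megabytes between refreshes
```

So at most ~6 MB of vector data can land between two consecutive refreshes at this N, which bounds how far the cached free-space value can lag.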

@kemkemG0
Contributor Author

@timvisee

It sounds nice; updated with N = 1000 :)

I commented almost simultaneously with you, so it probably wasn't visible to you. I'd also like you to check the previous idea, thanks!

@generall
Member

Hey @kemkemG0, we are going to merge it as-is now. Will also apply some minor updates later.

I like your idea about changing the check interval based on free disk space; I think we are going to follow it.

@generall generall merged commit a02dd92 into qdrant:dev May 13, 2024
@generall generall mentioned this pull request May 16, 2024
generall added a commit that referenced this pull request May 26, 2024
* tests: Add test on low disk

* Remove redundant assertion

* keep container after failure and print latest logs in console

* Add fs2 crate as a dependency and ensure sufficient disk space in LocalShard operations

* small fix

* Update test

* small fix

* use available_space

* Use `fs2` -> `fs4` and offload sync IO with `tokio::task::spawn_blocking`

* create DiskUsageWathcer

* chore: Remove unnecessary println statement in update_handler.rs

* chore: Fix typo in DiskUsageWatcher struct name

* chore: Refactor DiskUsageWatcher to improve disk usage tracking and update logic

---------

Co-authored-by: tellet-q <elena.dubrovina@qdrant.com>
Co-authored-by: generall <andrey@vasnetsov.com>
timonv referenced this pull request in bosun-ai/swiftide Jun 28, 2024

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [qdrant/qdrant](https://qdrant.com/)
([source](https://togithub.com/qdrant/qdrant)) | service | patch |
`v1.9.2` -> `v1.9.7` |

---

### Release Notes

<details>
<summary>qdrant/qdrant (qdrant/qdrant)</summary>

### [`v1.9.7`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.7)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.6...v1.9.7)

### Change log

#### Improvements

-
[https://github.com/qdrant/qdrant/pull/4517](https://togithub.com/qdrant/qdrant/pull/4517)
- Do not allow embedding the web UI in an iframe
-
[https://github.com/qdrant/qdrant/pull/4556](https://togithub.com/qdrant/qdrant/pull/4556)
- Include HNSW configuration in snapshots to fix some edge cases

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4555](https://togithub.com/qdrant/qdrant/pull/4555)
- Fix panic on start with sparse index from versions 1.9.3 to 1.9.6
-
[https://github.com/qdrant/qdrant/pull/4551](https://togithub.com/qdrant/qdrant/pull/4551)
- Fix positive/negative points IDs being excluded when using
recommendation search with `lookup_from`

### [`v1.9.6`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.6)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.5...v1.9.6)

### Change log

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4472](https://togithub.com/qdrant/qdrant/pull/4472)
- fix potential panic on recovery sparse vectors from crash
-
[https://github.com/qdrant/qdrant/pull/4426](https://togithub.com/qdrant/qdrant/pull/4426)
- improve error message on missing payload index
-
[https://github.com/qdrant/qdrant/pull/4375](https://togithub.com/qdrant/qdrant/pull/4375)
- fix in-place updates for sparse index
-
[https://github.com/qdrant/qdrant/pull/4523](https://togithub.com/qdrant/qdrant/pull/4523)
- fix missing payload index issue, introduced in v1.9.5

### [`v1.9.5`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.5)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.4...v1.9.5)

### Change log

#### Features

-
[https://github.com/qdrant/qdrant/pull/4254](https://togithub.com/qdrant/qdrant/pull/4254)
- Add pyroscope integration for continuous profiling on demand

#### Improvements

-
[https://github.com/qdrant/qdrant/pull/4309](https://togithub.com/qdrant/qdrant/pull/4309)
- Allow to configure default number of shards per node
-
[https://github.com/qdrant/qdrant/pull/4317](https://togithub.com/qdrant/qdrant/pull/4317)
- Allow to overwrite optimizer settings via config
-
[https://github.com/qdrant/qdrant/pull/4312](https://togithub.com/qdrant/qdrant/pull/4312),
[https://github.com/qdrant/qdrant/pull/4369](https://togithub.com/qdrant/qdrant/pull/4369)
- Improve vector size estimations, making index thresholds more reliable
-
[https://github.com/qdrant/qdrant/pull/4428](https://togithub.com/qdrant/qdrant/pull/4428)
- Improve default maximum segment size, base it on number of CPUs used
for indexing
-
[https://github.com/qdrant/qdrant/pull/4370](https://togithub.com/qdrant/qdrant/pull/4370)
- Use consistent RocksDB settings for both put and remove
-
[https://github.com/qdrant/qdrant/pull/4376](https://togithub.com/qdrant/qdrant/pull/4376)
- Improve ordering of insertions and deletions in RocksDB
-
[https://github.com/qdrant/qdrant/pull/4371](https://togithub.com/qdrant/qdrant/pull/4371)
- Log error if segment flushing failed on drop
-
[https://github.com/qdrant/qdrant/pull/4352](https://togithub.com/qdrant/qdrant/pull/4352)
- Promote REST request processing problems from warning to error
-
[https://github.com/qdrant/qdrant/pull/4368](https://togithub.com/qdrant/qdrant/pull/4368)
- Improve error messages in cases of missing vectors
-
[https://github.com/qdrant/qdrant/pull/4391](https://togithub.com/qdrant/qdrant/pull/4391)
- Improve shard state log message, not strictly related to snapshot
recovery
-
[https://github.com/qdrant/qdrant/pull/4414](https://togithub.com/qdrant/qdrant/pull/4414)
- Improve Dockerfile, don't invalidate caches each commit and allow
debug settings

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4402](https://togithub.com/qdrant/qdrant/pull/4402)
- Fix deadlock caused by concurrent snapshot and optimization
-
[https://github.com/qdrant/qdrant/pull/4411](https://togithub.com/qdrant/qdrant/pull/4411)
- Fix potentially losing vectors on crash by enabling RocksDB WAL
-
[https://github.com/qdrant/qdrant/pull/4416](https://togithub.com/qdrant/qdrant/pull/4416),
[https://github.com/qdrant/qdrant/pull/4440](https://togithub.com/qdrant/qdrant/pull/4440)
- Respect `max_segment_size` on data ingestion with optimizers disabled,
create segments as needed
-
[https://github.com/qdrant/qdrant/pull/4442](https://togithub.com/qdrant/qdrant/pull/4442)
- Fix potentially having bad HNSW links on multithreaded systems

### [`v1.9.4`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.4)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.3...v1.9.4)

### Change log

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4332](https://togithub.com/qdrant/qdrant/pull/4332)
- Fix potentially losing a segment when creating a snapshot with ongoing
updates
-
[https://github.com/qdrant/qdrant/pull/4342](https://togithub.com/qdrant/qdrant/pull/4342)
- Fix potential panic on start if there is no appendable segment
-
[https://github.com/qdrant/qdrant/pull/4328](https://togithub.com/qdrant/qdrant/pull/4328)
- Prevent panic when searching with huge limit

### [`v1.9.3`](https://togithub.com/qdrant/qdrant/releases/tag/v1.9.3)

[Compare
Source](https://togithub.com/qdrant/qdrant/compare/v1.9.2...v1.9.3)

### Change log

#### Improvements

-
[https://github.com/qdrant/qdrant/pull/4165](https://togithub.com/qdrant/qdrant/pull/4165)
- Handle Out-Of-Disk on insertions gracefully
-
[https://github.com/qdrant/qdrant/pull/3964](https://togithub.com/qdrant/qdrant/pull/3964)
- Faster consensus convergence with batched updates
-
[https://github.com/qdrant/qdrant/pull/4301](https://togithub.com/qdrant/qdrant/pull/4301)
- Deduplicate points by ID for custom sharding

#### Bug fixes

-
[https://github.com/qdrant/qdrant/pull/4307](https://togithub.com/qdrant/qdrant/pull/4307)
- Fix overflow panic if scroll limit is usize::MAX
-
[https://github.com/qdrant/qdrant/pull/4322](https://togithub.com/qdrant/qdrant/pull/4322)
- Fix panic with missing sparse vectors after recovery of corrupted
storage

#### Web UI

-
[https://github.com/qdrant/qdrant-web-ui/pull/183](https://togithub.com/qdrant/qdrant-web-ui/pull/183)
- Notification for miss-configured collections

Full change log:
https://github.com/qdrant/qdrant-web-ui/releases/tag/v0.1.26

</details>


Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>