Local object cache for cached_httpfs extension #88
Conversation
Force-pushed from c891c27 to 47cde7d.
FYI, if the first connection is caching a remote file: until this file is downloaded, all other queries on the same remote file will fail.
Force-pushed from 8f948c9 to e1572de.
This seems useful in general, and not specific to pg_duckdb. How about doing this in the main duckdb repo, or as a separate extension?
I understand the premise here, but ideally the query would detect the lock and then continue, accessing the file "directly" as if the caching wasn't present (a sketch of this idea follows below). Another thing to consider: what if the original …
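A minimal sketch of that lock-detection fallback, assuming the cache uses `flock`-style file locks; the helper name and locking scheme are illustrative, not the PR's actual implementation:

```cpp
#include <fcntl.h>     // open, O_RDONLY
#include <sys/file.h>  // flock, LOCK_SH, LOCK_NB
#include <unistd.h>    // close

// Try to open the cached copy with a *non-blocking* shared lock. If a
// writer (the connection still downloading the object) holds the exclusive
// lock, give up immediately so the caller can read the remote object
// directly instead of failing the query.
static bool
TryOpenCachedFile(const char *cache_path, int *fd_out) {
	int fd = open(cache_path, O_RDONLY);
	if (fd < 0)
		return false; /* nothing cached yet: fall back to remote read */
	if (flock(fd, LOCK_SH | LOCK_NB) != 0) {
		close(fd);
		return false; /* download in progress: fall back to remote read */
	}
	*fd_out = fd; /* caller reads the cached copy, then unlocks and closes */
	return true;
}
```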
The test fails because we changed how replacement scan names are assigned:

```cpp
const string &catalog_name;
const string &schema_name;
const string &table_name;
```

To get back the old behavior, you can use …
I'd like to see this PR get split into three: …
Yeah, I think it makes much more sense to include this as a feature in the main httpfs extension, as opposed to maintaining a fork of that extension in this repo.
Typically duckdb does not need this type of caching because when you use duckdb, you have access to the local filesystem. We can consider contributing it back upstream at some point but the decision was made to work on this here since it's more relevant to this use case.
I get the point of being able to have the original code so we can diff it. Keeping this up to date with the upstream code is definitely a concern, but working with patches doesn't sound like fun either.
I don't understand this argument (but I might be missing something). Afaict the data requested by httpfs is by definition not on the local filesystem (unless you use httpfs to connect to some webserver on localhost of course). So providing a way to cache remote files onto the local filesystem seems pretty useful even for plain duckdb users of httpfs.
Force-pushed from c00f67a to dbde53e.
src/pgduckdb_utils.cpp
Outdated
```cpp
appendStringInfo(duckdb_data_directory, "%s/%s", DataDir, directory_name.c_str());

if (!CheckDirectory(duckdb_data_directory->data)) {
	if (mkdir(duckdb_data_directory->data, S_IRWXU | S_IRWXG | S_IRWXO) == -1) {
```
Should use `MakePGDirectory` instead:

```diff
- if (mkdir(duckdb_data_directory->data, S_IRWXU | S_IRWXG | S_IRWXO) == -1) {
+ if (MakePGDirectory(duckdb_data_directory->data) == -1) {
```
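For context: `MakePGDirectory` (PostgreSQL's directory-creation wrapper in `fd.c`) applies `pg_dir_create_mode`, so the cache directory would follow the cluster's group-access permission settings instead of the hardcoded `S_IRWXU | S_IRWXG | S_IRWXO` mask.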
src/pgduckdb_utils.cpp
Outdated
```cpp
} // namespace pgduckdb

void
DuckdbCreateCacheDirectory() {
```
> PG_DATADIR/duckdb_cache will be used as cache directory

Is the reason that you're choosing to create it in $PGDATA for convenience (e.g. the directory is guaranteed to exist with the correct permissions), or are there other reasons?

I think one implication of this is that backups such as pg_basebackup will also back up these cached files. I'm not sure if that's intentional or not, but if you expect the cache to be relatively large compared to the actual Postgres data files, this might bloat backup size.
Another wrinkle would be if users are running pg_upgrade with --link mode: they would need to do some processing beforehand to retain the cached files, but maybe that's not as big of a deal since it's relatively rare. E.g.:

```sh
mv $PGDATA/duckdb_cache duckdb_cache
pg_upgrade --old-datadir $PGDATA --new-datadir $NEW_PGDATA
mv duckdb_cache $NEW_PGDATA/duckdb_cache
```
Could always make the path configurable alongside #105, I suppose.
I think PG_DATADIR is the only thing we can be assured that Postgres can write to, aside from mktemp. But the intent was not to have that directory persist in backups, so that's a good point. A configurable option would be good.
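A minimal sketch of such a configurable option, assuming a hypothetical `duckdb.cache_directory` GUC (the name and default are assumptions, not something this PR defines):

```cpp
extern "C" {
#include "postgres.h"
#include "utils/guc.h"
}

/* Illustrative only: lets users point the cache outside $PGDATA so that
 * backups and pg_upgrade --link don't have to deal with cached files. */
static char *duckdb_cache_directory = NULL;

static void
DefineCacheDirectoryGuc(void) {
	DefineCustomStringVariable("duckdb.cache_directory",
	                           "Directory used to store cached remote objects.",
	                           NULL,
	                           &duckdb_cache_directory,
	                           "duckdb_cache", /* relative to the data directory */
	                           PGC_SUSET,
	                           0,
	                           NULL, NULL, NULL);
}
```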
Force-pushed from acd0366 to 8b05b25.
Force-pushed from d72d5df to df9d073.
* To cache a remote object, the user needs to explicitly use the `duckdb.cache(path, type)` function (see the example after this list). The path is a remote HTTPFS/S3/GCS/R2 object path, and the type is either `parquet` or `csv`, indicating the remote object type. Caching will not be triggered by normal `SELECT` queries.
* When a file is currently being cached and another query targeting the same remote object runs in parallel, the second query will fail because it will not be able to take a read lock.
* PG_DATADIR/duckdb_cache will be used as the cache directory.
* When a MetadataCache entry for a file is found, we need to reuse the cached file information for reading in the thread.
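For illustration, a minimal usage sketch of the caching function described above; the bucket and object names are placeholders, and `read_parquet` is just one way pg_duckdb reads remote files:

```sql
-- Explicitly populate the local cache for a remote Parquet object
-- (the second argument is 'parquet' or 'csv').
SELECT duckdb.cache('s3://my-bucket/events.parquet', 'parquet');

-- A plain SELECT does not trigger caching, but once the object is cached,
-- reads of the same path can be served from the local copy.
SELECT count(*) FROM read_parquet('s3://my-bucket/events.parquet');
```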
Force-pushed from df9d073 to 75e0018.
We had implemented our own HTTP caching in #88 when nothing like that existed yet. Now there are two different projects implementing something like this in a better way than we did, so this PR removes our custom caching code from our codebase. Instead, in a future release we'll make sure that pg_duckdb works well with at least one of these other caching solutions. For reference, the two projects that are currently implementing caching are:

1. An official in-memory cache by DuckDB, which will eventually be improved to also support on-disk caching: duckdb/duckdb#16463
2. A community-provided extension called `cache_httpfs`: https://github.com/dentiny/duck-read-cache-fs

---------

Co-authored-by: Yves <yves@motherduck.com>