How to connect to multiple S3 buckets? #4460
-
This works great, but is it possible to pass them as part of the read_csv_auto() function, or in some other way?
-
Hi @cookiejest, we currently don't support having different sets of credentials at the same time for different files. To fetch data with different credentials, you will need to switch the credentials between queries.
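For example, something along these lines, where the bucket names, keys, and file paths are placeholders:

```sql
-- Credentials for the first bucket
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'KEY_FOR_BUCKET_A';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_A';
SELECT count(*) FROM read_csv_auto('s3://bucket-a/data.csv');

-- Switch to the second bucket's credentials before the next query
SET s3_access_key_id = 'KEY_FOR_BUCKET_B';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_B';
SELECT count(*) FROM read_csv_auto('s3://bucket-b/data.csv');
```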
You can also copy (parts of) a remote table, after which you can query it without credentials.
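A minimal sketch of the copy approach (the table name and S3 path are placeholders):

```sql
-- Pull the remote data once, while the bucket's credentials are set...
CREATE TABLE remote_copy AS
    SELECT * FROM read_csv_auto('s3://bucket-b/data.csv');

-- ...then later queries on the local copy no longer need those credentials.
SELECT count(*) FROM remote_copy;
```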
-
OK, I thought about that. I'm new to DuckDB, so what is the performance of this like? Does it essentially copy the whole table into memory by doing a CREATE TABLE command, or can it still work on tables bigger than RAM? Thanks!
-
@cookiejest yes, this does do a copy. You can start DuckDB in two ways: as a transient in-memory database, or as a persistent database by passing a path.
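For example, with the command-line client (file and bucket names are placeholders):

```sql
-- Transient in-memory database: start the CLI with no arguments.
--     $ duckdb
-- Persistent database: pass a path, and the tables you create are stored in that file.
--     $ duckdb my_data.db
-- In a persistent database, a copied table survives restarts and can be re-queried
-- later without any S3 credentials.
CREATE TABLE remote_copy AS
    SELECT * FROM read_csv_auto('s3://bucket-b/data.csv');
```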
-
Running into a similar situation and wondering if this issue could be turned into a feature request for some kind of mechanism to make this a bit more convenient. For example, what about allowing a "compound setting" for an S3 URI following a connection-string-like format (similar to what Arrow supports, i.e. a string looking like s3://[access_key:secret_key@]bucket/path[?region=])? This, in combination with a DuckDB function to read the values of environment variables (a "getenv" utility function), might be one mechanism to facilitate reading from multiple S3 sources.
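Purely to illustrate that proposal (this syntax does not exist in DuckDB today; the keys, buckets, paths, and the id join column are hypothetical placeholders):

```sql
-- Hypothetical connection-string-style S3 URIs carrying their own credentials,
-- so two differently-credentialed buckets could appear in a single query.
SELECT *
FROM read_csv_auto('s3://KEY_A:SECRET_A@bucket-a/data.csv?region=us-east-1') a
JOIN read_csv_auto('s3://KEY_B:SECRET_B@bucket-b/other.csv?region=eu-west-1') b
  USING (id);
```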
-
One of the design goals of DuckDB is to not depend on environment variables, so we would need an alternative method!
-
Can it be done using a for loop, like in PostgreSQL?

BEGIN
-
@Alex-Monahan I'm not suggesting introducing a dependency on environment variables, but I feel it would be great if there was a function that simplified using data stored in an environment variable. There are functions to read data from CSV, JSON, and stdin, so why not allow reading from a system environment variable? There is the "dot command" .shell that can be used already (both in DuckDB and SQLite), for example .shell echo $HOME, which displays the value of the $HOME environment variable. But there is no way, I think, to pick up and use such a value in a query? Or maybe there is already?
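To make the ask concrete, a purely hypothetical sketch of such a getenv() utility; the function does not exist as used here, only the SET option names are real httpfs settings, and the bucket path is a placeholder:

```sql
-- Hypothetical getenv() as proposed above; not an existing DuckDB feature here.
SET s3_access_key_id = getenv('AWS_ACCESS_KEY_ID');
SET s3_secret_access_key = getenv('AWS_SECRET_ACCESS_KEY');
SELECT count(*) FROM read_csv_auto('s3://bucket-a/data.csv');
```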
-
Maybe you can use the shell / CLI and a prepared statement! Maybe you could pass in an environment variable when you execute it. I haven't tested this, though!
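Echoing the "untested" caveat above, a tiny sketch of just the prepared-statement half; the idea that a wrapper shell script splices an environment variable into the EXECUTE call is an assumption, not a built-in feature:

```sql
-- Prepare once; the concrete value is only supplied at EXECUTE time, so whatever
-- launches the DuckDB CLI could substitute an environment variable at that point.
PREPARE show_value AS SELECT 'value passed in: ' || $1 AS result;
EXECUTE show_value('whatever-the-shell-substitutes-here');
```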
-
It would be great if this could support views onto files in S3 too. For example, I am using something like:
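Reconstructed here as a hedged sketch; the view name green comes from the comment, while the bucket, path, and keys are placeholders:

```sql
-- Credentials for the bucket that backs the view
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'KEY_FOR_BUCKET_A';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_A';

-- A view over the remote Parquet files, so queries can just say "green"
CREATE VIEW green AS
    SELECT * FROM read_parquet('s3://bucket-a/green_tripdata/*.parquet');
```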
I can then just make queries against the table green without long, fully qualified URIs, etc. But there's no way to then add in a table from a bucket with different credentials, for example to perform a join with, e.g. using:
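Again a hedged sketch; the second bucket, its keys, and the join columns are placeholders:

```sql
-- Switch to the second bucket's credentials before querying it...
SET s3_access_key_id = 'KEY_FOR_BUCKET_B';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_B';
SELECT *
FROM green g
JOIN read_parquet('s3://bucket-b/zones.parquet') z
  ON g.zone_id = z.zone_id;
```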
Because of course the new credentials break the access to the green table. I appreciate there is a sort of workaround noted above to copy the table into DuckDB, but this feels like wasted traffic: you may end up pulling only a small amount of data from S3 in the ultimate query.

While I am here: including secrets in the SQL query feels risky to me. You may be providing an end user with the query input, and they may not even be aware of where the data is stored. You would want to make very sure none of this leaked back in any errors, e.g. a SQL parse error or something. Even returning these secrets from duckdb_settings() sounds like it could be risky in certain usages, if a user is running queries against DuckDB.
-
Hi there,
I love DuckDB, it's awesome!!!
Is there a way to add multiple S3 access keys in the httpfs extension, to support pulling data from multiple buckets / S3 environments?
Thanks!