How to connect to multiple S3 buckets? #4460
-
This works great, but is it possible to pass them as part of the read_csv_auto() function, or in some other way?
-
Hi @cookiejest, we currently don't support having different sets of credentials at the same time for different files. To fetch data with different credentials, you will need to switch the credentials between queries.
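For example, something along these lines, where the bucket names, keys, and file paths are placeholders:

```sql
-- Credentials for the first bucket
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'KEY_FOR_BUCKET_A';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_A';
SELECT count(*) FROM read_csv_auto('s3://bucket-a/data.csv');

-- Switch to the second bucket's credentials before the next query
SET s3_access_key_id = 'KEY_FOR_BUCKET_B';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_B';
SELECT count(*) FROM read_csv_auto('s3://bucket-b/data.csv');
```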
You can also copy (parts of) a remote table, after which you can query it without credentials.
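A minimal sketch of the copy approach (the table name and S3 path are placeholders):

```sql
-- Pull the remote data once, while the bucket's credentials are set...
CREATE TABLE remote_copy AS
    SELECT * FROM read_csv_auto('s3://bucket-b/data.csv');

-- ...then later queries on the local copy no longer need those credentials.
SELECT count(*) FROM remote_copy;
```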
-
OK, I thought about that. I'm new to DuckDB, so what is the performance of this like? Does it essentially copy the whole table into memory by doing a CREATE TABLE command, or can it still work on tables bigger than RAM? Thanks!
-
@cookiejest yes, this does do a copy. You can start DuckDB in two ways: as a transient in-memory database, or as a persistent database by passing a path.
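For example, with the command-line client (file and bucket names are placeholders):

```sql
-- Transient in-memory database: start the CLI with no arguments.
--     $ duckdb
-- Persistent database: pass a path, and the tables you create are stored in that file.
--     $ duckdb my_data.db
-- In a persistent database, a copied table survives restarts and can be re-queried
-- later without any S3 credentials.
CREATE TABLE remote_copy AS
    SELECT * FROM read_csv_auto('s3://bucket-b/data.csv');
```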
-
Running into a similar situation and wondering if this issue could be turned into a feature request for some kind of mechanism to make this a bit more convenient. For example, what about allowing a "compound setting" for an S3 URI following a connection-string-like format (similar to what Arrow supports, i.e. a string looking like s3://[access_key:secret_key@]bucket/path[?region=])? This, in combination with a DuckDB function to read the values of environment variables (a "getenv" utility function), might be one mechanism to facilitate reading from multiple S3 sources.
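Purely to illustrate that proposal (this syntax does not exist in DuckDB today; the keys, buckets, paths, and the id join column are hypothetical placeholders):

```sql
-- Hypothetical connection-string-style S3 URIs carrying their own credentials,
-- so two differently-credentialed buckets could appear in a single query.
SELECT *
FROM read_csv_auto('s3://KEY_A:SECRET_A@bucket-a/data.csv?region=us-east-1') a
JOIN read_csv_auto('s3://KEY_B:SECRET_B@bucket-b/other.csv?region=eu-west-1') b
  USING (id);
```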
-
One of the design goals of DuckDB is to not depend on environment variables, so we would need an alternative method!
-
Can it be done using a for loop, like in PostgreSQL?

BEGIN
-
@Alex-Monahan I'm not suggesting introducing a dependency on environment variables, but I feel it would be great if there was a function that simplified using data stored in an environment variable. There are functions to read data from CSV, JSON, and stdin, so why not allow reading from a system environment variable? There is the "dot command" .shell that can be used already (both in DuckDB and SQLite), for example .shell echo $HOME, which displays the value of the $HOME environment variable. But there is no way, I think, to pick up and use such a value in a query? Or maybe there is already?
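To make the ask concrete, a purely hypothetical sketch of such a getenv() utility; the function does not exist as used here, only the SET option names are real httpfs settings, and the bucket path is a placeholder:

```sql
-- Hypothetical getenv() as proposed above; not an existing DuckDB feature here.
SET s3_access_key_id = getenv('AWS_ACCESS_KEY_ID');
SET s3_secret_access_key = getenv('AWS_SECRET_ACCESS_KEY');
SELECT count(*) FROM read_csv_auto('s3://bucket-a/data.csv');
```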
-
Maybe you can use the shell / CLI and a prepared statement! Maybe you could pass in an environment variable when you execute it. I haven't tested this, though!
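Echoing the "untested" caveat above, a tiny sketch of just the prepared-statement half; the idea that a wrapper shell script splices an environment variable into the EXECUTE call is an assumption, not a built-in feature:

```sql
-- Prepare once; the concrete value is only supplied at EXECUTE time, so whatever
-- launches the DuckDB CLI could substitute an environment variable at that point.
PREPARE show_value AS SELECT 'value passed in: ' || $1 AS result;
EXECUTE show_value('whatever-the-shell-substitutes-here');
```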
-
It would be great if this could support views onto files in S3 too. For example, I am using something like:
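Reconstructed here as a hedged sketch; the view name green comes from the comment, while the bucket, path, and keys are placeholders:

```sql
-- Credentials for the bucket that backs the view
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'KEY_FOR_BUCKET_A';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_A';

-- A view over the remote Parquet files, so queries can just say "green"
CREATE VIEW green AS
    SELECT * FROM read_parquet('s3://bucket-a/green_tripdata/*.parquet');
```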
I can then just make queries against the table green without long, fully qualified URIs, etc. But there's no way to then add in a table from a bucket with different credentials, for example to perform a join with, e.g. using:
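Again a hedged sketch; the second bucket, its keys, and the join columns are placeholders:

```sql
-- Switch to the second bucket's credentials before querying it...
SET s3_access_key_id = 'KEY_FOR_BUCKET_B';
SET s3_secret_access_key = 'SECRET_FOR_BUCKET_B';
SELECT *
FROM green g
JOIN read_parquet('s3://bucket-b/zones.parquet') z
  ON g.zone_id = z.zone_id;
```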
Because of course the new credentials break the access to the green table. I appreciate there is a sort of workaround noted above to copy the table into DuckDB, but this feels like wasted traffic: you may end up pulling only a small amount of data from S3 in the ultimate query.

While I am here: including secrets in the SQL query feels risky to me. You may be providing an end user with the query input, and they may not even be aware of where the data is stored. You would want to make very sure none of this leaked back in any errors, e.g. a SQL parse error or something. Even returning these secrets from duckdb_settings() sounds like it could be risky in certain usages, if a user is running queries against DuckDB.
-
Hi there,
I love DuckDB, it's awesome!!!
Is there a way to add multiple S3 access keys in the httpfs extension, to support pulling data from multiple buckets / S3 environments?
Thanks!