Skip to content

Conversation

quentingodeau
Copy link
Contributor

The file opener allow to access some client context data as secret and variables not exposing them in some method make them no implementable in extensions.

The file opener allow to access some client context data as secret and variables
not exposing them in some method make them no implementable in extensions.
@quentingodeau
Copy link
Contributor Author

Hello!

Let me explain a bit this PR. I'm working on the integration of the write capability in the DuckDB azure extension and will performing my initial test I discover that it was missing the file opener in some on the interfaces, and it is through it that I'm able to access the the current context and therefore secret, variables.
Now the issue with this PR: it break the API...
So the point I would like to discuss see with you:
Is it ok to break the API here?
Or should I should add new methods that redirect by default to the previous one?
Also should I add the FileOpener to the bool IsPipe(const string &filename) method?
If we go this the API breaking changes shouldn't we change the type of FileOpener* to optional_ptr<FileOpener> like it's already done in GetHomeDirectory(optional_ptr<FileOpener> opener)?

@github-actions github-actions bot marked this pull request as draft March 19, 2024 21:28
@samansmink
Copy link
Contributor

samansmink commented Mar 20, 2024

Hey @quentingodeau! Yea I think breaking the API here may be justified. My first Intuition would be to do the whole API and indeed replace the raw pointers with optional pointers. However for this I would prefer to have this merged in feature and not main, which would mean that we can not get the azure writes working for duckdb v1.0.0 but at the earliest v1.1.0.

The alternative would be to sneak this version of the PR in into main with the idea that the API methods that we break here are less used.

Another possibility is to first add the non-breaking API to allow azure writes to work in v1.0.0, then refactor the api afterwards. This is a bit more work but perhaps the cleanest.

Im a bit undecided here, @Mytherin any opinion on the matter?

@Mytherin
Copy link
Collaborator

I do have some opinions here yeah, and have tried to tackle this myself in the past. I think we should preserve the old API methods and add a second overload that has an optional_ptr< FileOpener>.

TryOpenFile, TryCreateDirectory, etc

I think we also need to look at how these APIs are used at a more fundamental level. If you think about e.g. FileExists or DirectoryExists - they are never used in isolation. They are always used as part of an operation. For example:

if (!DirectoryExists(dir)) {
   CreateDirectory(dir);
}

This is not great because (1) it is two file system operations, and (2) there is an obvious race condition here. A better API would be something like:

TryCreateDirectory(dir);

This is also something that I tried to tackle in the above changeset (although it has been a while). Ideally we remove all usages of DirectoryExists and FileExists and replace them with these kinds of APIs - at least in the main system (tests are less important).

OpenerFileSystem

One change I also made in that PR is that passing in an opener manually at all should not be necessary. After all, if you think about how we usually use the file system, it works like this:

auto &fs = FileSystem::GetFileSystem(context);
auto file = fs.OpenFile(..., FileOpener::GetOpener(context));

Now what I have tried in the above changeset is made FIleSystem::GetFileSystem(context) return a wrapper file system (OpenerFileSystem ) that forwards the relevant opener to all system calls. This also simplifies all usages of the file system in general and makes it so forgetting to pass in an opener is no longer a possibility.

GetFileType

Finally, FileExists and DirectoryExists are somewhat inefficient - at least in a local file system. What the syscalls actually give you is a file type. So it makes sense to have a GetFileType(...) method that is used by FileExists and DirectoryExists. There are a number of places where we call invoke these syscalls multiple times (FileExists, IsPipe, etc) where a single GetFileType would have been sufficient.

@Maxxen
Copy link
Member

Maxxen commented Mar 20, 2024

Finally, FileExists and DirectoryExists are somewhat inefficient - at least in a local file system. What the syscalls actually give you is a file type. So it makes sense to have a GetFileType(...) method that is used by FileExists and DirectoryExists. There are a number of places where we call invoke these syscalls multiple times (FileExists, IsPipe, etc) where a single GetFileType would have been sufficient.

I have been struggling with this a ton over in spatial trying to provide file handler hooks to GDAL and emulate stat(). Having a proper way to get the file type and/or size without opening (and without throwing) would be great.

@quentingodeau
Copy link
Contributor Author

Ok I have read all your comments I will make the changes.
One question/proposition, should we split the two APIs? I mean the one with the fileopener and the other without? And if yes does that mean that the one with the fileopener can take it as a reference instead of a pointer or is there some cases where the fileopener is optional (I do not see the case but I'm maybe missing something here)

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revisited my old PR and I did actually merge the OpenerFileSystem into main in #7423. I had forgotten about that. So this should greatly simplify the passing of the opener - the OpenerFileSystem should do this automatically (as I see already happens in the PR).

I would say let's add a optional_ptr<FileOpener> opener = nullptr to all these APIs and just break compatibility. The actual code using the file system should remain unchanged, and all that needs to happen is the file systems need to be updated.

@@ -125,7 +126,8 @@ BoundStatement Binder::BindCopyTo(CopyStatement &stmt) {
if (is_remote_file) {
use_tmp_file = false;
} else {
bool is_file_and_exists = config.file_system->FileExists(stmt.info->file_path);
bool is_file_and_exists =
config.file_system->FileExists(stmt.info->file_path, ClientData::Get(context).file_opener.get());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this use FileSystem::GetFileSystem(context) instead? That way we don't need to pass a FileOpener explicitly. It's unclear to me why this is using config.file_system in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok will make the change here :)

void RemoveDirectory(const string &directory) override {
return GetFileSystem().RemoveDirectory(directory);
void RemoveDirectory(const string &directory, FileOpener *opener = nullptr) override {
return GetFileSystem().RemoveDirectory(directory, GetOpener().get());
Copy link
Collaborator

@Mytherin Mytherin Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these all throw an exception if an opener is passed in, similar to the other code? Otherwise the opener parameter gets silently ignored which could be confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does a D_ASSERT from src/include/duckdb/common/assert.hpp is ok instead of an exception ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine as well

@quentingodeau
Copy link
Contributor Author

@Mytherin out of curiosity does the file opener can be null ?

@Mytherin
Copy link
Collaborator

@Mytherin out of curiosity does the file opener can be null ?

It is possible in some tests, but it shouldn't be in most code paths

@Mytherin
Copy link
Collaborator

After looking at this some more I've decided I'm going to pick up my old TryOpenFile PR. Let me know if you want me to take over this PR and add optional_ptr<FileOpener> to these APIs while I'm there, or if you want to finish this PR yourself.

@quentingodeau
Copy link
Contributor Author

I would like to work on it if that of with you (it's a way for me to get a bit more familiar with the duckdb code base). to give you an insight on what i'm doing:

  • [add] TryOpenFile
  • [add] optional_ptr<FileOpener>
  • [rename&adapt] CreateDirectory =>CreateDirectoryIfNotExists
  • [rename&adapt] RemoveDirectory => RemoveDirectoryIfExists
  • [rename&adapt] RemoveFile => RemoveFileIfExists
  • [remove] IsPipe
  • [add] GetFileType
  • [remove] GetHomeDirectory() and replace by GetHomeDirectory(optional_ptr<FileOpener> opener = {})

I didn't go with the TryCreateDirectory because it fill like it's ok to failed and there will be no error. But if you feel like the names I choose do not fit, do not hesitate to tell me I will make the changes.

@Mytherin
Copy link
Collaborator

Thanks - your contribution is much appreciated, but the FileSystem abstraction is very central, and already used in many different locations (httpfs/s3, fsspec, spatial & azure). I think we do need to break the abstraction to fix the problems that you found here, and the problems that I mentioned, but this is a big undertaking that needs to touch many different areas of the code. If you want to finish the PR here that would be excellent, but I think it's a better idea if I do the actual (larger) refactor that I described myself.

@quentingodeau
Copy link
Contributor Author

quentingodeau commented Mar 21, 2024

Ok I understand in that case I will drop it and let you manage it.

@Mytherin Please when you do the PR for this can you add my name so I can see the changes and sync my duckdb_azure branch.

@Mytherin
Copy link
Collaborator

Thanks for understanding - I'll keep you posted!

@Mytherin
Copy link
Collaborator

I have an initial PR open here - #11297

This PR adds the TryOpenFile (although I turned it into a flag instead of a separate method after further consideration). I'm working on a second PR that adds the FileOpener to the various interfaces like in this PR, that also enables attaching DuckDB database files over S3 (and likely will enable attaching over Azure as well with some minor modifications).

@MiguelElGallo
Copy link

@quentingodeau Is this what you needed to please continue with duckdb/duckdb-azure#44 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants