Skip to content

Conversation

fgregg
Copy link
Contributor

@fgregg fgregg commented Oct 27, 2022

Opening db files in immutable mode sometimes leads to the file being mutated, which causes duplication in the docker image layers: see #1836, #1480

That this happens in "immutable" mode is surprising, because the sqlite docs say that setting this should open the database as read only.

https://www.sqlite.org/c3ref/open.html

immutable: The immutable parameter is a boolean query parameter that indicates that the database file is stored on read-only media. When immutable is set, SQLite assumes that the database file cannot be changed, even by a process with higher privilege, and so the database is opened read-only and all locking and change detection is disabled. Caution: Setting the immutable property on a database file that does in fact change can result in incorrect query results and/or SQLITE_CORRUPT errors. See also: SQLITE_IOCAP_IMMUTABLE.

Perhaps this is a bug in sqlite?


📚 Documentation preview 📚: https://datasette--1870.org.readthedocs.build/en/1870/

@fgregg
Copy link
Contributor Author

fgregg commented Oct 27, 2022

@codecov
Copy link

codecov bot commented Oct 27, 2022

Codecov Report

Base: 92.55% // Head: 92.55% // No change to project coverage 👍

Coverage data is based on head (4faa4fd) compared to base (bf00b0b).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1870   +/-   ##
=======================================
  Coverage   92.55%   92.55%           
=======================================
  Files          35       35           
  Lines        4432     4432           
=======================================
  Hits         4102     4102           
  Misses        330      330           
Impacted Files Coverage Δ
datasette/app.py 94.30% <ø> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@fgregg
Copy link
Contributor Author

fgregg commented Oct 28, 2022

as far as i can tell, this is where the "immutable" argument is used in sqlite:

      pPager->noLock = sqlite3_uri_boolean(pPager->zFilename, "nolock", 0);
      if( (iDc & SQLITE_IOCAP_IMMUTABLE)!=0
       || sqlite3_uri_boolean(pPager->zFilename, "immutable", 0) ){
          vfsFlags |= SQLITE_OPEN_READONLY;
          goto act_like_temp_file;
      }

so it does set the read only flag, but then has a goto.

@simonw
Copy link
Owner

simonw commented Oct 29, 2022

Just saw your comment here: #1836 (comment)

when you are running from docker, you always will want to run as mode=ro because the same thing that is causing duplication in the inspect layer will cause duplication in the final container read/write layer when datasette serve runs.

I don't understand this. My mental model of how Docker works is that the image itself is created using docker build... but then when the image runs later on (docker run) the image itself isn't touched at all.

Are you saying that I can build a container, but then when I run it and it does datasette serve -i data.db ... it will somehow modify the image, or create a new modified filesystem layer in the runtime environment, as a result of running that serve command?

@simonw
Copy link
Owner

simonw commented Oct 29, 2022

Saw your comment here too: #1480 (comment)

switching from immutable=1 to mode=ro completely addressed this. see #1836 (comment) for details.

So maybe we need a special case for containers that are intended to be run using Docker - the ones produced by datasette package and datasette publish cloudrun? Those are cases where the -i option should actually be opened in read-only mode, not immutable mode.

Maybe a datasette serve --irw data.db option for opening a file in immutable-but-actually-read-only mode? Bit ugly though.

I should run some benchmarks to figure out if immutable really does offer significant performance benefits.

@fgregg
Copy link
Contributor Author

fgregg commented Oct 29, 2022

Are you saying that I can build a container, but then when I run it and it does datasette serve -i data.db ... it will somehow modify the image, or create a new modified filesystem layer in the runtime environment, as a result of running that serve command?

Somehow, datasette serve -i data.db will lead to the data.db being modified, which will trigger a copy-on-write of data.db into the read-write layer of the container.

I don't understand how that happens.

it kind of feels like a bug in sqlite, but i can't quite follow the sqlite code.

@jdangerx
Copy link

jdangerx commented Oct 3, 2023

Hello! Resurrecting this issue since we're running into something similar with data.catalyst.coop as our database files have ballooned up to several GB. Our Cloud Run revisions now require huge amounts of RAM to start up without receiving a SIGBUS.

I'd love to see this fix merged in. It sounds like we want to make the immutable/read-only mode decision more flexible before doing so, so that we can use ro in Docker and immutable outside. If that sounds right, I'm happy to take a crack at adding that as a command-line flag or something that gets set automatically based on the expected execution environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants