@geky geky commented May 29, 2025

                  ####                                 
               ######## #####    #####                 
             ################# ########  ####          
       #### ### ############################## ####    
     #############-#### ##_#########################   
    ########'#########_\'|######_####################  
     ##### ##########_.. \ .'#_._#### #/#_##########   
   ################_    \ v .'_____#_.'.'_ ########### 
  ################_ "'. |  .""  _________ '"-##########
  ########"#####..--.  /    .-"'|  | |   '"-####'######
    #############    \     /  .---.|_|_    ########### 
     ###########     |    /  -|        |-   ########   
                     )    |  -|littlefs|-              
                     |    |  -|   v3   |-              
                     |    |   '--------'               
                     |   ||     ' ' '                  
       -------------/-/---\\----------------------     

Note: v3-alpha discussion (#1114)

Unfortunately GitHub made a complete mess of the PR discussion. To try to salvage things, please use #1114 for new comments. Feedback/criticism are welcome and immensely important at this stage.

Table of contents ^

  1. Hello!
  2. Wait, a disk breaking change?
  3. What's new?
    1. Implemented
      1. Efficient metadata compaction
      2. Efficient random writes
      3. Better logging: No more sync-padding issues
      4. Efficient inline files, no more RAM constraints
      5. Independent file caches
      6. Easier logging APIs: lfs3_file_fruncate
      7. Sparse files
      8. Efficient file name lookup
      9. A simpler/more robust metadata tree
      10. A well-defined sync model
      11. Stickynotes, no more 0-sized files
      12. A new and improved compat flag system
      13. Error detection! - Global-checksums
      14. Better traversal APIs
      15. Incremental GC
      16. Better recovery from runtime errors
      17. Standard custom attributes
      18. More tests!
      19. Simple key-value APIs
    2. Planned
      1. Efficient block allocation, via optional on-disk block-map (bmap)
      2. Bad block tracking
      3. Pre-erased block tracking
      4. Error correction! - Metadata redundancy
      5. Error correction! - Data redundancy
      6. Transparent block deduplication
    3. Stretch goals
      1. lfs3_migrate for v2->v3 migration
      2. 16-bit and 64-bit variants
      3. Config API rework
      4. Block device API rework
      5. Custom attr API rework
      6. Alternative (cheaper) write-strategies
      7. Advanced file tree operations
      8. Advanced file copy-on-write operations
      9. Reserved blocks to prevent CoW lockups
      10. Metadata checks to prevent metadata lockups
      11. Integrated block-level ECC
      12. Disk-level RAID
    4. Out-of-scope (for now)
      1. Alternative checksums
      2. Feature-limited configurations for smaller code/stack sizes
      3. lfs3_file_openat for dir-relative APIs
      4. lfs3_file_openn for non-null-terminated-string APIs
      5. Transparent compression
      6. Filesystem shrinking
      7. High-level caches
      8. Symbolic links
      9. 100% line/branch coverage
  4. Code/stack size
    1. Runtime error recovery
    2. B-tree flexibility
    3. Traversal inversion
  5. Benchmarks
    1. Simulated benchmarks
      1. Linear writes
      2. Random writes
      3. Logging
  6. Funding
  7. Next steps

Hello! ^

Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.

There are still several chunks of planned work left, but now that this branch has reached on-disk feature parity with v2, there's nothing really stopping it from being merged eventually.

So I figured it's a good time to start calling this v3, and put together a public roadmap.

NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE

Here's a quick TODO list of planned work before stabilization. More details below:

  • Test framework rework (merged, mostly)
  • Rbyds
  • B-trees
  • M-tree
  • B-shrubs
  • Fruncate
  • Sync model rework
  • Stickynotes
  • Traversal API rework
  • Incremental GC
  • Global-checksums
  • Key-value APIs
  • On-disk block-map (in progress)
  • Metadata redundancy
  • Data redundancy
  • Document, document, document
  • Stretch goals

This work may continue to break the on-disk format.

That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.

To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When it is eventually released, v3 will reject this version and fail to mount.

Unfortunately, the API will be under heavy flux during this period.

A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.

Wait, a disk breaking change? ^

Yes. v3 breaks disk compatibility from v2.

I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:

  1. Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.

  2. Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.

    Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.

So I think it's best for both littlefs as a project and long-term users to break things here.

Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.

What's new? ^

Ok, with that out of the way, what does breaking everything actually get us?

Implemented: ^

  • Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^

    v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.

    This speeds up most metadata operations, including metadata lookup ( $O(n) \rightarrow O(\log n)$ ), and, critically, metadata compaction ( $O(n^2) \rightarrow O(n \log n)$ ).

    This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.

  • Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^

    A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.

    This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.

    For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ( $O(n) \rightarrow O(\log n)$ ).

  • Better logging: No more sync-padding issues ^

    v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.

    In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.

  • Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^

    In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.

    Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.

  • Independent file caches ^

    littlefs's pcache, rcache, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.

  • Easier logging APIs: lfs3_file_fruncate ^

    Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new lfs3_file_fruncate API.

    Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:

    lfs3_file_write(&lfs, &file, entry, entry_size) => entry_size;
    lfs3_file_fruncate(&lfs, &file, log_size) => 0;
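
    For a fuller picture, an "append to a bounded log" helper might look something like the sketch below. The types and error conventions are assumed to mirror the v2-style API and may not match the final v3 API exactly:

    // append an entry to a log, keeping only the most recent log_size bytes
    static int log_append(lfs3_t *lfs, lfs3_file_t *file,
            const void *entry, lfs3_size_t entry_size, lfs3_off_t log_size) {
        // append the new entry to the end of the log
        lfs3_ssize_t res = lfs3_file_write(lfs, file, entry, entry_size);
        if (res < 0) {
            return (int)res;
        }

        // drop the oldest entries from the front so the log stays bounded
        int err = lfs3_file_fruncate(lfs, file, log_size);
        if (err) {
            return err;
        }

        // make the new entries persistent
        return lfs3_file_sync(lfs, file);
    }
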
  • Sparse files ^

    Another advantage of adopting B-trees: littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.

    Currently this is limited to a couple operations:

    • lfs3_file_truncate
    • lfs3_file_fruncate
    • lfs3_file_seek + lfs3_file_write past the end of the file

    But more advanced hole operations may be added in the future.
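
    As a quick example, seeking past the end of a file and writing creates an implicit hole (a sketch; the open/seek flag names are assumed to follow the v2 naming):

    // create a ~1 MiB file where only the last 16 bytes take real disk space
    lfs3_file_open(&lfs, &file, "sparse", LFS3_O_WRONLY | LFS3_O_CREAT);
    lfs3_file_seek(&lfs, &file, (1024*1024) - 16, LFS3_SEEK_SET);
    lfs3_file_write(&lfs, &file, "this is the end!", 16);
    // everything before the written data is an implied run of zeros (a hole)
    lfs3_file_close(&lfs, &file);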

  • Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^

    littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.

  • A simpler/more robust metadata tree ^

    As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.

    I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.

    In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.

  • A well-defined sync model ^

    One interesting thing about littlefs, it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.

    To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.

    I think this discussion captures most of the idea, but for a high-level overview:

    1. Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.

    2. Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.

    3. Files can be desynced, either explicitly via lfs3_file_desync, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.

    4. Calling lfs3_file_sync on a desynced file will atomically update the on-disk state, any other in-sync file handles, and mark the file as in-sync again.

    5. Calling lfs3_file_resync on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
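
    In code, the intended semantics look roughly like this (a sketch of behavior, not exact signatures):

    // writes are CoW: file_b, open on the same file, sees no change yet
    lfs3_file_write(&lfs, &file_a, buffer, size);

    // sync atomically updates the disk and any other in-sync handles (file_b)
    lfs3_file_sync(&lfs, &file_a);

    // desynced handles opt out of sync broadcasts...
    lfs3_file_desync(&lfs, &file_b);

    // ...until resync discards their contents and catches them back up
    lfs3_file_resync(&lfs, &file_b);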

  • Stickynotes, no more 0-sized files ^

    As an extension of littlefs's new sync model, v3 introduces a new file type: LFS3_TYPE_STICKYNOTE.

    A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.

    This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.

  • A new and improved compat flag system ^

    v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.

    These are probably familiar to users of Linux filesystems, though I've given them slightly different names:

    • rcompat flags - Must understand to read the filesystem (incompat_flags)
    • wcompat flags - Must understand to write to the filesystem (ro_compat_flags)
    • ocompat flags - No understanding necessary (compat_flags)

    This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.
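
    The intended mount-time behavior, roughly speaking (the flag/error names here are hypothetical, purely to illustrate the three categories):

    // rcompat: any unknown flag means we can't safely read, refuse to mount
    if (rcompat & ~LFS3_RCOMPAT_SUPPORTED) {
        return LFS3_ERR_INVAL;
    }
    // wcompat: any unknown flag means we can read, but must not write
    if (wcompat & ~LFS3_WCOMPAT_SUPPORTED) {
        rdonly = true;
    }
    // ocompat: unknown flags are simply ignored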

  • Error detection! - Global-checksums ^

    v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.

    With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via lfs3_fs_cksum. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.

    Funny thing about checksums. It's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?

    This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum-checking APIs (probably too many if I'm honest):

    • LFS3_M_CKMETA/CKDATA - Check checksums during mount
    • LFS3_O_CKMETA/CKDATA - Check checksums during file open
    • lfs3_fs_ckmeta/ckdata - Explicitly check all checksums in the filesystem
    • lfs3_file_ckmeta/ckdata - Explicitly check a file's checksums
    • LFS3_T_CKMETA/CKDATA - Check checksums incrementally during a traversal
    • LFS3_GC_CKMETA/CKDATA - Check checksums during GC operations
    • LFS3_M_CKPROGS - Closed checking of data during progs
    • LFS3_M_CKFETCHES - Optimistic (not closed) checking of data during fetches
    • LFS3_M_CKREADS (planned) - Closed checking of data during reads
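
    As an example of how these might fit together, a paranoid mount plus an occasional explicit check could look like the following sketch (it assumes v3's mount accepts the LFS3_M_* flags and that the lfs3_fs_ck* functions only take the filesystem handle):

    // verify metadata checksums while mounting
    int err = lfs3_mount(&lfs, LFS3_M_CKMETA, &cfg);
    if (err) {
        // corrupt or unmountable filesystem
    }

    // later, explicitly verify every checksum in the filesystem
    err = lfs3_fs_ckmeta(&lfs);
    err = lfs3_fs_ckdata(&lfs);
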
  • Better traversal APIs ^

    The traversal API has been completely reworked to be easier to use (both externally and internally).

    No more callback needed, blocks can now be iterated over via the dir-like lfs3_trv_read function.

    Traversals can also perform janitorial work and check checksums now, based on the flags provided to lfs3_trv_open.
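
    A sketch of what this might look like in use (the lfs3_trv_* signatures, return convention, and info struct here are assumptions based on the dir-like description):

    lfs3_trv_t trv;
    int err = lfs3_trv_open(&lfs, &trv, LFS3_T_CKMETA);
    if (err) {
        return err;
    }

    // iterate over in-use blocks, dir-read style
    struct lfs3_tinfo tinfo;  // hypothetical per-block info struct
    while ((err = lfs3_trv_read(&lfs, &trv, &tinfo)) > 0) {
        // tinfo describes one in-use block
    }

    lfs3_trv_close(&lfs, &trv);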

  • Incremental GC ^

    GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by lfs3_fs_gc, cfg.gc_flags, and cfg.gc_steps.

    Internally, this just shoves one of the new traversal objects into lfs3_t. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.

    However, this does add a significant chunk of RAM to lfs3_t, so GC is now an opt-in feature behind the LFS3_GC ifdef.
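
    A sketch of the intended usage (the exact config fields and flag values are assumptions):

    // opt in to incremental GC work in the config
    const struct lfs3_cfg cfg = {
        // ...
        .gc_flags = LFS3_GC_CKMETA,  // e.g. also verify metadata checksums
        .gc_steps = 16,              // rough amount of work per call
    };

    // then, whenever the system has some idle time
    int err = lfs3_fs_gc(&lfs);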

  • Better recovery from runtime errors ^

    Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.

    Most in-RAM filesystem state should now revert to the last known-good state on error.

    The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with lfs3_file_resync, I figured this cost just isn't worth it. So file data remains undefined after an error.

    In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.

  • Standard custom attributes ^

    Breaking disk gives us a chance to reserve attributes 0x80-0xbf for future standard custom attributes:

    • 0x00-0x7f - Free for user-attributes (uattr)
    • 0x80-0xbf - Reserved for standard-attributes (sattr)
    • 0xc0-0xff - Encouraged for system-attributes (yattr)

    In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.

    v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.
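
    In practice, applications just pick their uattr types from the user range, for example:

    // hypothetical application-defined attributes in the 0x00-0x7f uattr range
    #define MYAPP_ATTR_VERSION 0x01
    #define MYAPP_ATTR_MTIME   0x02
    // 0x80-0xbf is reserved for future standard attrs (sattrs),
    // 0xc0-0xff is encouraged for system/integration attrs (yattrs)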

  • More tests! ^

    v3 comes with a couple more tests than v2 (+~6812.2%):

         suites  cases  permutations ;     pls   runtime
    v2:      22    198         11641 ;   28741    54.16s
    v3:      23    784        804655 ; 2228513  1323.18s
    

    You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.

    The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.

  • Simple key-value APIs ^

    v3 includes a couple easy-to-use key-value APIs:

    • lfs3_get - Get the contents of a file
    • lfs3_size - Get the size of a file
    • lfs3_set - Set the contents of a file
    • lfs3_remove - Remove a file (this one already exists)

    This API is limited to files that fit in RAM, but if it fits your use case, you can disable the full file API with LFS3_KVONLY to save some code.

    If your filesystem fits in only 2 blocks, you can also define LFS3_2BONLY to save more code.

    These can be useful for creating small key-value stores on systems that already use littlefs for other storage.
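
    Usage is about as simple as it gets (a sketch; exact signatures are still subject to the API flux mentioned above):

    // set, get, and remove a small value by path
    lfs3_set(&lfs, "wifi/ssid", "mynetwork", strlen("mynetwork"));

    char buffer[64];
    lfs3_ssize_t size = lfs3_get(&lfs, "wifi/ssid", buffer, sizeof(buffer));

    lfs3_remove(&lfs, "wifi/ssid");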

Planned: ^

  • Efficient block allocation, via optional on-disk block-map (bmap) ^

    The one remaining bottleneck in v3 is block allocation. This is a tricky problem for littlefs (and any CoW/CoBW filesystem), because we don't actually know when a block becomes free.

    This is in-progress work, but the solution I'm currently looking at involves 1. adding an optional on-disk block map (bmap) stored in gstate, and 2. updating it via tree diffing on sync. In theory this will drop huge file writes: $O(n^2 \log n) \rightarrow O(n \log_b^2 n)$

    There is also the option of using the bmap as a simple cache, which doesn't avoid the filesystem-wide scan but at least eliminates the RAM constraint of the lookahead buffer.

    As a plus, we should be able to leverage the self-counting property of B-trees to make the on-disk bmap compressible.

  • Bad block tracking ^

    This is a much requested feature, and adding the optional on-disk bmap finally gives us a place to track bad blocks.

  • Pre-erased block tracking ^

    Just like bad-blocks, the optional on-disk bmap gives us a place to track pre-erased blocks. Well, at least in theory.

    In practice it's a bit more of a nightmare. To avoid multiple progs, we need to mark erased blocks as unerased before progging. This introduces an unbounded number of catch-22s when trying to update the bmap itself.

    Fortunately, if instead we store a simple counter in the bmap's gstate, we can resolve things at the mrootanchor worst case.

  • Error correction! - Metadata redundancy ^

    Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.

    But I think there's potential for cross-block error-correction in addition to the in-block error-correction.

    The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.

    But we already need two blocks for every mdir, and one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.

  • Error correction! - Data redundancy ^

    For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.

  • Transparent block deduplication ^

    This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.

    If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?

    Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.

Stretch goals: ^

These may or may not be included in v3, depending on time and funding:

  • lfs3_migrate for v2->v3 migration ^

  • 16-bit and 64-bit variants ^

  • Config API rework ^

  • Block device API rework ^

  • Custom attr API rework ^

  • Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^

  • Advanced file tree operations (lfs3_file_punchhole, lfs3_file_insertrange, lfs3_file_collapserange, LFS3_SEEK_DATA, LFS3_SEEK_HOLE) ^

  • Advanced file copy-on-write operations (shallow lfs3_cowcopy + opportunistic lfs3_copy) ^

  • Reserved blocks to prevent CoW lockups ^

  • Metadata checks to prevent metadata lockups ^

  • Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^

  • Disk-level RAID (this is just data redund + a disk aware block allocator) ^

Out-of-scope (for now): ^

If we don't stop somewhere, v3 will never be released. But these may be added in the future:

  • Alternative checksums (crc16, crc64, sha256, etc) ^

  • Feature-limited configurations for smaller code/stack sizes (LFS3_NO_DIRS, LFS3_KV, LFS3_2BLOCK, etc) ^

  • lfs3_file_openat for dir-relative APIs ^

  • lfs3_file_openn for non-null-terminated-string APIs ^

  • Transparent compression ^

  • Filesystem shrinking ^

  • High-level caches (block cache, mdir cache, btree leaf cache, etc) ^

  • Symbolic links ^

  • 100% line/branch coverage ^

Code/stack size ^


littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)


littlefs v2 and v3 rdonly, 1 pixel ~= 1 byte of code, click for a larger interactive codemap (commit)

Unfortunately, v3 is a little less little than v2:

            code            stack           ctx
v2:        17144             1440           580
v3:        37352 (+117.9%)   2280 (+58.3%)  636 (+9.7%)
            code            stack           ctx
v2-rdonly:  6270              448           580
v3-rdonly: 10616 (+69.3%)     808 (+80.4%)  508 (-12.4%)

On one hand, yes, more features generally means more code.

And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.

But I think it's worth discussing some of the other reasons for the code/stack increase:

  1. Runtime error recovery ^

    Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.

    But I think this is necessary for the maturing of littlefs as a filesystem.

    Maybe it will make sense to add a sort of LFS3_GLASS mode in the future, but this is out-of-scope for now.

  2. B-tree flexibility ^

    The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.

    B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.

    The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.

    The good news: The new B-tree files are extremely flexible.

    There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.

  3. Traversal inversion ^

    Inverting the traversal, i.e. moving from a callback to an incremental state machine, adds both code and stack because 1. all of the previous on-stack state needs to be tracked explicitly, and 2. we now need to worry about what happens if the filesystem is modified mid-traversal.

    In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.

Benchmarks ^

A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.

First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.

Simulated benchmarks ^

To test the math behind v3, I've put together some preliminary simulated benchmarks.

Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.

To find an estimate of runtime, I first measured the number of bytes read, progged, and erased, and then scaled based on values found in relevant datasheets. The options here were a bit limited, but Winbond fortunately provides runtime estimates in the datasheets on their website:

  • NOR flash - w25q64jv

  • NAND flash - w25n01gv

  • SD/eMMC - Also w25n01gv, assuming a perfect FTL

    I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.

These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.
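
In other words, the simulated runtime is estimated roughly as $t \approx b_{read} \cdot t_{read} + b_{prog} \cdot t_{prog} + b_{erase} \cdot t_{erase}$, where the $b$ terms are the measured bytes read/progged/erased and the $t$ terms are per-byte costs derived from the datasheets above. This is a simplification of what the benchmark scripts compute, but it captures the gist.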

Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)

And here are the ones I think are the most interesting:

Note that SD/eMMC is heavily penalized by the lack of on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.

  1. Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^

    This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:

    (commit, reads, progs, erases, sim, usage)

  2. Random writes, note we start with a 1MiB file. ^

    As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:

    (commit, reads, progs, erases, sim, usage)

  3. Logging, write 4 MiB, but limit the file to 1 MiB. ^

    In v2 this is accomplished by renaming the file, in v3 we can leverage lfs3_file_fruncate.

    v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:

    (commit, reads, progs, erases, sim, usage)

Funding ^

If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:

  1. Being able to complain about v3 not being released yet
  2. Being able to complain about the disk breakage v2 -> v3

I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.

If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.

As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.

Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!

Next steps ^

For me, I think it's time to finally put together a website/wiki/discussions/blog. I'm not sure on the frequency quite yet, but I plan to write/publish the new DESIGN.md in chapters in tandem with the remaining work.

EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html
EDIT: Updated with rdonly code/stack sizes
EDIT: Added link to #1114
EDIT: Implemented simple key-value APIs
EDIT: Added lfs3_migrate stretch goal with link to #1120
EDIT: Adopted lfs3_traversal_t -> lfs3_trv_t rename
EDIT: Added link to #1125 to clarify "feature parity"

geky added 30 commits April 16, 2025 15:22
Should've probably been two commits, but:

1. Cleaned up tracebd.py's header generation to be consistent with
   dbgbmap.py and other scripts.

   Percentage fields are now consistently floats in all scripts,
   allowing user-specified precision when punescaping.

2. Moved header generation up to where we still have the disk open (in
   dbgbmap[d3].py), to avoid issues with lazy Lfs attrs trying to access
   the disk after it's been closed.

   Found while testing with --title='cksum %(cksum)08x'. Lfs tries to
   validate the gcksum last minute and things break.
This took a bit of a messy route, but these scripts should be good to go
now.
This matches dbgbmap.py and fits in with the other dbg scripts.

The choice of tracebd.py for the name was arbitrary anyways, we just
needed something that wouldn't conflict with other scripts.
No one is realistically ever going to use this.

Ascii art is just too low resolution, trying to pad anything just wastes
terminal space. So we might as well not support --padding and save on
the additional corner cases.

Worst case, in the future we can always find this commit and revert
things.
This isn't true, especially for dbgbmap.py, 100% is very possible in
filesystems with small files. But by limiting padding to 99.9%, we avoid
the annoying wasted space caused by the rare but occasional 100.0%.
This automatically minimizes the status strings without flickering, all
it took was a bit of ~*global state*~.

---

If I'm remembering correctly, this was actually how tracebd.py used to
work before dbgbmap.py was added. The idea was dropped with dbgbmap.py
since dbgbmap.py relied on watch.py for real-time rendering and couldn't
persist state.

But now dbgbmap.py has its own -k/--keep-open flag, so that's not a
problem.
After all, who doesn't love a good bit of flickering.

I think I was trying to be too clever, so reverting.

Printing these with no padding is the simplest solution, provides the
best information density, and worst case you can always add -s1 to limit
the update frequency if flickering is hurting readability.
So now the result scripts always require -d/--diff to diff:

- before: ./scripts/csv.py a.csv -pb.csv
- after:  ./scripts/csv.py a.csv -db.csv -p

For a couple reasons:

- Easier to toggle
- Simpler internally to only have one diff path flag
- The previous behavior was a bit unintuitive
This affects the table renderers as well as csv.py's ratio expr.

This is a bit more correct, handwaving 0/0 (mapping 0/0 -> 100% is
useful for cov.py, please don't kill me mathematicians):

  frac(1,0) => 1/0 (∞%)
  frac(0,0) => 0/0 (100.0%)
  frac(0,1) => 0/1 (0.0%)
This prefix was extremely arbitrary anyways.

The prefix Csv* has slightly more meaning than R*, since these scripts
interact with .csv files quite a bit, and it avoids confusion with
rbyd-related things such as Rattr, Ralt, etc.
- CsvInt.x -> CsvInt.a
- CsvFloat.x -> CsvFloat.a
- Rev.x -> Rev.a

This matches CsvFrac.a (paired with CsvFrac.b), and avoids confusion
with x/y variables such as Tile.x and Tile.y.

The other contender was .v, since these are cs*v* related types, but
sticking with .a gets the point across that the name really doesn't have
any meaning.

There's also some irony that we're forcing namedtuples to have
meaningless names, but it is useful to have a quick accessor for the
internal value.
This allows -w to provide a shortform flag for both --wear and
--block-cycles, depending on if you include a cycles argument:

- -w    => --wear
- -w100 => --block-cycles=100

I was originally hesitant to add this since it's inconsistent from
--read/--prog/--erase, which can't have shortforms due to flag
conflicts, but --wear is probably a special enough case.
Two new tricks:

1. Hide the cursor while redrawing the ring buffer.

2. Build up the entire redraw in RAM first, and render everything in a
   single write call.

These _mostly_ get rid of the cursor flickering issues in rapidly
updating scripts.
Dang, this touched like every single script.
- Added TreeArt __bool__ and __len__.

  This was causing a crash in _treeartfrommtreertree when rtree was
  empty.

  The code was not updated in the set -> TreeArt class transition, and
  went unnoticed because it's unlikely to be hit unless the filesystem
  is corrupt.

  Fortunately(?) realtime rendering creates a bunch of transiently
  corrupt filesystem images.

- Tweaked lookupleaf to not include mroots in their own paths.

  This matches the behavior of leaf mdirs, and is intentionally
  different from btree's lookupleaf which needs to lookup the leaf rattr
  to terminate.

- Tweaked leaves to not remove the last path entry if it is an mdir.

  This hid the previous lookupleaf inconsistency. We only remove the
  last rbyd from the path because it is redundant, and for mdirs/mroots
  it should never be redundant.

  I ended up just replacing the corrupt check with an explicit check
  that the rbyd is redundant. This should be more precise and avoid
  issues like this in the future.

  Also adopted explicit redundant checks in Btree.leaves and
  Lfs.File.leaves.
This matches the behavior of paths and helps figure out which string is
associated with which crc32c/parity when checksumming multiple strings:

  $ ./scripts/crc32c.py -s hi hello
  f59dd9c2  hi
  9a71bb4c  hello

It also might help clear up confusion if someone forgets to quote a
string with spaces inside it.
This just gives dbgtag.py a few more bells and whistles that may be
useful:

- Can now parse multiple tags from hex:

    $ ./scripts/dbgtag.py -x 71 01 01 01 12 02 02 02
    71 01 01 01    altrgt 0x101 w1 -1
    12 02 02 02    shrubdir w2 2

  Note this _does_ skip attached data, which risks some confusion but
  not skipping attached data will probably end up printing a bunch of
  garbage for most use cases:

    $ ./scripts/dbgtag.py -x 01 01 01 04 02 02 02 02 03 03 03 03
    01 01 01 04    gdelta 0x01 w1 4
    03 03 03 03    struct 0x03 w3 3

- Included hex in output. This is helpful for learning about the tag
  encoding and also helps identify tags when parsing multiple tags.

  I considered also included offsets, which might help with
  understanding attached data, but decided it would be too noisy. At
  some point you should probably jump to dbgrbyd.py anyways...

- Added -i/--input to read tags from a file. This is roughly the same as
  -x/--hex, but allows piping from other scripts:

    $ ./scripts/dbgcat.py disk -b4096 0 -n4,8 | ./scripts/dbgtag.py -i-
    80 03 00 08    magic 8

  Note this reads the entire file in before processing. We'd need to fit
  everything into RAM anyways to figure out padding.
These mimic dbgtag.py, but provide debugging for the lower-level integer
primitives in littlefs:

  $ ./scripts/dbgleb128.py -x 2a 80 80 a8 01
  2a             42
  80 80 a8 01    2752512

  $ ./scripts/dbgle32.py -x 2a 00 00 00 00 00 2a 00
  2a 00 00 00    42
  00 00 2a 00    2752512

dbgleb128.py is probably going to be more useful, but I figured we might
as well include both for completeness. Though dbgle32.py is begging to
be generalized.
Do you see the O(n^2) behavior in this loop?

  j = 0
  while j < len(data):
      word, d = fromleb(data[j:])
      j += d

The slice, data[j:], creates a O(n) copy every iteration of the loop.

A bit tricky. Or at least I found it tricky to notice. Maybe because
array indexing being cheap is baked into my brain...

Long story short, this repeated slicing resulted in O(n^2) behavior in
Rbyd.fetch and probably some other functions. Even though we don't care
_too_ much about performance in these scripts, having Rbyd.fetch run in
O(n^2) isn't great.

Tweaking all from* functions to take an optional index solves this, at
least on paper.

---

In practice I didn't actually find any measurable performance gain. I
guess array slicing in Python is optimized enough that the constant
factor takes over?

(Maybe it's being helped by us limiting Rbyd.fetch to block_size in most
scripts? I haven't tested NAND block sizes yet...)

Still, it's good to at least know this isn't a bottleneck.
This is limited to dbgle32.py, dbgleb128.py, and dbgtag.py for now.

This more closely matches how littlefs behaves, in that we read a
bounded number of bytes before leb128 decoding. This minimizes bugs
related to leb128 overflow and avoids reading inherently undecodable
data.

The previous unbounded behavior is still available with -w0.

Note this gives dbgle32.py much more flexibility in that it can now
decode other integer widths. Uh, ignore the name for now. At least it's
self documenting that the default is 32-bits...

---

Also fixed a bug in fromleb128 where size was reported incorrectly on
offset + truncated leb128.
Now that I know my way around the weirdness that is Python's class
scope, this just required another function indirection to capture the
class-level dicts correctly.

I was considering using the __subclasses__ trick, but it seems like that
would actually be more complicated here.
This replaces the previous fallback-to-what's-available behavior with
explicit flags:

- --tile-code - Tile based on code size (the default)
- --tile-stack - Tile based on stack limits
- --tile-frames - Tile based on stack frames
- --tile-ctx - Tile based on function context
- --tile-1 - Tile functions evenly

This has the benefit of 1. being easier to toggle, 2. being explicit,
and 3. allowing code/stack/ctx in punescapes (titles, labels, etc).

There is an interesting question if --no-stack should be implicit, since
showing two stack treemaps may be confusing, but I think that's trying
to be too clever. Instead I just added the -S/--no-stack shortform to
make it easier to toggle.

Also updated ctx.py's description string. Probably need to check what
else is out of date in other scripts as well.
Mainly adopting the added flexibility in csv.py, also adding make
codemap-svg and friends for code map generation:

- Split result commands into separate result, result-csv, and
  result-diff commands so csv generation is explicit.

  So make result no longer implicitly overwrites csv files:

    make code
    make code-csv  -.
    make code       |
    make code-diff <'

  This gives more control over result diffing.

  make code-csv _is_ more or less just a dependency on the lfs.code.csv
  rule, but it avoids BUILDDIR mess and is easier to remember.

- Added make codemap/stackmap/ctxmap for in-terminal code/stack/ctx
  ascii art.

  I was a bit on the fence on these, since the result is more pretty
  than useful, but eh, can always drop them in the future.

- Added make codemap-svg/codemap-tiny-svg for generating interactive
  codemap svgs.

  This raised an interesting question if the make commands should
  generate light or dark mode svgs. I settled on dark mode since that's
  what I personally find the most useful.

  I think the way this will breakdown is with dark mode generally used
  for development, and light mode generally used for published material.
  And it's not too hard to run the script outside of the Makefile for
  publishing. Or override CODEMAPFLAGS.

- Adopted implicit prefixing, -q, etc. This simplifies some of the more
  complicated csv.py invocations (make summary, make funcs, etc).

See make help for a full list of commands.
So now:
               (block_size)
  mbits = nlog2(----------) = nlog2(block_size) - 3
               (     8    )

Instead of:

               (     (block_size))
  mbits = nlog2(floor(----------)) = nlog2(block_size & ~0x7) - 3
               (     (     8    ))

This makes the post-log - 3 formula simpler, which we probably want to
prefer as it avoids a division. And ceiling is arguably more intuitive
corner case behavior.

This may seem like a minor detail, but because mbits is purely
block_size derived and not configurable, any quirks here will become
a permanent compatibility requirement.

And hey, it saves a couple bytes (I'm not really sure why, the division
should've been optimized to a shift):

           code          stack          ctx
  before: 35528           2440          636
  after:  35520 (-0.0%)   2440 (+0.0%)  636 (+0.0%)
- mdir_bits -> mbits
- lfsr_mid_bid -> lfsr_mbid
- lfsr_mid_rid -> lfsr_mrid

These now match the naming in the dbg scripts.

I feel like this is more terse in a way that is also more readable, but
maybe that's just me.
So mbid=0 now implies the mdir is not inlined.

Downsides:

- A bit more work to calculate
- May lose information due to masking everything when mtree.weight==0
- Risk of confusion when in-lfs.c state doesn't match (mbid=-1 is
  implied by mtree.weight==0)

Upsides:

- Includes more information about the topology of the mtree
- Avoids multiple dbgmbids for the same physical mdir

Also added lfsr_dbgmbid and lfsr_dbgmrid to help make logging
easier/more consistent.

And updated dbg scripts.
The exception being LFS_DEBUG. A bit inconsistent, but at least
consistent with LFS_ERR* vs LFS_ERROR, and may help reduce name
conflicts:

- LFS_DEBUGRBYDFETCHES -> LFS_DBGRBYDFETCHES
- LFS_DEBUGRBYDBALANCE -> LFS_DBGRBYDBALANCE
- LFS_DEBUGRBYDCOMMITS -> LFS_DBGRBYDCOMMITS
- LFS_DEBUGBTREEFETCHES -> LFS_DBGBTREEFETCHES
- LFS_DEBUGBTREECOMMITS -> LFS_DBGBTREECOMMITS
- LFS_DEBUGMDIRFETCHES -> LFS_DBGMDIRFETCHES
- LFS_DEBUGMDIRCOMMITS -> LFS_DBGMDIRCOMMITS
- LFS_DEBUGALLOCS -> LFS_DBGALLOCS
This was a surprising side-effect of the script rework: Realizing the
internal btree/rbyd lookup APIs were awkwardly inconsistent and could be
improved with a couple tweaks:

- Adopted lookupleaf name for functions that return leaf rbyds/mdirs.

  There's an argument this should be called lookupnextleaf, since it
  returns the next bid, unlike lookup, but I'm going to ignore that
  argument because:

  1. A non-next lookupleaf doesn't really make sense for trees where
     you don't have to fetch the leaf (the mtree)

  2. It would be a bit too verbose

- Adopted commitleaf name for functions that accept leaf rbyds.

  This makes the lfsr_bshrub_commit -> lfsr_btree_commit__ mess a bit
  more readable.

- Strictly limited lookup and lookupnext to return rattrs, even in
  complex trees like the mtree.

  Most use cases will probably stick to the lookupleaf variants, but at
  least the behavior will be consistent.

- Strictly limited lookup to expect a known bid/rid.

  This only really matters for lfsr_btree/bshrub_lookup, which as a
  quirk of their implementation _can_ lookup both bid + rattr at the
  same time. But I don't think we'll need this functionality, and
  limiting the behavior may allow for future optimizations.

  Note there is no lfsr_file_lookup. File btrees currently only ever
  have a single leaf rattr, so this API doesn't really make sense.

Internal API changes:

- lfsr_btree_lookupnext_ -> lfsr_btree_lookupleaf
- lfsr_btree_lookupnext  -> lfsr_btree_lookupnext
- lfsr_btree_lookup      -> lfsr_btree_lookup
- added                     lfsr_btree_namelookupleaf
- lfsr_btree_namelookup  -> lfsr_btree_namelookup
- lfsr_btree_commit__    -> lfsr_btree_commit_
- lfsr_btree_commit_     -> lfsr_btree_commitleaf
- lfsr_btree_commit      -> lfsr_btree_commit

- added                     lfsr_bshrub_lookupleaf
- lfsr_bshrub_lookupnext -> lfsr_bshrub_lookupnext
- lfsr_bshrub_lookup     -> lfsr_bshrub_lookup
- lfsr_bshrub_commit_    -> lfsr_bshrub_commitleaf
- lfsr_bshrub_commit     -> lfsr_bshrub_commit

- lfsr_mtree_lookup      -> lfsr_mtree_lookupleaf
- added                     lfsr_mtree_lookupnext
- added                     lfsr_mtree_lookup
- added                     lfsr_mtree_namelookupleaf
- lfsr_mtree_namelookup  -> lfsr_mtree_namelookup

- added                     lfsr_file_lookupleaf
- lfsr_file_lookupnext   -> lfsr_file_lookupnext
- added                     lfsr_file_commitleaf
- lfsr_file_commit       -> lfsr_file_commit

Also added lookupnext to Mdir/Mtree in the dbg scripts.

Unfortunately this did add both code and stack, but only because of the
optional mdir returns in the mtree lookups:

           code          stack          ctx
  before: 35520           2440          636
  after:  35548 (+0.1%)   2472 (+1.3%)  636 (+0.0%)
This lets us cram in one more mask for potential redund bits:

  name                 tag    mask
  LFSR_TAG_MASK0    0x0000  0x0fff  ---- 1111 1111 1111
  LFSR_TAG_MASK2    0x1000  0x0ffc  ---- 1111 1111 11--
  LFSR_TAG_MASK8    0x2000  0x0f00  ---- 1111 ---- ----
  LFSR_TAG_MASK12   0x3000  0x0000  ---- ---- ---- ----
                                    '.-' '.-' '---.---'
                          mode bits -'    |       |   ^
                            suptype ------'       |   |
                            subtype --------------'   |
                        redund bits ------------------'

I toyed around with a bitwise alternative to the lookup table, but
couldn't come up with anything simpler than these:

- 0xfff & ~((((1<<((i>>1)*8))-1) << ((i&1)*4)) | ((1<<(i*2))-1))
- 0xfff & ~((1 << (((i>>1)*8)+((i&1)<<(1+(i>>1)))))-1)
- 0xfff & ~((1<<(2*i*i))-1) (requires multiply and 32-bit shift)

---

This also replaces the mdir/rbyd/btree/mtree lookup/sublookup/suplookup
functions with a single flexible lookup function that accepts tag masks.

This ended up adding a bit of code/stack (the extra NULL args are
surprisingly pricey), but will hopefully make the redund bits
easier/cheaper to use:

           code          stack          ctx
  before: 35548           2472          636
  after:  35584 (+0.1%)   2480 (+0.3%)  636 (+0.0%)
So now crystal_thresh only controls when fragments are compacted into
blocks, while fragment_thresh controls when blocks are broken into
fragments. Setting fragment_thresh=-1 will follow crystal_thresh and
keeps the previous behavior.

These were already two separate pieces of logic, so it makes sense to
provide two separate knobs for tuning.

Setting fragment_thresh lower than crystal_thresh has some potential to
reduce hysteresis in cases where random writes push blocks close to
crystal_thresh. It will be interesting to explore this more when
benchmarking.

---

The additional config option adds a bit of code/ctx, but hopefully that
will go away in the future config rework:

           code          stack          ctx
  before: 35584           2480          636
  after:  35600 (+0.0%)   2480 (+0.0%)  640 (+0.6%)
geky added 5 commits July 18, 2025 16:42
- Moved block allocator definitions into their own dedicated block
  before the lfs3_bptr_t stuff:

  - lfs3_alloc_discard
  - lfs3_alloc_ckpoint
  - lfs3_alloc

  I'm looking into adding lfs3_alloc related flags, and these aren't
  really predeclarable like C's function prototypes.

  Predeclaring these before lfs3_bptr_t is relevant if we ever add
  lfs3_bptr_alloc.

  Also touched up relevant comments a bit.

- Also added inline to lfs3_alloc_ckpoint and lfs3_alloc_discard, which
  was strangely missing?

  The compiler figured it out anyways, so this has no impact on code
  cost.

- Moved ecksum definitions below lfs3_bptr_t stuff.

  I don't really know where to put these, but close to the lfs3_rbyd_t
  stuff makes sense.

- And updated lazy appendrattr_ ordering to match source code order.

No code changes.
Currently this just has one flag that replaces the previous `erase`
argument:

  LFS3_ALLOC_ERASE  0x00000001  Please erase the block

Benefits include:

- Slightly better readability at lfs3_alloc call sites.

- Possibility of more allocator flags in the future:

  - LFS3_ALLOC_EMERGENCY - Use reserved blocks

  - Uh, that's all I can think of right now

No code changes.
The isempty check has already been deduplicated. Deduplicating this any
further starts to feel like a code smell.
This just wraps up block allocation + struct initialization similarly to
lfs3_rbyd_alloc.

Also tweaked bptr updates in lfs3_file_crystallize__ to mutate the bptr
fields directly instead of going through lfs3_bptr_init.

Neither of these impacted code cost, everything ends up inlined in
lfs3_file_crystallize__ anyways. Hopefully it helps with readability at
least.

Also LFS3_KVONLY mode is completely broken, and fixing it is not a huge
priority. I don't think it makes sense to adopt lfs3_bptr_alloc in
lfs3_file_flushset_ anyways, it would just lead to us initializing the
lfs3_bptr_t struct twice for no real reason.

No code changes.
- lfs3_omdir_t -> lfs3_handle_t
- lfs3.omdirs -> lfs3.handles
- o -> h
- lfs3_omdir_* -> lfs3_handle_*
- lfs3_omdir_ismidopen -> lfs3_mid_isopen

From conversations with users, the term "handle" or "file handle" seems
to be the most common/easily understood term for the lfs3_file_t struct
itself. It makes sense to adopt this in our codebase.

I usually dislike inventing new names for things when prefixes can imply
a relationship (size -> ssize, cache -> rcache, shrub -> bshrub, etc),
but lfs3_omdirs_t was probably a bit much.
geky added 18 commits July 18, 2025 18:28
- test_traversal -> test_trvs
- lfs3_traversal_t -> lfs3_trv_t
- lfs3_btraversal_t -> lfs3_btrv_t
- t -> trv
- bt -> btrv
- lfs3_traversal_* -> lfs3_trv_*
- lfs3_btraversal_* -> lfs3_btrv_*

The traversal type is becoming one of the more fundamental types in
littlefs, and if DIR and REG both get shortened names, it makes sense
for TRV to have one as well.

This also removes the temptation to use t for traversals, which is
probably an even worse name.

---

Note that lfs3_btree_traverse, lfs3_mtree_traverse, etc, remain
unaffected. This may change in the future, but it's interesting to note
that verbs seem to need much less typing than nouns.
Not sure how this got overlooked. Now that graft traversals are
implemented directly in lfs3_alloc, there's no reason to store this
state globally.

Fortunately this was in a union, so it didn't actually show up in our
ctx measurements.

No code changes.
Note this includes both the lfs3_config -> lfs3_cfg structs as well as
the LFS3_CONFIG -> LFS3_CFG include define:

- LFS3_CONFIG -> LFS3_CFG
- struct lfs3_config -> struct lfs3_cfg
- struct lfs3_file_config -> struct lfs3_file_cfg
- struct lfs3_*bd_config -> struct lfs3_*bd_cfg
- cfg -> cfg

We were already using cfg as the variable name everywhere. The fact that
these names were different was an inconsistency that should be fixed
since we're committing to an API break.

LFS3_CFG is already out-of-date from upstream, and there's plans for a
config rework, but I figured I'd go ahead and change it as well to lower
the chances it gets overlooked.

---

Note this does _not_ affect LFS3_TAG_CONFIG. Having the on-disk vs
driver-level config take slightly different names is not a bad thing.
- enum lfs3_error -> enum lfs3_err
- err -> err

Really this just updates `enum lfs3_err` to match the prefixes used
everywhere else. And because enum types are kind of useless in C, this
has no effect on any other part of the codebase.
- enum lfs3_scmp -> enum lfs3_cmp
- cmp -> cmp

lfs3_scmp_t is still used as the type, as the s prefix indicates the
type is signed, usually for muxing with error codes.

I think that led to the enum also being named lfs3_scmp, but that's not
quite right.

But none of this really matters because enums are so useless and broken
in C.
The --no-internal flag avoids building any internal tests/benches
(tests/benches with in="lfs3.c"), which can be useful for quickly
testing high-level things while refactoring. Refactors tend to break all
the internal tests, and it can be a real pain to update everything.

Note that --no-internal can be injected into the build with TESTCFLAGS:

  TESTCFLAGS=--no-internal make test-runner -j \
      && ./scripts/test.py -j -b

For a curious data point, here's the current number of
internal/non-internal tests:

                suites          cases                  perms
  total:            24            808          633968/776298
  internal:         22 (91.7%)    532 (65.8%)  220316/310247 (34.8%)
  non-internal:      2 ( 8.3%)    276 (34.2%)  413652/466051 (65.2%)

It's interesting to note that while internal tests have more test cases,
the non-internal tests generate a larger number of test permutations.
This is probably because internal tests tend to target specific corner
cases/known failure points, and don't invite many variants.

---

While --no-internal may be useful for high-level testing during a
refactor, I'm not sure it's a good idea to rely on it for _debugging_ a
refactor.

The whole point of internal testing is to catch low-level bugs early,
with as little unnecessary state as possible. Skipping these to debug
integration tests is a bit counterproductive!
Note --list-suite-paths was already skipping case-less suites! I think
only -Y/--summary was an outlier.

This is consistent with test.py's matching of suite ids when no cases
are found (test_runner itself doesn't really care, it just reports no
matching cases). Though we do still compile case-less suites and include
them in the test_suites array, which may be confusing in the future.
This is an indulgence to simplify the upcoming auxiliary btree work.

Brings back the previously-reverted per-btree leaf caches, where each
lfs3_btree_t keeps track of two rbyds: The root and the most recently
accessed leaf.

At the surface level, this optimizes repeated access to the same btree
leaf. A common pattern for a number of littlefs's operations that has
proven tricky to manually optimize:

- Btree iteration
- Pokes for our crystalization heuristic
- Checksum collision resolution for dids and (FUTURE) ddkeys
- Related rattrs attached to a single bid

But the real motivation is to drop lfs3_btree_*lookupleaf and simplify
the internal APIs. If repeated lfs3_btree_lookup*s are already
efficient, there's no reason for extra leaf-level APIs, and in theory
any logic that interacts with btrees will be simpler.

---

This comes at a cost (humorously about the same amount as the
tag-returning refactor, if you ignore the extra 28 bytes of ctx).
Unsurprisingly, increasing the size of lfs3_btree_t has the biggest
impact on stack and ctx:

           code          stack          ctx
  before: 36084           2336          656
  after:  36784 (+1.9%)   2400 (+2.7%)  684 (+4.3%)

Also note from the previous commit messages: Btree leaf caching has
resulted in surprisingly little performance improvement for our current
benchmarks + implementation. It turns out if you're dominated by write
cost, optimizing btree lookups -- which already skip rbyd fetches -- has
a barely noticeable impact.

---

A note on reverting!

Eventually (after the auxiliary btree work) it will probably make sense
to revert this -- or at least provide a non-leaf-caching build for
code/RAM sensitive users.

I don't think this should be reverted as-is. Instead, I think we should
allow the option to just disable the leaf cache, while keeping the
simpler internal API. This would give us the best of all three worlds:

- A small code/RAM option
- Optimal btree iteration/nearby-lookup performance
- Simpler internal APIs

The only reason this isn't already implemented is because I want to
avoid fragmenting the codebase further while we're still in development
mode.
What a mouthful.

The unconditional bshrub leaf discarding in lfs3_mdir_commit was copied
from the previous btree leaf caching implementation, but discarding
_all_ bshrub leaves on _every_ mdir commit is a bit insane.

Really, the only bshrub leaves that ever need to be discarded here are
the shrub roots, which are already questionable leaf caching targets
because they're already cached as the root rbyd.

An alternative option would be to just never cache shrub roots, but
tinkering around with the idea showed it would be more costly than
conditionally discarding leaves in lfs3_mdir_commit. At least here we
can reuse some of the logic that discards file leaves.
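
Something like the following predicate captures the idea, though the
helper/field names here are guesses at the internals, not the actual
code:

  // a hedged sketch: only cached leaves that are really shrub roots
  // live in the mdir being committed, so only those need to be dropped
  static inline bool lfs3_bshrub_leafstale(
          const lfs3_bshrub_t *bshrub, const lfs3_mdir_t *mdir) {
      return lfs3_rbyd_isshrub(&bshrub->leaf.rbyd)
              && bshrub->leaf.rbyd.blocks[0] == mdir->r.blocks[0];
  }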

I'm also probably overthinking what is only a small code cost:

           code          stack          ctx
  before: 36784           2400          684
  after:  36792 (+0.0%)   2400 (+0.0%)  684 (+0.0%)

This doesn't take into account how much CPU time is spent creating rbyd
copies, but that is not something we are optimizing for.
I was confused, but this commit->bid update is used to limit the
commit->bid to the btree weight. Note we limit the bid after storing it
as the initial rid.
This comes from an observation that we never actually use the leaf cache
during traversals, and there is surprisingly little risk of a lookup
creating a conflict in the future.

Btree traversals fall into two categories:

1. Full traversals, where we traverse a full btree all at once. These
   are unlikely to have lookup conflicts because everything is
   usually self-contained in one chunk of logic.

2. Incremental traversals. These _are_ at risk, but in our current
   design limited to lfs3_trv_t, which already creates a fully
   bshrub/btree copy for tracking purposes.

   This copy unintentionally, but conveniently, protects against lookup
   conflicts.

So, why not reuse the btree leaf cache to hold the rbyd state during
traversals? In theory this makes lfs3_btree_traverse the same cost as
lfs3_btree_lookupnext, drops the need for lfs3_btrv_t, and simplifies
the internal API.

The only extra bit of state we need is the current target bid, which is
now expected as a caller-incremented argument similar to
lfs3_btree_lookupnext iteration.

There was a bit of futzing around with bid=-1 being necessary to
initialize traversal (to avoid conflicts with bid=-1 => 0 caused by
empty btrees). But the end result is a btree traversal that only needs
one extra word of state.
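
The resulting iteration pattern looks roughly like the sketch below;
the signature, error convention, and increment step are all made up
here, only the overall shape matches what's described above:

  // a hedged sketch of iterating over a btree's nodes with only one
  // extra word of caller-side state
  static int example_traverse(lfs3_t *lfs3, lfs3_btree_t *btree) {
      lfs3_sbid_t bid = -1;  // bid=-1 kicks off the traversal
      while (true) {
          lfs3_rbyd_t rbyd;
          int err = lfs3_btree_traverse(lfs3, btree, bid, &rbyd);
          if (err == LFS3_ERR_NOENT) {
              return 0;  // ran off the end of the btree
          } else if (err) {
              return err;
          }

          // ... do something with the visited rbyd ...

          bid += 1;  // caller-incremented, like lfs3_btree_lookupnext
      }
  }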

---

Unfortunately, in practice, the savings were not as great as expected:

           code          stack          ctx
  before: 36792           2400          684
  after:  36876 (+0.2%)   2384 (-0.7%)  684 (+0.0%)

This does claw back some stack, but less than a full rbyd due to the
union with the mtortoise in lfs3_trv_t. The mtortoise now dominates. It
might be possible to union the mtortoise and the bshrub/btree state
better (both are not needed at the same time), but strict aliasing rules
in C make this tricky.

The new lfs3_btree_traverse is also a bit more complicated in terms of
code cost. In theory this would be offset by the simpler traversal setup
logic, but we only actually call lfs3_btree_traverse twice:

1. In lfs3_mtree_traverse
2. In lfs3_file_ck

Still, some stack savings + a simpler internal API makes this worthwhile
for now. lfs3_trv_t is also due for a revisit, and hopefully it's
possible to better union things with btree leaf caches somehow.
This matches other internal rbyds: btree.r, mdir.r, etc.

The intention of the single-char names is to reduce clutter around these
severely nested structs. Both btrees and mdirs _are_ rbyds, so the name
doesn't really add anything besides C-level type info.

I was hesitant on btree.leaf.rbyd, but decided consistency probably wins
here.
Looks like these traversal states were missed in the omdir -> handle
rename. I think HANDLES and HBTREE states make sense:

- LFS3_TSTATE_OMDIRS -> LFS3_TSTATE_HANDLES
- LFS3_TSTATE_OBTREE -> LFS3_TSTATE_HBTREE
Just moving away from the *trv when unnecessary. This matches the h
variable used for local iteration.
This forces our cycle detection tortoise (previously trv.u.mtortoise)
into the unused shrub leaf via pointer shenanigans.

This reclaims the remaining stack (and apparently code) we theoretically
gained from the btree traversal rework, up until the compiler got in the
way:

           code          stack          ctx
  before: 36876           2384          684
  after:  36852 (-0.1%)   2368 (-0.7%)  684 (+0.0%)

And it only required some _questionably_ defined behavior.

---

It's probably not well-defined behavior, but trying to understand what
the standard actually means on this is giving me a headache. I think I
have to agree C99+strict-aliasing lost the plot on this one. Note
mtortoise is only ever written/read through the same type.
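
For reference, the hazard in isolation looks something like this (not
littlefs code, just a minimal illustration of the pattern the
effective-type rules object to):

  #include <stdint.h>

  struct leaf     { uint32_t blocks[2]; uint32_t trunk; };
  struct tortoise { uint32_t blocks[2]; uint32_t dist;  };

  uint32_t example(void) {
      struct leaf storage = {0};                        // the unused leaf
      struct tortoise *t = (struct tortoise*)&storage;  // the shenanigans
      // formally questionable: storage's effective type is struct leaf,
      // so accessing it as struct tortoise violates strict aliasing,
      // even though the layouts are identical
      t->dist = 1;
      return t->dist;
  }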

What I want:

  lfs3_trv_t:          lfs3_bshrub_t:       lfs3_handle_t:
  .---+---+---+---. .. .---+---+---+---. .. .---+---+---+---.
  |     handle    |    |     handle    |    |     handle    |
  |               |    |               |    |               |
  +---+---+---+---+    +---+---+---+---+ .. '---+---+---+---'
  |   root rbyd   |    |   root rbyd   |
  |               |    |               |    lfs3_mtortoise_t:
  +---+---+---+---+    +---+---+---+---+ .. .---+---+---+---.
  |   leaf rbyd   |    |   leaf rbyd   |    |   mtortoise   |
  |               |    |               |    |               |
  +---+---+---+---+    +---+---+---+---+ .. '---+---+---+---'
  | staging rbyd  |    | staging rbyd  |
  |               |    |               |
  +---+---+---+---+ .. '---+---+---+---'
  |               |
  :               :

But I'm starting to think this is simply not possible in modern C.

At least this shows what is theoretically possible if we didn't have to
fight the compiler.
I think modern C simply doesn't let us do what we want to do here, so
I'm giving up, discarding the lfs3_mtortoise_t type, and just abusing
various unrelated shrub fields to implement the tortoise. This
sacrifices readability, but at least avoids undefined behavior without
a RAM penalty:

- shrub.blocks => tortoise blocks
- shrub.weight => cycle distance
- shrub.eoff => power-of-two bound

Note this keeps trunk=0, which is a nice safety net in case some code
ever tries to read from the shrub in the future.

Fortunately the mtortoise logic is fairly self-contained in
lfs3_mtree_traverse_, so with enough comments hopefully the code is not
too confusing.
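
For reference, the tortoise step then looks roughly like the Brent-style
sketch below. The type/helper/error names are assumptions, only the
field mapping above comes from the actual change:

  // a hedged sketch of the Brent-style tortoise step; the real logic
  // lives in lfs3_mtree_traverse_
  static int lfs3_mtortoise_step(lfs3_shrub_t *tortoise,
          const lfs3_block_t mblocks[2]) {
      // looped back to the tortoise? => mtree cycle => corrupt
      if (mblocks[0] == tortoise->blocks[0]
              && mblocks[1] == tortoise->blocks[1]) {
          return LFS3_ERR_CORRUPT;
      }
      // teleport the tortoise every power-of-two steps
      if (tortoise->weight == tortoise->eoff) {
          tortoise->blocks[0] = mblocks[0];
          tortoise->blocks[1] = mblocks[1];
          tortoise->weight = 0;
          tortoise->eoff = (tortoise->eoff == 0) ? 1 : 2*tortoise->eoff;
      }
      tortoise->weight += 1;
      return 0;
  }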

---

Apparently shaves off a couple more bytes of code. I'm guessing this is
just because of the slightly different struct offsets (we're reusing the
root's rbyd instead of the leaf's rbyd now):

           code          stack          ctx
  before: 36852           2368          684
  after:  36844 (-0.0%)   2368 (+0.0%)  684 (+0.0%)
This fixes a strict aliasing violation in lfs3_file_sync_, where we cast
the file cache -> lfs3_data_t to avoid an extra stack allocation, by
modifying the file's cache struct to use an lfs3_data_t directly.

- file.cache.pos -> file.cache.pos
- file.cache.buffer -> file.cache.d.u.buffer_
- file.cache.size -> file.cache.d.size
- (const lfs3_data_t*)&file->cache -> &file->cache.d

Note the underscore_ in file.cache.d.u.buffer_. This did not fit
together as well as I had hoped, due to different const expectations
between the file cache and lfs3_data_t.

Up until this point lfs3_data_t has only been used to refer to const
data (ignoring side-band pointer casting in lfs3_mtree_traverse*), while
the file cache very much contains mutable data. To work around this I
added data.u.buffer_ as a mutable variant, which works, but risks an
accidental const violation in the future.
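
In other words, lfs3_data_t grew something like the following shape.
The layout here is my best guess at what's described above, with other
union members omitted:

  // roughly the shape of the data-backed cache experiment
  typedef struct lfs3_data {
      lfs3_size_t size;
      union {
          const uint8_t *buffer;  // the normal, const view of the data
          uint8_t *buffer_;       // mutable escape hatch for the file cache
      } u;
  } lfs3_data_t;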

---

Unfortunately this does come with a minor RAM cost, since we no longer
hide file.cache.pos in lfs3_data_t's buffer padding:

           code          stack          ctx
  before: 36844           2368          684
  after:  36832 (-0.0%)   2376 (+0.3%)  684 (+0.0%)

  lfs3_file_t before: 164
  lfs3_file_t after:  168 (+2.4%)

I think it's pretty fair to call C's strict aliasing rules a real wet
blanket. It would be interesting to create a -fno-strict-aliasing
variant of littlefs in the future, to see how much code/RAM could be
saved if we were given free rein to abuse the available memory.

Probably not enough to justify the extra work, but it would be an
interesting experiment.
This abandons the data-backed cache idea due to concerns around
readability and maintainability. Mixing const/mutable buffers in
lfs3_data_t was not great.

Instead, we now just allocate an indirect lfs3_data_t on the stack in
lfs3_file_sync_ to avoid the previous undefined behavior.
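
Something like the sketch below, where the initializer details are
assumptions based on the earlier field mapping:

  // hedged sketch: build a temporary lfs3_data_t on the stack instead
  // of reinterpreting the cache struct
  //
  // before (strict aliasing violation):
  //   const lfs3_data_t *data = (const lfs3_data_t*)&file->cache;
  //
  // after:
  lfs3_data_t data = {
      .size = file->cache.size,
      .u.buffer = file->cache.buffer,
  };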

This actually results in less total stack usage, due to lfs3_file_t
allocations in lfs3_set/read, and avoids the more long-term memory cost
in lfs3_file_t:

              code          stack          ctx
  before:    36832           2376          684
  after:     36840 (+0.0%)   2368 (-0.3%)  684 (+0.0%)

Oh. And lfs3_file_sync_ isn't even on the stack hot-path, so this is a
net benefit over the previous cache -> data cast:

              code          stack          ctx
  before sa: 36844           2368          684
  after sa:  36840 (-0.0%)   2368 (+0.0%)  684 (+0.0%)

Still less cool though.