[Data Liberation] Entity Stream Importer #1980

@adamziel

Let's build plumbing to load data into WordPress.

I think any data source can be represented as a stream of structured entities.

  • WP_WXR_Reader sources them from a WXR file
  • A markdown importer could do the same for markdown files
  • A WordPress -> WordPress transfer could work the same way
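
To make the "stream of structured entities" idea concrete, here is a minimal sketch in Python. It is not how WP_WXR_Reader works; the `Entity` shape, `markdown_entities`, and `import_entities` are hypothetical names chosen for illustration. The point is only that the consumer sees a stream of entities and never the underlying data source.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator


@dataclass
class Entity:
    """Hypothetical entity shape: a kind, an identifier, and a payload."""
    kind: str                      # e.g. "post", "comment", "post_meta"
    id: Any                        # identifier within the source
    data: Dict[str, Any] = field(default_factory=dict)


def markdown_entities(documents: Dict[str, str]) -> Iterator[Entity]:
    """Hypothetical source: yield one "post" entity per markdown document.

    `documents` maps a file path to the file's text.
    """
    for path, text in documents.items():
        # Use the first "# " heading as the title, falling back to the path.
        title = next(
            (line[2:].strip() for line in text.splitlines() if line.startswith("# ")),
            path,
        )
        yield Entity(kind="post", id=path, data={"title": title, "content": text})


def import_entities(stream: Iterable[Entity]) -> Dict[str, int]:
    """Hypothetical consumer: it only sees entities, not where they came from.

    A real importer would write to the database; this one just counts by kind.
    """
    counts: Dict[str, int] = {}
    for entity in stream:
        counts[entity.kind] = counts.get(entity.kind, 0) + 1
    return counts
```

A WXR reader or a WordPress -> WordPress source would plug into the same `import_entities` call by yielding the same kind of objects.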

See this relevant visual from WordPress for Docs:

Image

Importing data

WXR importers must answer these questions:

  • What if a post with a given ID does or doesn't exist?
  • What if there's a partial difference between the two posts? Do we ignore it? Reconcile? Ask the user? Which post wins?
  • What if the author does or doesn't exist in the database?
  • Ditto for tags, categories, post meta etc.
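
One way to frame these questions is as a conflict policy that the importer applies whenever an incoming entity meets an existing one. The sketch below is an assumption for illustration, not an existing importer API: `OnConflict` and `resolve` are hypothetical names, and "ask the user" is left out because it needs an interactive step.

```python
from enum import Enum
from typing import Any, Dict, Optional


class OnConflict(Enum):
    """Hypothetical policies for an entity that already exists on the site."""
    KEEP_EXISTING = "keep_existing"   # ignore the incoming version
    TAKE_INCOMING = "take_incoming"   # the incoming version wins
    MERGE_FIELDS = "merge_fields"     # reconcile field by field


def resolve(
    existing: Optional[Dict[str, Any]],
    incoming: Dict[str, Any],
    policy: OnConflict,
) -> Dict[str, Any]:
    """Return the record to store for one post, author, tag, or meta entry."""
    if existing is None:
        # Nothing to reconcile: the entity does not exist yet.
        return dict(incoming)
    if policy is OnConflict.KEEP_EXISTING:
        return dict(existing)
    if policy is OnConflict.TAKE_INCOMING:
        return dict(incoming)
    # MERGE_FIELDS: keep existing fields, let incoming fields override them.
    return {**existing, **incoming}
```

The same function could serve posts, authors, tags, and post meta, with a different policy per entity kind if needed.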

Let's view a WXR file as a flat list of entity objects such as posts, comments, meta, etc. We can now represent a lot of scenarios as list concatenation:

  • Importing WXR into a WordPress site is WordPress Entities ++ WXR Entities
  • Importing two WXR files is WXR Entities ++ WXR Entities
  • Pausing and resuming WXR import is Entities before pause ++ Entities after pause
  • Importing WordPress -> WordPress is WordPress 1 Entities ++ WordPress 2 Entities
  • Syncing WP -> WP is WordPress 1 Entities ++ WordPress 2 Entities ++ WordPress 1 Deletions ++ WordPress 2 Deletions

From there, we'd need to reduce those concatenated lists so that each object is represented by zero or one entries.
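
Here is a rough sketch of that concatenate-then-reduce step in Python. It assumes a "last entry wins" rule and models deletions as entities with a `deleted` flag. Both are assumptions made for illustration, as are the `Entity` shape and the `reduce_entities` name; the actual reconciliation rules are exactly what this issue leaves open.

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import Any, Dict, Iterable, List, Optional, Tuple


@dataclass
class Entity:
    """Hypothetical entity shape; `deleted=True` marks a deletion record."""
    kind: str
    id: Any
    data: Dict[str, Any] = field(default_factory=dict)
    deleted: bool = False


def reduce_entities(*streams: Iterable[Entity]) -> List[Entity]:
    """Concatenate the streams (the `++` above), then keep at most one
    entry per (kind, id). Assumed rule: the last entry seen wins."""
    latest: Dict[Tuple[str, Any], Optional[Entity]] = {}
    for entity in chain(*streams):
        key = (entity.kind, entity.id)
        # A deletion leaves zero entries for the object; anything else
        # replaces whatever was seen earlier for the same key.
        latest[key] = None if entity.deleted else entity
    return [entity for entity in latest.values() if entity is not None]


site = [Entity("post", 1, {"title": "Old"}), Entity("post", 2, {"title": "Keep"})]
wxr = [Entity("post", 1, {"title": "New"}), Entity("comment", 1, {"text": "Hi"})]
deletions = [Entity("post", 2, deleted=True)]

result = reduce_entities(site, wxr, deletions)
```

With these inputs, `result` holds post 1 with the title "New" and comment 1, while post 2 is gone. Because the reduction only depends on the order of the concatenated list, splitting a stream at a pause point and passing the two halves separately gives the same result as passing the whole stream.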

This is already similar to journaling MEMFS to OPFS in the Playground webapp. It also resembles map/reduce problems where parts of the processing can be parallelized while other parts must be processed sequentially.

I bet we can find a unified way of reasoning about all these scenarios and build a single data ingestion pipeline for any data source.

Let's see how far we can get with symbols and reasoning before writing code. I'm sure there are existing white papers and open source projects working through this exact problem.

Resources

  • Existing WXR importers
  • Importers from other data formats
  • Site sync plugins

cc @brandonpayton
