Skip to content

Mooncake-Labs/moonlink

Repository files navigation

Moonlink ๐Ÿฅฎ

managed ingestion engine for Apache Iceberg

License Slack Twitter Docs

Overview

Moonlink is an Iceberg-native ingestion engine bringing streaming inserts and upserts to your lakehouse.

Ingest Postgres CDC, event streams (Kafka), and OTEL into Iceberg without complex maintenance and compaction.

Moonlink buffers, caches, and indexes data so Iceberg tables stay read-optimized.


             โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€moonlinkโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                         
             โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€Icebergโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
             โ”‚  โ”‚                       โ”‚  โ”‚  โ”‚      obj. store     โ”‚
Postgres โ”€โ”€โ”€โ–บโ”‚  โ”‚โ”Œ โ”€ โ”€ โ”€ โ”€ โ” โ”Œ โ”€ โ”€ โ”€ โ”€ โ”โ”‚  โ”‚  โ”‚โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”‚
             โ”‚  โ”‚                       โ”‚  โ”‚  โ”‚โ”‚       โ”‚ โ”‚         โ”‚โ”‚
Kafka    โ”€โ”€โ”€โ–บโ”‚  โ”‚โ”‚  index  โ”‚ โ”‚  cache  โ”‚โ”‚  โ”œโ”€โ”€โ–บโ”‚ index โ”‚ โ”‚ parquet โ”‚โ”‚
             โ”‚  โ”‚                       โ”‚  โ”‚  โ”‚โ”‚       โ”‚ โ”‚         โ”‚โ”‚
Events   โ”€โ”€โ”€โ–บโ”‚  โ”‚โ”” โ”€ โ”€ โ”€ โ”€ โ”˜ โ”” โ”€ โ”€ โ”€ โ”€ โ”˜โ”‚  โ”‚  โ”‚โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ”‚
             โ”‚  โ”‚                  nvme โ”‚  โ”‚  โ”‚                     โ”‚
             โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                         

Note: Moonlink is in preview. Expect changes. Join our Community to stay updated!

Why Moonlink?

Traditional ingestion tools write data and metadata files per update into Iceberg. That's fine for slow-changing data, but on real-time streams it causes:

  • Tiny data files โ€” frequent commits create thousands of small Parquet files
  • Metadata explosion โ€” equality-deletes compound this problem

which leads to:

  • Slow read performance โ€” query planning overhead scales with file count
  • Manual maintenance โ€” periodic Spark jobs for compaction/cleanup

Moonlink minimizes write amplification and metadata churn by buffering incoming data, building indexes and caches on NVMe, and committing read-optimized files and deletion vectors to Iceberg.

Inserts are buffered and flushed as size-tuned Parquet

          โ”Œโ”€โ”€โ”€moonlinkโ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€icebergโ”€โ”€โ”€โ”
          โ”‚โ”Œโ”€ โ”€ โ”€ โ”€ โ”€ โ”€ โ”โ”‚  โ”‚โ”Œโ”€ โ”€ โ”€ โ”€ โ”€ โ”€ โ”โ”‚
raw insertโ”‚              โ”‚  โ”‚              โ”‚
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ โ”‚โ”‚   Arrow    โ”‚โ”œโ”€โ–บโ”‚โ”‚   Parquet  โ”‚โ”‚
          โ”‚              โ”‚  โ”‚              โ”‚
          โ”‚โ””โ”€ โ”€ โ”€ โ”€ โ”€ โ”€ โ”˜โ”‚  โ”‚โ””โ”€ โ”€ โ”€ โ”€ โ”€ โ”€ โ”˜โ”‚
          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Deletes are mapped to deletion vectors using an index built on row positions

           โ”Œโ”€โ”€โ”€moonlinkโ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€icebergโ”€โ”€โ”€โ”
           โ”‚โ”Œโ”€ โ”€ โ”€โ”€โ”€ โ”€ โ”€ โ”โ”‚   โ”‚โ”Œโ”€ โ”€ โ”€ โ”€ โ”€ โ”€ โ”โ”‚
raw deletesโ”‚              โ”‚   โ”‚              โ”‚
  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚โ”‚   index    โ”‚โ”œโ”€โ”€โ–บโ”‚โ”‚  deletion  โ”‚โ”‚
           โ”‚              โ”‚   โ”‚   vectors    โ”‚
           โ”‚โ””โ”€ โ”€ โ”€โ”€โ”€ โ”€ โ”€ โ”˜โ”‚   โ”‚โ””โ”€ โ”€ โ”€ โ”€ โ”€ โ”€ โ”˜โ”‚
           โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Write Paths

Moonlink supports multiple input sources for ingest:

  1. PostgreSQL CDC โ€” ingest via logical replication with millisecond-level latency
  2. REST API โ€” simple HTTP endpoint for direct event ingestion
  3. Kafka โ€” sink support coming soon
  4. OTEL โ€” sink support on the roadmap

Read Path

Moonlink commits data as Iceberg v3 tables with deletion vectors. These tables can be queried from any Iceberg-compatible engine.

Engines

  1. DuckDB
  2. Apache Spark
  3. Postgres with pg_duckdb or pg_mooncake

Catalogs

  1. AWS Glue โ€” coming soon
  2. Unity Catalog โ€” coming soon

Real-Time Reads (<s freshness)

For workloads requiring sub-second visibility into new data, Moonlink supports real-time querying:

  1. DuckDB โ€” with the duckb_mooncake extension.
  2. Postgres โ€” with the pg_mooncake extension.
  3. DataFusion โ€“ with Moonlink Datafusion

Quick Start

1. Clone & Build

Clone the repository and build the service binary:

git clone https://github.com/Mooncake-Labs/moonlink.git
cd moonlink
cargo build --release --bin moonlink_service

2. Start the Moonlink Service

Start the Moonlink service, which will store data in the ./data directory:

./target/release/moonlink_service ./data

3. Verify Service Health

Check that the service is running properly:

curl http://localhost:3030/health

4. Create a Table

Create a table with a defined schema. Here's an example creating a users table:

curl -X POST http://localhost:3030/tables/users \
  -H "Content-Type: application/json" \
  -d '{
    "database": "my_database",
    "table": "users",
    "schema": [
      {"name": "id", "data_type": "int32", "nullable": false},
      {"name": "name", "data_type": "string", "nullable": false},
      {"name": "email", "data_type": "string", "nullable": true},
      {"name": "age", "data_type": "int32", "nullable": true},
      {"name": "created_at", "data_type": "date32", "nullable": true}
    ],
    "table_config": {}
  }'

5. Insert Data

Insert data into the created table:

curl -X POST http://localhost:3030/ingest/users \
  -H "Content-Type: application/json" \
  -d '{
    "operation": "insert",
    "request_mode": "async",
    "data": {
      "id": 1,
      "name": "Alice Johnson",
      "email": "alice@example.com",
      "age": 30,
      "created_at": "2024-01-01"
    }
  }'

Roadmap and Contributing

Roadmap (nearโ€‘term):

  1. Kafka sink preview
  2. Schema evolution from Postgres and Kafka
  3. Catalog integrations (AWS Glue, Unity Catalog)
  4. REST API stabilization (Insert, Upsert into Iceberg directly)

Weโ€™re grateful for our contributors. If you'd like to help improve Moonlink, join our community.

๐Ÿฅฎ

About

Simple & Real-Time Ingestion into Apache Iceberg.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages