wdberkeley
Contributor

@wdberkeley wdberkeley commented Jun 5, 2025

This adds Iceberg REST data catalog support for AWS Glue.

There were broadly three obstacles:

  1. Redpanda's Iceberg client did not support SigV4 authentication. The existing SigV4 signing code had to be generalized to AWS services besides S3.
  2. AWS credentials are usually temporary, and the existing credential refresh functionality had to be adapted for use with the datalake catalog clients.
  3. The Glue catalog requires a base location for tables. Other supported catalogs do not: they will choose a location for you.
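
As a rough illustration of the second point, the decision a refresh loop has to make can be sketched as below. The type and function names are hypothetical and are not taken from the Redpanda code; the safety margin is an assumption as well.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical illustration only; not Redpanda's actual refresh code.
using clock_type = std::chrono::system_clock;

struct temporary_credentials {
    clock_type::time_point expires_at;
};

// Returns true once the credentials are within `margin` of expiring, so a
// background loop can fetch new ones before signed requests start failing.
inline bool needs_refresh(
  const temporary_credentials& creds,
  clock_type::time_point now,
  std::chrono::seconds margin) {
    return now + margin >= creds.expires_at;
}
```

A catalog client would then sign each request with whatever credentials the background loop most recently stored.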

Please see the individual commit messages for details.

As the overall change is fairly large, this is best reviewed commit-by-commit. Most commits are small; the credential refresh service commit is the most complex.

This is manually tested to work with AWS Glue, with table data readable via AWS Athena.

Implements CORE-8726.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Features

  • This adds support for using the AWS Glue Data Catalog as an Iceberg REST catalog. To use Glue as an Iceberg REST catalog, configure the standard Iceberg REST catalog and AWS cloud storage configuration and set iceberg_rest_catalog_authentication_mode to aws_sigv4. Additionally, the Glue Data Catalog requires a base location for table storage, configured by iceberg_rest_catalog_base_location.

@wdberkeley wdberkeley requested a review from a team as a code owner June 5, 2025 07:22
@wdberkeley wdberkeley force-pushed the sigv4_glue branch 2 times, most recently from dd30b81 to e7dc9fd Compare June 5, 2025 15:48
@wdberkeley wdberkeley requested review from nvartolomei and andrwng June 5, 2025 18:02
@vbotbuildovich
Collaborator

vbotbuildovich commented Jun 5, 2025

Retry command for Build#66933

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/cluster_restore_test.py::DatalakeClusterRestoreTest.test_basic@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_sticky_default@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_dlq_test.py::DatalakeDLQMultinodeTest.test_dlq_table_with_multiple_nodes@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/recovery_mode_test.py::DatalakeRecoveryModeTest.test_recovery_mode@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeMetricsTest.test_rest_catalog_metrics@{"cloud_storage_type":1}
tests/rptest/tests/datalake/blocked_catalog_test.py::DatalakeBlockedCatalogTest.test_trim_local_after_first_translation@{"cloud_storage_type":1}
tests/rptest/tests/datalake/3rdparty_maintenance_test.py::Datalake3rdPartyMaintenanceTest.test_e2e_basic@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_dlq_test.py::DatalakeDLQTest.test_dlq_table_for_invalid_records@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_remove_expired_snapshots@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/iceberg_toggling_test.py::IcebergTogglingTest.test_iceberg_toggling@{"cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_avro_schema@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_iceberg_partition_key_file_location@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"custom_partition_spec":"(number)"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_latest_protobuf_schema@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_no_collision@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_no_collision@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_select_fails@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"drop column that appears in partition spec"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"drop column that appears in partition spec"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino"}
tests/rptest/tests/datalake/cluster_restore_test.py::DatalakeClusterRestoreTest.test_slow_tiered_storage_dlq@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_basic@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_spec_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/simple_connect_test.py::RedpandaConnectIcebergTest.test_translating_avro_serialized_records@{"cloud_storage_type":1}
tests/rptest/tests/datalake/blocked_catalog_test.py::DatalakeBlockedCatalogTest.test_block_cloud_retention_before_translation@{"cloud_storage_type":1,"with_spillover":false}
tests/rptest/tests/datalake/datalake_dlq_test.py::DatalakeDLQTest.test_dlq_table_for_mixed_records@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_e2e_basic@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_upload_after_external_update@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_verifier_test.py::DatalakeVerifierTest.test_detecting_duplicates@{"cloud_storage_type":1}
tests/rptest/tests/datalake/throttling_test.py::DatalakeThrottlingTest.test_backlog_metric@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}
tests/rptest/tests/polaris_catalog_smoke_test.py::PolarisCatalogSmokeTest.test_connecting_to_catalog@{"cloud_storage_type":1,"with_tls":"from_file"}
tests/rptest/tests/datalake/delayed_translation_test.py::DatalakeDelayedTranslationTest.test_basic@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_iceberg_partition_key_file_location@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"custom_partition_spec":null}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_no_collision@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark"}
tests/rptest/tests/datalake/iceberg_toggling_test.py::IcebergTogglingTest.test_concurrent_mode_toggling@{"cloud_storage_type":1}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_select_fails@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_select_fails@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"drop column that appears in partition spec"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"drop column that appears in partition spec"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark"}
tests/rptest/tests/datalake/schema_scale_test.py::SchemaScaleTest.schema_scale_test@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"spark","use_partition_spec":false}
tests/rptest/tests/datalake/cluster_restore_test.py::DatalakeClusterRestoreTest.test_restore_partition_spec@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_many_partitions@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeMetricsTest.test_lag_metrics@{"cloud_storage_type":1}
tests/rptest/tests/datalake/blocked_catalog_test.py::DatalakeBlockedCatalogTest.test_block_cloud_retention_before_translation@{"cloud_storage_type":1,"with_spillover":true}
tests/rptest/tests/datalake/datalake_dlq_test.py::DatalakeDLQTest.test_dlq_table_for_mixed_records@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_e2e_basic@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_topic_lifecycle@{"catalog_type":"rest_jdbc","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_verifier_test.py::DatalakeVerifierTest.test_detecting_gap_in_offset_sequence@{"cloud_storage_type":1}
tests/rptest/tests/datalake/transactions_test.py::DatalakeTransactionTests.test_with_transactions@{"cloud_storage_type":1,"compaction":false}
tests/rptest/tests/polaris_catalog_smoke_test.py::PolarisCatalogSmokeTest.test_connecting_to_catalog@{"cloud_storage_type":1,"with_tls":"none"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_no_collision@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark"}
tests/rptest/tests/datalake/iceberg_toggling_test.py::IcebergTogglingTest.test_iceberg_mode_toggling@{"cloud_storage_type":1}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_no_collision@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_dropped_column_select_fails@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"drop column that appears in partition spec"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"rest_jdbc","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino"}

@vbotbuildovich
Collaborator

vbotbuildovich commented Jun 6, 2025

CI test results

test results on build#67011
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MaintenanceTest | test_maintenance_sticky | {"use_rpk": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67011#019746d1-994e-49ea-aca5-a4531e694ada | FLAKY | 19/21 | upstream reliability is '96.94915254237289'. current run reliability is '90.47619047619048'. drift is 6.47296 and the allowed drift is set to 50. The test should PASS |
| SimpleEndToEndTest | test_relaxed_acks | {"write_caching": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67011#019746d2-5372-4239-8c5e-4dbad2de16d0 | FLAKY | 20/21 | upstream reliability is '99.40828402366864'. current run reliability is '95.23809523809523'. drift is 4.17019 and the allowed drift is set to 50. The test should PASS |
| RecreateTopicMetadataTest | test_recreated_topic_metadata_are_valid | {"replication_factor": 3} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67011#019746d2-5371-4ee7-9eda-dd356fcebe94 | FLAKY | 19/21 | upstream reliability is '96.98189134808854'. current run reliability is '90.47619047619048'. drift is 6.5057 and the allowed drift is set to 50. The test should PASS |
| WriteCachingFailureInjectionTest | test_unavoidable_data_loss | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67011#019746d2-5372-4239-8c5e-4dbad2de16d0 | FLAKY | 20/21 | upstream reliability is '99.24242424242425'. current run reliability is '95.23809523809523'. drift is 4.00433 and the allowed drift is set to 50. The test should PASS |
test results on build#67070
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ArchivalTest | test_single_partition_leadership_transfer | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67070#0197575d-8a4b-40c9-8f24-ef1e077aa915 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |
| CloudStorageTimingStressTest | test_cloud_storage | {"cleanup_policy": "compact,delete"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67070#0197575d-8a4b-40c9-8f24-ef1e077aa915 | FLAKY | 20/21 | upstream reliability is '96.0431654676259'. current run reliability is '95.23809523809523'. drift is 0.80507 and the allowed drift is set to 50. The test should PASS |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [2, "virtual_host"], "test_case": {"name": "(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67070#0197576e-0fd5-4ce5-b39c-9f1991bab334 | FAIL | 0/1 | The test has failed across all retries |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [1, "virtual_host"], "test_case": {"name": "(TS_Read == True, TS_TxRangeMaterialized == True)"}} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67070#0197576e-0fd4-4a7e-9b9f-808a0f0b336e | FAIL | 0/1 | |
| RecreateTopicMetadataTest | test_recreated_topic_metadata_are_valid | {"replication_factor": 3} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67070#0197576e-0fd5-4ce5-b39c-9f1991bab334 | FLAKY | 17/21 | upstream reliability is '98.7878787878788'. current run reliability is '80.95238095238095'. drift is 17.8355 and the allowed drift is set to 50. The test should PASS |
test results on build#68191
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EndToEndCloudTopicsTest | test_write | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68191#0197c1dd-5c75-4d4c-92b1-faa8279c3879 | FLAKY | 17/21 | upstream reliability is '92.06349206349206'. current run reliability is '80.95238095238095'. drift is 11.11111 and the allowed drift is set to 50. The test should PASS |
| ClusterConfigTest | test_valid_settings | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68191#0197c1da-cdfd-4b51-b6a2-b9d55dfb9480 | FLAKY | 15/21 | upstream reliability is '100.0'. current run reliability is '71.42857142857143'. drift is 28.57143 and the allowed drift is set to 50. The test should PASS |
| ConsumerOffsetsRecoveryTest | test_consumer_offsets_partition_recovery | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68191#0197c1dd-5c74-4800-94db-6b8af41ab062 | FLAKY | 20/21 | upstream reliability is '98.78048780487805'. current run reliability is '95.23809523809523'. drift is 3.54239 and the allowed drift is set to 50. The test should PASS |
| DisablingPartitionsTest | test_disable | | ducktape | https://buildkite.com/redpanda/redpanda/builds/68191#0197c1da-cdfb-4ef2-ab31-0bb465a7e038 | FLAKY | 17/21 | upstream reliability is '92.1965317919075'. current run reliability is '80.95238095238095'. drift is 11.24415 and the allowed drift is set to 50. The test should PASS |

Comment on lines 80 to 81
request_builder& with_apply_credentials(
ss::lw_shared_ptr<cloud_roles::apply_credentials> apply_creds);
Contributor

Why does our http client have a bunch of aws specific logic in it? Is there a reason this can't go in a layer above this one or some wrapper around our client?

Contributor Author

The reason was so the builder could finalize the http request by doing the signing, but you're right that it's lousy to add this code here. I'll change it.

Member

@andrwng had taken steps to split out the http client during the iceberg project, but i suspect that a full separation of concerns would have been too much to make any progress on the separation. it would be great to fully separate things.
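
One possible shape for the layering discussed in this thread is sketched below: the transport stays free of authentication logic, and a thin wrapper finalizes each request by signing it. All type names here are hypothetical and the "signature" in the usage is a placeholder string; this is not the structure the PR ended up with.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical types for illustration; none of these are Redpanda classes.
struct http_request {
    std::string target;
    std::map<std::string, std::string> headers;
};

// Any authentication scheme (SigV4, OAuth bearer token, ...) can be modeled
// as a function that mutates the request just before it is sent.
using request_signer = std::function<void(http_request&)>;

// The transport knows nothing about authentication; it returns a status code.
using transport = std::function<int(const http_request&)>;

class signing_client {
public:
    signing_client(transport send, request_signer sign)
      : _send(std::move(send))
      , _sign(std::move(sign)) {}

    int send(http_request req) const {
        _sign(req); // signing happens in the wrapper, not in the request builder
        return _send(req);
    }

private:
    transport _send;
    request_signer _sign;
};
```

With this split, the HTTP layer needs no dependency on cloud_roles; only the code that constructs the signer does.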

Contributor

@nvartolomei nvartolomei left a comment

This is too tangled because we assumed that the only cloud service we'll use is s3 and all our configuration is tied to s3...

It would be cool to break the tight dependency between auth_refresh_bg_op and s3. We can probably gloss over this in the short term if there is urgency in shipping it. @andrwng ok with you?

The cloud_roles dependency added to the http client, which Tyler raised, must still be addressed imho.

@@ -83,6 +83,7 @@ struct gcp_credentials {

std::ostream& operator<<(std::ostream& os, const gcp_credentials& gc);

using aws_service_name = named_type<ss::sstring, struct _aws_service_name>;
Contributor

nit: Any reason this is prefixed with an underscore? That's unusual for the Redpanda codebase.

Do you want to propose a new convention? It is probably better done outside of this PR.

Contributor Author

That was a goof that got changed later. I will move the fix so the leading underscore is not present in any commit.

@@ -142,14 +145,15 @@ SEASTAR_THREAD_TEST_CASE(test_signature_computation_4) {
cloud_roles::private_key_str secret_key(
"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY");
cloud_roles::aws_region_name region("us-east-1");
cloud_roles::aws_service_name service("s3");
Contributor

@nvartolomei nvartolomei Jun 9, 2025

  • Add a test that covers a non-"s3" service name,
    for some assurance that the implementation is not hard-coded.

{.needs_restart = needs_restart::yes, .visibility = visibility::user},
"redpanda-iceberg-catalog")
, iceberg_rest_catalog_base_location(
Contributor

How does this work? I'd expect this to be per-topic 🤔. Can you help me understand the following please:

If you set this to /foo/bar and you have a topic named baz, what is the s3 layout after data and metadata have been written?

Contributor Author

The base location is the root at which the catalog will manage the data and metadata for Iceberg tables. So, for a topic test6 and a base location of s3://wdberkeley-test, the layout looks like

s3://wdberkeley-test   # base location
    /redpanda                 # namespace
        /test6                    # table name (named after the topic)
            /data
            /metadata

For all other catalogs we have supported thus far, the catalog will pick a location for the table. The client doesn't have to choose. However, Glue rejects create table requests as malformed if they do not include a (non-empty) location.
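
The layout above amounts to joining the base location, the namespace, and the table name. A minimal sketch of that composition follows; the helper names are hypothetical, and trimming trailing slashes from the base location is an assumption about how a client might normalize it, not something this PR is known to do.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: joins two path components with exactly one '/'.
inline std::string join_location(std::string_view base, std::string_view child) {
    // Drop trailing slashes so "s3://bucket/" and "s3://bucket" behave alike.
    while (!base.empty() && base.back() == '/') {
        base.remove_suffix(1);
    }
    std::string out(base);
    out += '/';
    out.append(child.data(), child.size());
    return out;
}

// Location sent in the create table request: <base>/<namespace>/<table>.
inline std::string table_location(
  std::string_view base_location,
  std::string_view ns,
  std::string_view table) {
    return join_location(join_location(base_location, ns), table);
}
```

For the example in this comment, table_location("s3://wdberkeley-test", "redpanda", "test6") yields s3://wdberkeley-test/redpanda/test6, under which the data and metadata directories are written.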


namespace datalake {

credential_manager::credential_manager() = default;
Contributor

This is too tied to glue and aws to be called generically credential_manager. I can foresee a case where we might want to extend it, so it is probably fine. But then, can we extract the aws glue specific logic into a function at least (i.e. in an anonymous namespace)?

Contributor Author

The idea is that this can be extended, but its main purpose at the moment is as a home for the background refresh op that keeps AWS credentials up to date. For example, I think we could refactor the catalog client so OAuth credentials are refreshed by the manager.

It's definitely too Glue-specific at the moment. We also have the problem that aws_sigv4 authentication implies the catalog is AWS Glue. Untangling that would mean we need to infer or provide configuration for the AWS service being used with aws_sigv4, as the service name is part of sigv4's signing key. It's abstractly the right thing to do but I'm not sure if there's another option in practice.

I'm interested to hear what you and @andrwng think about configuring the AWS service. In the meantime I'll refactor the manager a bit so it's more obvious how it would generalize to other credentials.

Contributor

I don't mind having something like this to encapsulate these details, but I think the public interface could be refined a bit, as it feels like we're leaking some sigv4-specific details throughout usages of the catalog clients.

@nvartolomei
Contributor

nvartolomei commented Jun 9, 2025 via email

@kbatuigas kbatuigas self-requested a review June 9, 2025 15:58
@vbotbuildovich
Collaborator

Retry command for Build#67070

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[2,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}

@wdberkeley
Contributor Author

/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[2,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}

Contributor

@andrwng andrwng left a comment

Nice! I think this generally makes sense, but I've suggested one restructuring that I hope would make it feel a bit more extensible and less invasive.

To Nicolae's point, it does read as a little odd to have the credential_manager use the auth_refresh_bg_op, which is used by all cloud types, when really we only care about sigv4 and AWS. I think it would read more cohesively if the initialization logic moved from the catalog factory into the credential manager.

// TODO: Implement OAuth2 auth refresh via the bg op.
return std::nullopt;
case config::datalake_catalog_auth_mode::aws_sigv4:
// TODO: Perhaps this setting shouldn't entail using AWS glue?
Contributor

I guess if this ever becomes a problem, it isn't too hard to add another cluster property, some iceberg_rest_catalog_sigv4_service_name (at least, there's precedent for these auth-type-specific configs, like the token for bearer or key+secret for auth).

One example is that s3tables offers a different REST endpoint that is separate from Glue, but also uses sigv4. If/when we support that, I imagine this code is all reusable except for the service name?
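To make the role of the service name concrete: in SigV4 the service appears in the credential scope, which has the form `<YYYYMMDD>/<region>/<service>/aws4_request`. The sketch below only assembles that string; the function name `credential_scope` is an assumption for illustration and is not the PR's API.

```cpp
#include <string>
#include <string_view>

// Builds the SigV4 credential scope:
//   <YYYYMMDD>/<region>/<service>/aws4_request
// The function name is hypothetical; only the scope format is standard.
std::string credential_scope(
  std::string_view date_yyyymmdd,
  std::string_view region,
  std::string_view service) {
    std::string scope;
    scope.append(date_yyyymmdd).append("/");
    scope.append(region).append("/");
    scope.append(service).append("/aws4_request");
    return scope;
}
```

Signing the same request for `glue` and for `s3` therefore produces different scopes (and different signatures), which is why the service has to be known, inferred, or configured.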

Comment on lines +287 to +279
case config::datalake_catalog_auth_mode::aws_sigv4: {
// AWS SigV4 signing will be handled after build() is called
Contributor

Maybe a dumb question, but I'm curious if it's possible or reasonable for a service to require both OAuth and sigv4 signing?

Comment on lines 35 to 37
// Get the current credentials applier, which may be updated asynchronously
// by the background refresh operation
ss::lw_shared_ptr<cloud_roles::apply_credentials> get_credentials() const;
Contributor

WDYT about having the credential manager sign requests/payloads and having its API be something like:

ss::future<result<...>> maybe_add_token(request&);
ss::future<result<...>> maybe_sign(const std::optional<iobuf>&, request&);

...which would transparently do any background credentials fetching (i.e. the wait_for_credentials_if_needed()) when required. As is, we're exposing a bunch of credentials details to the catalog factory (e.g. calling wait_for_credentials_if_needed(), and requiring the application to call get_credentials()). It feels like if we plumb the credential_manager through to the catalog_client and clients, it'd be a bit less leaky.
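A simplified, synchronous sketch of the interface shape suggested above, assuming a stand-in `request` type and a caller-supplied signing callback. The real code is asynchronous (`ss::future`) and its types differ; everything here other than the method names `maybe_add_token` and `maybe_sign` is hypothetical.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Minimal stand-in for an HTTP request; the real client type differs.
struct request {
    std::map<std::string, std::string> headers;
};

// Hypothetical synchronous sketch. Callers never see credentials directly;
// they hand over the request and the manager decides what to do.
class credential_manager {
public:
    // Callback that signs a request given its optional payload.
    using signer
      = std::function<void(request&, const std::optional<std::string>&)>;

    credential_manager(std::optional<std::string> token, signer sign)
      : _token(std::move(token))
      , _sign(std::move(sign)) {}

    // Adds a bearer token if one is configured; otherwise does nothing.
    void maybe_add_token(request& req) const {
        if (_token) {
            req.headers["Authorization"] = "Bearer " + *_token;
        }
    }

    // Signs the request if a signer is configured; otherwise does nothing.
    void maybe_sign(
      const std::optional<std::string>& payload, request& req) const {
        if (_sign) {
            _sign(req, payload);
        }
    }

private:
    std::optional<std::string> _token;
    signer _sign;
};
```

The point of this shape is that a catalog client can call both methods unconditionally, and only the manager knows which authentication mode is active.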



// The background refresh op will propagate credentials.
// Static credentials will be available immediately.
while (!apply_credentials_) {
Contributor

This needs an abort source. Or see my other suggestion about moving this into the credential manager -- if you go through with that, then there's a nice natural abort source to use
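For illustration, the difference between an unbounded wait and an abortable one can be sketched with the standard library alone. This is not Seastar code: the atomics stand in for the credentials pointer and for an abort source, and the function name and polling interval are assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Polls until credentials are available or an abort is requested.
// Returns true if credentials became available, false if aborted.
bool wait_for_credentials(
  const std::atomic<bool>& credentials_ready,
  const std::atomic<bool>& abort_requested,
  std::chrono::milliseconds poll_interval = std::chrono::milliseconds(10)) {
    while (!credentials_ready.load()) {
        // Without this check, shutdown would hang if credentials never arrive.
        if (abort_requested.load()) {
            return false;
        }
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}
```

The caller is then expected to treat a `false` return as a shutdown signal rather than as a credentials failure.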

@wdberkeley wdberkeley force-pushed the sigv4_glue branch 2 times, most recently from 2f2d192 to 5badd6b Compare June 26, 2025 16:20
andrwng previously approved these changes Jun 26, 2025
Contributor

@andrwng andrwng left a comment

Just a couple of comments, but LGTM. Happy to merge this and do the others as follow-ups, assuming no other blocking feedback.

}
}

ss::future<result<std::monostate>> credential_manager::maybe_sign(
Contributor

btw result<void> is also valid; don't have a preference

@wdberkeley
Contributor Author

Force-push to fix final review feedback (all changes applied to second-to-last commit).

andrwng previously approved these changes Jun 27, 2025

AWS Signature V4 incorporates the AWS service as part of the
string to sign. Until now the service has always been S3, but
to authenticate requests to Glue our sigV4 implementation must
support multiple services.

AWS Signature V4 requires the service as part of the string to sign.
This parameterizes the applier by service.

The only service using the applier is S3. Future changes will
use the generalized credentials to authenticate requests to Glue.

Credentials for signing requests to Glue will also need to be
periodically refreshed. This change updates the credentials refresh
machinery so later it can be used with Glue.

It's a bit ad hoc, but all of this code is a bit ad hoc because of how
it's trying to handle multiple cloud providers and mechanisms with a
uniform abstraction.

This changes sigV4 request signing so that it can use a payload hash
instead of a default hash that depends on the verb. This is required to
sign requests to Glue, which does not support UNSIGNED-PAYLOAD requests
(empirically).

This change adds SigV4 authentication as an authentication mode for the
Iceberg REST catalog clients. There are two wrinkles:

1. The request must be signed using a string to sign that depends on
   elements of the request headers and the request payload (if the
   request has one).
2. SigV4 signing credentials are usually temporary and need to be
   refreshed.

The first issue means auth is added just before the request is sent,
when it should be otherwise complete. Bearer auth, by contrast, may be
added to the request earlier.
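The first wrinkle can be seen in the SigV4 canonical request, which folds the method, URI, query string, headers, and payload hash into the text that is ultimately signed. The sketch below assembles only that canonical request, assuming header values are already trimmed, the query string is already canonicalized, and the caller supplies the hex SHA-256 of the body. The function names are hypothetical and this is not the PR's implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Lowercases an ASCII header name.
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Assembles a SigV4 canonical request from already-prepared parts.
std::string canonical_request(
  const std::string& method,
  const std::string& uri,
  const std::string& query,
  const std::map<std::string, std::string>& headers,
  const std::string& payload_sha256_hex) {
    // Re-key by lowercase name; std::map keeps the names sorted.
    std::map<std::string, std::string> sorted;
    for (const auto& [name, value] : headers) {
        sorted[to_lower(name)] = value;
    }
    std::string canonical_headers;
    std::string signed_headers;
    for (const auto& [name, value] : sorted) {
        canonical_headers += name + ":" + value + "\n";
        if (!signed_headers.empty()) {
            signed_headers += ";";
        }
        signed_headers += name;
    }
    return method + "\n" + uri + "\n" + query + "\n" + canonical_headers
           + "\n" + signed_headers + "\n" + payload_sha256_hex;
}
```

Because any later change to a signed header or to the body changes this string, the signature has to be computed after the request is otherwise final.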

The second issue is thornier. This change solves it by setting up a
background refresh op like the one used by cloud storage clients. If
credentials are static, they will be set up once; otherwise a background
process will refresh them when they are almost expired.
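The "refresh when almost expired" decision can be sketched as a simple predicate. The function name, the use of `std::chrono::system_clock`, and the margin parameter are assumptions for illustration; the existing refresh op in the codebase has its own timing logic.

```cpp
#include <chrono>
#include <optional>

using clock_type = std::chrono::system_clock;

// Returns true when credentials should be refreshed. Static credentials have
// no expiry (std::nullopt) and never need a refresh.
bool needs_refresh(
  std::optional<clock_type::time_point> expiry,
  clock_type::time_point now,
  std::chrono::seconds margin) {
    if (!expiry) {
        return false;
    }
    // Refresh once we are within `margin` of the expiry time.
    return now + margin >= *expiry;
}
```

With a 60 second margin, credentials expiring in 30 seconds would be refreshed, while credentials expiring in an hour would not.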

Because there are multiple catalogs created in multiple places, this
adds a new service that manages the credentials and refreshes them in
the background, like the s3 client pool does for cloud io.

This change adds configuration properties parallel to cloud storage AWS
credentials, and allows the AWS service to be configured, just in case a
user wants to set up different credentials, or use another AWS service
that provides an Iceberg rest catalog (I don't think there are any such
services right now, though).
Glue does not pick a base location for you. You must specify it when
creating tables. This adds a configuration property to support Glue's
behavior. Most REST catalog implementations do not need this property to
be set.

@wdberkeley
Contributor Author

Force-push to fix test that was missing new configs from its list.

Comment on lines +4068 to +4122
, iceberg_rest_catalog_aws_access_key(
*this,
"iceberg_rest_catalog_aws_access_key",
"AWS access key for Iceberg REST catalog SigV4 authentication. If not "
"set, falls back to cloud_storage_access_key when using aws_sigv4 "
"authentication mode.",
{.needs_restart = needs_restart::yes,
.visibility = visibility::user,
.gets_restored = gets_restored::no},
std::nullopt,
&validate_non_empty_string_opt)
, iceberg_rest_catalog_aws_secret_key(
*this,
"iceberg_rest_catalog_aws_secret_key",
"AWS secret key for Iceberg REST catalog SigV4 authentication. If not "
"set, falls back to cloud_storage_secret_key when using aws_sigv4 "
"authentication mode.",
{.needs_restart = needs_restart::yes,
.visibility = visibility::user,
.secret = is_secret::yes,
.gets_restored = gets_restored::no},
std::nullopt,
&validate_non_empty_string_opt)
, iceberg_rest_catalog_aws_region(
*this,
"iceberg_rest_catalog_aws_region",
"AWS region for Iceberg REST catalog SigV4 authentication. If not set, "
"falls back to cloud_storage_region when using aws_sigv4 authentication "
"mode.",
{.needs_restart = needs_restart::yes,
.visibility = visibility::user,
.gets_restored = gets_restored::no},
std::nullopt,
&validate_non_empty_string_opt)
, iceberg_rest_catalog_aws_credentials_source(
*this,
"iceberg_rest_catalog_aws_credentials_source",
"Source of AWS credentials for Iceberg REST catalog SigV4 "
"authentication. "
"If not set, falls back to cloud_storage_credentials_source when using "
"aws_sigv4 authentication mode. Accepted values: config_file, "
"aws_instance_metadata, sts, gcp_instance_metadata, "
"azure_vm_instance_metadata, azure_aks_oidc_federation.",
{.needs_restart = needs_restart::yes,
.visibility = visibility::user,
.gets_restored = gets_restored::no},
std::nullopt,
{
model::cloud_credentials_source::config_file,
model::cloud_credentials_source::aws_instance_metadata,
model::cloud_credentials_source::sts,
model::cloud_credentials_source::gcp_instance_metadata,
model::cloud_credentials_source::azure_aks_oidc_federation,
model::cloud_credentials_source::azure_vm_instance_metadata,
})
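The fallback described in these property descriptions ("if not set, falls back to cloud_storage_...") amounts to preferring the catalog-specific value when present. A minimal sketch, assuming optional string settings; the helper name `effective_setting` is hypothetical and is not how the PR resolves the values.

```cpp
#include <optional>
#include <string>

// Returns the catalog-specific value when set, else the cloud storage value.
std::optional<std::string> effective_setting(
  const std::optional<std::string>& catalog_value,
  const std::optional<std::string>& cloud_storage_value) {
    return catalog_value.has_value() ? catalog_value : cloud_storage_value;
}
```

For example, with `iceberg_rest_catalog_aws_region` unset and `cloud_storage_region` set to `us-east-1`, the effective region would be `us-east-1`.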
Contributor

Can we mark these as gets_restored::yes (the default), and revert the changes to the ducktape test? The reason the original ones are marked as not getting restored is because they need to be set in the first place in order for a cluster restore to be run, which isn't the case here

Contributor

@andrwng andrwng left a comment

Happy to get the config change as a follow-up so we can get this merged

@wdberkeley wdberkeley merged commit c275225 into redpanda-data:dev Jun 30, 2025
18 checks passed
@tmgstevens

@piyushredpanda will this be backported to 25.1 (please!)?

@piyushredpanda
Contributor

> @piyushredpanda will this be backported to 25.1 (please!)?

Let me chat with the team and get back. It's a large, intricate PR.

@lf-rep
Contributor

lf-rep commented Jul 1, 2025

@tmgstevens The engineer in charge of this, Will Berkeley, has successfully backported it and is running a test sequence on the resulting build as a sanity check. The answer here is yes: backporting is OK.
