Successful import-url / push resulted in truncated cache file #10767

@ryan-williams

Bug Report

Description

Here is a GitHub Action that ran:

dvc import-url s3://tripdata/202505-citibike-tripdata.zip s3/tripdata/202505-citibike-tripdata.zip
# Importing 's3://tripdata/202505-citibike-tripdata.zip' -> 's3/tripdata/202505-citibike-tripdata.zip'
dvc import-url s3://tripdata/JC-202505-citibike-tripdata.csv.zip s3/tripdata/JC-202505-citibike-tripdata.csv.zip
# Importing 's3://tripdata/JC-202505-citibike-tripdata.csv.zip' -> 's3/tripdata/JC-202505-citibike-tripdata.csv.zip'
dvc push
# 2 files pushed

However, the first imported file (s3/tripdata/202505-citibike-tripdata.zip) ended up truncated in my S3 remote cache.

I backed up the truncated blob with a .bad suffix, and then manually fixed the blob in the cache (with aws s3 cp, dvc add, dvc push):

aws s3 ls s3://ctbk/.dvc/files/md5/9e/880ca091cc946d563ea4b115ec443e
# 2025-06-06 19:44:58  844607858 880ca091cc946d563ea4b115ec443e
# 2025-06-06 19:39:50  838860800 880ca091cc946d563ea4b115ec443e.bad
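As a quick sanity check on the two sizes in the listing above, the arithmetic below shows how many bytes the .bad blob is missing and whether it stops on a MiB boundary. The variable names are illustrative only. That the truncated size is a whole number of MiB is just an observation from these numbers; on its own it does not identify which step cut the file short.

```shell
good_size=844607858   # size of the repaired blob, from the listing above
bad_size=838860800    # size of the truncated .bad blob, from the listing above

mib=$((1024 * 1024))
missing=$((good_size - bad_size))

echo "missing bytes: $missing"
# prints "missing bytes: 5747058"
echo "bad size: $((bad_size / mib)) MiB, remainder $((bad_size % mib)) bytes"
# prints "bad size: 800 MiB, remainder 0 bytes"
```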

Verifying that 9e/880ca091cc946d563ea4b115ec443e.bad is a prefix of the full blob:

aws s3 cp s3://ctbk/.dvc/files/md5/9e/880ca091cc946d563ea4b115ec443e.bad - | md5sum
# ef7b7328a690dfdc9858c2da4cad9f41  -
bad_size="$(aws s3 ls s3://ctbk/.dvc/files/md5/9e/880ca091cc946d563ea4b115ec443e.bad | awk '{print $3}')"; echo "$bad_size"
# 838860800
aws s3 cp s3://ctbk/.dvc/files/md5/9e/880ca091cc946d563ea4b115ec443e - 2>/dev/null | head -c "$bad_size" | md5sum
# ef7b7328a690dfdc9858c2da4cad9f41  -
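The same prefix check can be done without hashing, by comparing bytes directly once both blobs are downloaded locally. This is a minimal sketch: `is_prefix` is a hypothetical helper name, and it assumes a `head` that supports `-c` (as the commands above already do).

```shell
# is_prefix SMALL BIG: succeed only if SMALL is byte-for-byte the start of BIG.
is_prefix() {
  small=$1
  big=$2
  small_size=$(wc -c < "$small" | tr -d ' ')
  big_size=$(wc -c < "$big" | tr -d ' ')
  # A prefix can never be longer than the file it is a prefix of.
  [ "$small_size" -le "$big_size" ] || return 1
  # Compare the first small_size bytes of BIG against SMALL.
  head -c "$small_size" "$big" | cmp -s - "$small"
}
```

For example, after copying both objects down with aws s3 cp, `is_prefix blob.bad blob && echo "truncated copy"` would confirm the result above (the local file names here are placeholders).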

Reproduce

I'm guessing it was a transient issue in my GHA run. I haven't tried to reproduce it.

I'm not sure which one failed here:

  • import-url may have failed, and dvc push then happily pushed the truncated blob.
  • Or import-url may have been fine, but push silently failed to complete.

Expected

If import-url or push fails to import or push a full file, the command should exit non-zero and log an error.
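Until the commands fail loudly themselves, a workflow could add its own guard between the import and push steps. The sketch below is a workaround idea, not DVC behavior: `verify_md5` is a hypothetical helper, and it assumes the cache key (the 9e/880c... path joined together) is the MD5 of the file contents, as the files/md5 path suggests. Running it after import-url would also help tell the two failure cases apart: a mismatch there would point at the import, a match would point at the push.

```shell
# verify_md5 FILE EXPECTED_MD5: return non-zero and log to stderr on mismatch.
verify_md5() {
  file=$1
  expected=$2
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$actual" != "$expected" ]; then
    echo "ERROR: $file has md5 $actual, expected $expected" >&2
    return 1
  fi
}

# Hypothetical usage in the workflow, before dvc push:
#   verify_md5 s3/tripdata/202505-citibike-tripdata.zip \
#     9e880ca091cc946d563ea4b115ec443e || exit 1
```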

Environment information

You can see everything in the GHA:

Metadata

Labels: fs: s3 (Related to the S3 filesystem)
Status: Done