Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames #7517

@giraffacarp

Describe the bug

When using IterableDataset.from_spark() with a Spark DataFrame whose binary column holds image data, the values reach the Image feature as bytearray objects. The Image feature class does not handle this type, and iterating the dataset raises AttributeError: 'bytearray' object has no attribute 'get'.
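The error message itself can be reproduced in plain Python, without Spark or datasets. The sketch below only illustrates why a bytearray triggers this message when code treats it like a dict; the `raw` value is a made-up stand-in for one cell of a Spark binary column, and where exactly the library makes the `.get` call is an assumption, not something this report confirms.

```python
# Plain-Python illustration; no Spark or datasets required.
# `raw` is a hypothetical stand-in for one cell of a Spark binary column.
raw = bytearray(b"\x89PNG")

# bytearray is a separate type from bytes, so an isinstance(value, bytes)
# check does not match it.
is_bytes = isinstance(raw, bytes)

# It also has no dict-style .get method, so treating it as a dict fails.
has_get = hasattr(raw, "get")

try:
    raw.get("path")
    message = None
except AttributeError as err:
    message = str(err)

print(is_bytes, has_get)
print(message)
```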

Steps to reproduce the bug

  1. Create a Spark DataFrame with a column containing image data as bytearray objects
  2. Define a Features schema with an Image feature
  3. Create an IterableDataset using IterableDataset.from_spark()
  4. Attempt to iterate through the dataset

from pyspark.sql import SparkSession
from datasets import Dataset, IterableDataset, Features, Image, Value

# initialize spark
spark = SparkSession.builder.appName("MinimalRepro").getOrCreate()

# create spark dataframe
data = [(0, open("image.png", "rb").read())]
df = spark.createDataFrame(data, "idx: int, image: binary")

# convert to dataset
features = Features({"idx": Value("int64"), "image": Image()})
ds = Dataset.from_spark(df, features=features)
ds_iter = IterableDataset.from_spark(df, features=features)

# iterate
print(next(iter(ds)))       # Dataset: works
print(next(iter(ds_iter)))  # IterableDataset: raises AttributeError

Expected behavior

The Image feature should work with IterableDataset.from_spark() the same way it works with Dataset.from_spark().
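Until the two code paths behave the same, one possible workaround is to convert bytearray values to bytes before they reach the Image feature. This is a minimal sketch under the assumption that bytes input is accepted where bytearray is not; the helpers `to_bytes` and `normalize_row` are hypothetical names introduced here, not part of datasets or PySpark, and where to hook them into a pipeline is left open.

```python
from typing import Any, Dict


def to_bytes(value: Any) -> Any:
    """Hypothetical helper: turn bytearray/memoryview into immutable bytes.

    Any other value is returned unchanged.
    """
    if isinstance(value, (bytearray, memoryview)):
        return bytes(value)
    return value


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: apply to_bytes to every column of one row dict."""
    return {key: to_bytes(value) for key, value in row.items()}


# Made-up row shaped like the reproduction above (idx + binary image column).
row = {"idx": 0, "image": bytearray(b"\x89PNG\r\n")}
fixed = normalize_row(row)
print(type(fixed["image"]).__name__)
```

The conversion copies the buffer once per value, which is usually negligible next to image decoding.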

Environment info

  • datasets version: 3.5.0
  • Platform: macOS-15.3.2-arm64-arm-64bit
  • Python version: 3.12.7
  • huggingface_hub version: 0.30.2
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0
