Closed
Description
Describe the bug
When using IterableDataset.from_spark() with a Spark DataFrame containing image data, the Image feature class fails to properly process this data type, causing an AttributeError: 'bytearray' object has no attribute 'get'
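For readers unfamiliar with the message, here is a minimal sketch of what it means, using only plain Python and no datasets or Spark code. It assumes that somewhere in the decoding path the image value is treated like a dict; the exact call site inside the library is not identified in this report, and the name `raw` is only illustrative.

```python
# Stand-in for the raw value a Spark binary column yields for one row
# (an assumption for illustration; no Spark session is involved here).
raw = bytearray(b"\x89PNG\r\n\x1a\n")

try:
    # Dict-style access on a value that is not a dict.
    raw.get("bytes")
    message = None
except AttributeError as err:
    message = str(err)

print(message)
```

The same call on a dict such as {"bytes": ..., "path": None} would succeed, which suggests the value reaches the Image feature without being wrapped first.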
Steps to reproduce the bug
- Create a Spark DataFrame with a column containing image data as bytearray objects
- Define a Feature schema with an Image feature
- Create an IterableDataset using IterableDataset.from_spark()
- Attempt to iterate through the dataset
from pyspark.sql import SparkSession
from datasets import Dataset, IterableDataset, Features, Image, Value
# initialize spark
spark = SparkSession.builder.appName("MinimalRepro").getOrCreate()
# create spark dataframe
data = [(0, open("image.png", "rb").read())]
df = spark.createDataFrame(data, "idx: int, image: binary")
# convert to dataset
features = Features({"idx": Value("int64"), "image": Image()})
ds = Dataset.from_spark(df, features=features)
ds_iter = IterableDataset.from_spark(df, features=features)
# iterate: the Dataset works as expected, the IterableDataset raises the AttributeError
print(next(iter(ds)))
print(next(iter(ds_iter)))
Expected behavior
The features should work on IterableDataset the same way they work on Dataset
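As a possible stopgap, the raw bytes could be wrapped into the dict layout that the Image feature stores (a "bytes" field and a "path" field). The sketch below is plain Python; `wrap_image_bytes` is a hypothetical helper, not part of datasets, and it has not been verified against datasets 3.5.0 or the Spark code path.

```python
from typing import Any, Dict


def wrap_image_bytes(row: Dict[str, Any], column: str = "image") -> Dict[str, Any]:
    """Hypothetical helper: wrap raw image bytes in a {"bytes", "path"} dict."""
    value = row[column]
    if isinstance(value, (bytes, bytearray, memoryview)):
        # Copy the row so the caller's dict is left untouched.
        row = dict(row)
        row[column] = {"bytes": bytes(value), "path": None}
    return row


# Example row shaped like the one in the reproduction (idx, image).
example = wrap_image_bytes({"idx": 0, "image": bytearray(b"\x89PNG")})
print(example["image"].get("bytes"))
```

In principle a function like this could be applied with map on a dataset whose image column is still typed as binary, before casting the column to Image. Whether that actually avoids the error here is an assumption and would need testing.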
Environment info
- datasets version: 3.5.0
- Platform: macOS-15.3.2-arm64-arm-64bit
- Python version: 3.12.7
- huggingface_hub version: 0.30.2
- PyArrow version: 18.1.0
- Pandas version: 2.2.3
- fsspec version: 2024.12.0