ray.data.from_huggingface

ray.data.from_huggingface(dataset: datasets.Dataset) -> ray.data.dataset.MaterializedDataset

Create a Dataset from a Hugging Face Datasets Dataset.

This function isn’t parallelized. It is intended for Hugging Face Datasets that are loaded into memory, as opposed to memory-mapped from disk.

Example

import ray
import datasets

hf_dataset = datasets.load_dataset("tweet_eval", "emotion")
ray_ds = ray.data.from_huggingface(hf_dataset["train"])
print(ray_ds)

MaterializedDataset(
    num_blocks=...,
    num_rows=3257,
    schema={text: string, label: int64}
)

Parameters

dataset – A Hugging Face Datasets Dataset. IterableDataset and DatasetDict are not supported.

Returns

A MaterializedDataset holding rows from the Hugging Face Datasets Dataset.

PublicAPI: This API is stable across Ray releases.