ray.data.preprocessors.Tokenizer

class ray.data.preprocessors.Tokenizer(columns: List[str], tokenization_fn: Optional[Callable[[str], List[str]]] = None)

Bases: ray.data.preprocessor.Preprocessor

Replace each string with a list of tokens.

Examples

>>> import pandas as pd
>>> import ray
>>> df = pd.DataFrame({"text": ["Hello, world!", "foo bar\nbaz"]})
>>> ds = ray.data.from_pandas(df)  

The default tokenization_fn splits each string on the space character only, so other whitespace, such as the newline in the second row, stays inside a token.

>>> from ray.data.preprocessors import Tokenizer
>>> tokenizer = Tokenizer(columns=["text"])
>>> tokenizer.transform(ds).to_pandas()  
               text
0  [Hello,, world!]
1   [foo, bar\nbaz]
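
The output above can be reproduced in plain Python. This is a minimal sketch that assumes the default is equivalent to lambda s: s.split(" "); default_tokenize is a hypothetical name used only for illustration and is not part of Ray.

```python
from typing import List


def default_tokenize(s: str) -> List[str]:
    # Assumed stand-in for the default tokenization_fn: split on the
    # space character only, not on arbitrary whitespace.
    return s.split(" ")


# A newline is not a space, so "bar\nbaz" stays one token.
print(default_tokenize("foo bar\nbaz"))  # ['foo', 'bar\nbaz']

# Consecutive spaces produce empty-string tokens.
print(default_tokenize("a  b"))  # ['a', '', 'b']

# str.split() with no argument splits on any run of whitespace instead.
print("foo bar\nbaz".split())  # ['foo', 'bar', 'baz']
```

If empty tokens or embedded newlines are a problem for your data, a custom tokenization_fn built on str.split() with no argument is probably the simplest fix.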

If the default logic isn’t adequate for your use case, you can specify a custom tokenization_fn.

>>> import string
>>> def tokenization_fn(s):
...     for character in string.punctuation:
...         s = s.replace(character, "")
...     return s.split()
>>> tokenizer = Tokenizer(columns=["text"], tokenization_fn=tokenization_fn)
>>> tokenizer.transform(ds).to_pandas()  
              text
0   [Hello, world]
1  [foo, bar, baz]
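
The punctuation-stripping function above calls str.replace once per punctuation character. As an alternative sketch, the same result can be had in a single pass with str.translate; strip_and_split and _STRIP_PUNCT are hypothetical names chosen for this example, not part of Ray.

```python
import string
from typing import List

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def strip_and_split(s: str) -> List[str]:
    # Remove punctuation in one pass, then split on any whitespace run.
    return s.translate(_STRIP_PUNCT).split()


print(strip_and_split("Hello, world!"))  # ['Hello', 'world']
print(strip_and_split("foo bar\nbaz"))   # ['foo', 'bar', 'baz']
```

Both versions only handle the ASCII characters listed in string.punctuation; Unicode punctuation such as curly quotes would pass through unchanged.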

Parameters
  • columns – The columns to tokenize.

  • tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" ").
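
Any callable that maps a string to a list of strings fits the tokenization_fn contract described above. As a rough sketch, the hypothetical regex_tokenize below (not part of Ray) lowercases the input and extracts runs of word characters with the standard re module.

```python
import re
from typing import List

# Compiled once; \w+ matches runs of letters, digits, and underscores.
_WORD = re.compile(r"\w+")


def regex_tokenize(s: str) -> List[str]:
    # Hypothetical custom tokenization_fn: lowercase, then keep word runs.
    return _WORD.findall(s.lower())


print(regex_tokenize("Hello, world!"))  # ['hello', 'world']
print(regex_tokenize("foo bar\nbaz"))   # ['foo', 'bar', 'baz']
```

A function like this would be passed as Tokenizer(columns=["text"], tokenization_fn=regex_tokenize).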

PublicAPI (alpha): This API is in alpha and may change before becoming stable.

Methods

fit(ds)

Fit this Preprocessor to the Dataset.

fit_transform(ds)

Fit this Preprocessor to the Dataset and then transform the Dataset.

preferred_batch_format()

Return the batch format hint that upstream producers use to try to yield the best block format.

transform(ds)

Transform the given dataset.

transform_batch(data)

Transform a single batch of data.

transform_stats()

Return Dataset stats for the most recent transform call, if any.