ray.data.preprocessors.Tokenizer
class ray.data.preprocessors.Tokenizer(columns: List[str], tokenization_fn: Optional[Callable[[str], List[str]]] = None)
Bases: ray.data.preprocessor.Preprocessor
Replace each string with a list of tokens.
Examples
>>> import pandas as pd
>>> import ray
>>> df = pd.DataFrame({"text": ["Hello, world!", "foo bar\nbaz"]})
>>> ds = ray.data.from_pandas(df)
The default tokenization_fn delimits strings using the space character.
>>> from ray.data.preprocessors import Tokenizer
>>> tokenizer = Tokenizer(columns=["text"])
>>> tokenizer.transform(ds).to_pandas()
               text
0  [Hello,, world!]
1   [foo, bar\nbaz]
If the default logic isn’t adequate for your use case, you can specify a custom tokenization_fn.
>>> import string
>>> def tokenization_fn(s):
...     for character in string.punctuation:
...         s = s.replace(character, "")
...     return s.split()
>>> tokenizer = Tokenizer(columns=["text"], tokenization_fn=tokenization_fn)
>>> tokenizer.transform(ds).to_pandas()
              text
0   [Hello, world]
1  [foo, bar, baz]
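As an alternative to stripping punctuation character by character, a custom tokenization function could extract tokens with a regular expression. The following is a minimal plain-Python sketch that does not require Ray; the name `regex_tokenize` is hypothetical and is used only for illustration.

```python
import re
from typing import List


def regex_tokenize(s: str) -> List[str]:
    """Return runs of word characters, dropping punctuation and whitespace."""
    return re.findall(r"\w+", s)


print(regex_tokenize("Hello, world!"))  # ['Hello', 'world']
print(regex_tokenize("foo bar\nbaz"))   # ['foo', 'bar', 'baz']
```

Given the documented signature, such a function could be passed as `Tokenizer(columns=["text"], tokenization_fn=regex_tokenize)`. Note that this is not identical to the `string.punctuation` approach: `\w` treats the underscore as a word character, and an apostrophe splits a word in two rather than being removed.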
Parameters
columns – The columns to tokenize.
tokenization_fn – The function used to generate tokens. This function should accept a string as input and return a list of tokens as output. If unspecified, the tokenizer uses a function equivalent to lambda s: s.split(" ").
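Because the documented default is equivalent to splitting on a single space, it behaves differently from a bare `str.split()`. The plain-Python sketch below illustrates the difference without Ray; `default_tokenize` is a hypothetical stand-in for the documented default, not the library's own function.

```python
from typing import List


def default_tokenize(s: str) -> List[str]:
    """Hypothetical stand-in for the documented default, lambda s: s.split(" ")."""
    return s.split(" ")


# Only the space character delimits tokens, so a newline stays inside a token.
newline_tokens = default_tokenize("foo bar\nbaz")

# Consecutive spaces produce empty-string tokens.
double_space_tokens = default_tokenize("a  b")

# A bare str.split() splits on any run of whitespace instead.
whitespace_tokens = "foo bar\nbaz".split()

print(newline_tokens)       # ['foo', 'bar\nbaz']
print(double_space_tokens)  # ['a', '', 'b']
print(whitespace_tokens)    # ['foo', 'bar', 'baz']
```

If the input text contains tabs, newlines, or repeated spaces, a custom `tokenization_fn` built on `str.split()` may be the better fit.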
PublicAPI (alpha): This API is in alpha and may change before becoming stable.
Methods
fit(ds) – Fit this Preprocessor to the Dataset.
fit_transform(ds) – Fit this Preprocessor to the Dataset and then transform the Dataset.
Batch format hint for upstream producers to try yielding best block format.
transform(ds) – Transform the given dataset.
transform_batch(data) – Transform a single batch of data.
Return Dataset stats for the most recent transform call, if any.