2022-12-14 15:33:00 -08:00

tiktoken

tiktoken is a fast tokeniser.

import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("hello world"))

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

The tokeniser API is documented in tiktoken/core.py.

Performance

tiktoken is 3-6x faster than Hugging Face's tokeniser:

[image: performance comparison chart]

Performance was measured on 1GB of text with the GPT-2 tokeniser. The Hugging Face baseline is GPT2TokenizerFast from tokenizers==0.13.2 and transformers==4.24.0.
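
As a rough illustration of how a throughput figure like this could be obtained, here is a minimal sketch that times any encode function over a list of documents. It is not the benchmark script behind the numbers above. The names measure_throughput and whitespace_encode are hypothetical, and the whitespace splitter is only a stand-in so the sketch runs without extra packages.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    encode: Callable[[str], Sequence],
    documents: Sequence[str],
) -> float:
    """Return bytes of UTF-8 input tokenised per second by `encode`."""
    total_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    start = time.perf_counter()
    for doc in documents:
        encode(doc)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs or coarse clocks.
    return total_bytes / max(elapsed, 1e-9)


def whitespace_encode(text: str) -> list:
    # Hypothetical stand-in tokeniser (assumption, not part of tiktoken).
    return text.split()


if __name__ == "__main__":
    docs = ["hello world " * 1000] * 100
    rate = measure_throughput(whitespace_encode, docs)
    print(f"{rate / 1e6:.1f} MB/s")
```

Passing enc.encode from the usage example near the top of this README in place of whitespace_encode would time tiktoken itself. A fair comparison would run both tokenisers over the same documents on the same machine.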

Description
JS port and JS/WASM bindings for openai/tiktoken
Readme MIT 563 KiB