Mirror of https://github.com/Mintplex-Labs/tiktoken.git (synced 2026-07-01 18:48:04 -04:00, a1a9f16826f3f2d8ba80b6c5fd270c1c340d6d67)
⏳ tiktoken
tiktoken is a fast BPE (byte pair encoding) tokeniser.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
print(enc.encode("hello world"))
The open source version of tiktoken can be installed from PyPI:
pip install tiktoken
The tokeniser API is documented in tiktoken/core.py.
Performance
tiktoken is 3-6x faster than Hugging Face's tokeniser.
Performance was measured on 1GB of text using the GPT-2 tokeniser, compared against GPT2TokenizerFast from
tokenizers==0.13.2 and transformers==4.24.0.
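A comparison of this kind can be approximated with a small timing harness. The following is a minimal sketch, not the script used to produce the figures above. The names measure_throughput and whitespace_encode are hypothetical, and whitespace_encode is only a stand-in so the snippet runs without any tokeniser installed.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    encode: Callable[[str], Sequence],
    docs: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput in UTF-8 bytes per second.

    `encode` is any function that maps a string to a sequence of tokens.
    The fastest of `repeats` runs is kept to reduce timing noise.
    """
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for d in docs:
            encode(d)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on very small inputs.
    return total_bytes / best if best > 0 else float("inf")


def whitespace_encode(text: str) -> List[str]:
    """Hypothetical stand-in tokeniser: splits on whitespace only."""
    return text.split()


# Synthetic corpus (an assumption; the figures above used 1GB of real text).
docs = ["hello world, this is a throughput test. " * 1000] * 20
rate = measure_throughput(whitespace_encode, docs)
print(f"{rate / 1e6:.1f} MB/s")
```

To compare real tokenisers, pass enc.encode (with enc obtained from tiktoken.get_encoding("gpt2") as shown earlier) and the encode method of the other tokeniser in place of whitespace_encode, using the same docs for both. Results depend heavily on hardware, thread count, and corpus, so numbers from a harness like this will not necessarily match the 3-6x figure.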