Fast tokenizers are not only fast: they can also map each token back to the word it came from and to the span of characters it covers in the raw text. This video explores these features.
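
As a rough illustration of what these mappings look like, here is a minimal pure-Python sketch. It is a toy, not the algorithm used by any real tokenizer: the function `toy_tokenize`, its `max_piece_len` parameter, and the fixed-length splitting rule are all hypothetical and exist only for this example. In the Transformers library, fast tokenizers expose the equivalent information through `return_offsets_mapping=True` and the `word_ids()` method of the returned encoding.

```python
import re

def toy_tokenize(text, max_piece_len=4):
    """Toy tokenizer (hypothetical): split into words and punctuation,
    then cut each word into pieces of at most max_piece_len characters."""
    tokens, word_ids, offsets = [], [], []
    for word_id, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        word, start = match.group(), match.start()
        for i in range(0, len(word), max_piece_len):
            piece = word[i:i + max_piece_len]
            # Mark continuation pieces with "##", as WordPiece-style vocabularies do.
            tokens.append(piece if i == 0 else "##" + piece)
            # Index of the word this token comes from.
            word_ids.append(word_id)
            # (start, end) character span of this token in the raw text.
            offsets.append((start + i, start + i + len(piece)))
    return tokens, word_ids, offsets

text = "Tokenizers map tokens back."
tokens, word_ids, offsets = toy_tokenize(text)
for token, word_id, (start, end) in zip(tokens, word_ids, offsets):
    print(f"{token!r:10} word={word_id} span=({start}, {end}) text={text[start:end]!r}")
```

Slicing the raw text with each span recovers the exact characters a token covers, and tokens that share a word index belong to the same word. Those two lookups are what tasks such as token classification and question answering rely on.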

This video is part of the Hugging Face course: http://huggingface.co/course
Open in Colab to run the code samples:
https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/videos/offset_mapping.ipynb

Related videos:
– Why are fast tokenizers called fast?: https://youtu.be/g8quOxoqhHQ
– Training a new tokenizer: https://youtu.be/DJimQynXZsQ

Don’t have a Hugging Face account? Join now: http://huggingface.co/join
Have a question? Check out the forums: https://discuss.huggingface.co/c/course/20
Subscribe to our newsletter: https://huggingface.curated.co/
