Raise an error in the Index init in case of incompatible vocab #225

Open
wants to merge 2 commits into
base: main
from

Conversation

RobinPicard
Contributor

Closes #224

The problem described in the issue above is caused by an incompatibility between the vocabulary provided by the user and the regex used to create the DFA. This is typically caused by the user encoding the vocabulary incorrectly (special tokens from their tokenizer are included).

This PR proposes to do 2 things about it:

  • Update the definition of the Index object to raise an error at initialization if such an incompatibility exists (instead of letting an error arise later, during inference)
  • Update the README to give more information on how to create a Vocabulary and warn users about this problem
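To illustrate the first point, here is a minimal, self-contained sketch of the fail-fast idea. It does not use the library's actual API: `ToyIndex`, `DIGIT_DFA`, `GOOD_VOCAB` and `BAD_VOCAB` are hypothetical names, the DFA is hand-written for a regex like `[0-9]+`, and the reachability check is only an assumption about what "incompatible" could mean in practice.

```python
from collections import deque


class ToyIndex:
    """Toy token-level index over a character DFA (hypothetical, not the library's Index)."""

    def __init__(self, dfa, start, finals, vocabulary):
        # dfa: {state: {char: next_state}}, vocabulary: {token_string: token_id}
        self.start = start
        self.finals = set(finals)
        self.transitions = {}  # {state: {token_id: next_state}}

        # Breadth-first exploration of the states reachable through whole tokens.
        seen = {start}
        queue = deque([start])
        while queue:
            state = queue.popleft()
            for token, token_id in vocabulary.items():
                end = self._walk(dfa, state, token)
                if end is None:
                    continue  # this token cannot be consumed from this state
                self.transitions.setdefault(state, {})[token_id] = end
                if end not in seen:
                    seen.add(end)
                    queue.append(end)

        # Fail at construction time rather than during inference.
        if not (seen & self.finals):
            raise ValueError(
                "Vocabulary is incompatible with the regex: "
                "no accepting state can be reached with the given tokens."
            )

    @staticmethod
    def _walk(dfa, state, token):
        """Return the state reached after consuming `token`, or None if it is rejected."""
        if not token:
            return None  # ignore empty tokens in this sketch
        for char in token:
            state = dfa.get(state, {}).get(char)
            if state is None:
                return None
        return state


# Hand-written DFA for a regex like [0-9]+ (state 1 is accepting).
DIGITS = "0123456789"
DIGIT_DFA = {0: {d: 1 for d in DIGITS}, 1: {d: 1 for d in DIGITS}}

# Token strings that match the text the regex describes.
GOOD_VOCAB = {"1": 0, "2": 1, "12": 2}
# Token strings that still carry tokenizer-specific markers (assumed example).
BAD_VOCAB = {"<s>": 0, "Ġ1": 1}

index = ToyIndex(DIGIT_DFA, 0, {1}, GOOD_VOCAB)  # builds without error

try:
    ToyIndex(DIGIT_DFA, 0, {1}, BAD_VOCAB)
except ValueError as exc:
    print(f"Rejected at init: {exc}")
```

With the check in the constructor, a badly encoded vocabulary is reported as soon as the index is built, rather than as a missing-state error partway through generation.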

@RobinPicard RobinPicard requested a review from rlouf August 4, 2025 14:37
rlouf
rlouf previously approved these changes Aug 4, 2025
@RobinPicard RobinPicard force-pushed the raise_error_incompatible_vocab branch from aba7578 to b6a5133 Compare August 4, 2025 20:11
@RobinPicard RobinPicard enabled auto-merge (rebase) August 4, 2025 20:27
@RobinPicard RobinPicard requested a review from rlouf August 4, 2025 20:31
@RobinPicard RobinPicard self-assigned this Aug 4, 2025
@RobinPicard RobinPicard force-pushed the raise_error_incompatible_vocab branch from 9a3aae1 to b6a5133 Compare August 5, 2025 08:03
Development

Successfully merging this pull request may close these issues.

"No next state found for the current state" error
2 participants