Raise an error in the Index init in case of incompatible vocab #225
Closes #224
The problem described in the issue above is caused by an incompatibility between the vocabulary provided by the user and the regex used to create the DFA. It typically arises when the user encodes the vocabulary incorrectly (special tokens from their tokenizer are included).
This PR proposes to do 2 things about it:
- `Index`: make the object raise an error at initialization if such an incompatibility exists (instead of leading to a situation in which an error could arise during inference)
- `Vocabulary`: warn users about this problem
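
To illustrate the kind of check proposed for index initialization, here is a minimal Python sketch. It is not the code from this PR: `ToyIndex`, `IncompatibleVocabularyError`, and the dictionary-based DFA representation are all hypothetical names chosen for the example. The assumption is that a vocabulary counts as incompatible when no final state of the DFA can be reached using its tokens.

```python
from collections import deque


class IncompatibleVocabularyError(ValueError):
    """Hypothetical error type used only in this sketch."""


class ToyIndex:
    """Toy stand-in for an index built from a DFA and a vocabulary.

    `transitions` maps (state, character) to the next state.
    """

    def __init__(self, transitions, initial, finals, vocabulary):
        self.token_transitions = {}
        seen = {initial}
        queue = deque([initial])

        # Breadth-first walk over the states reachable through whole tokens.
        while queue:
            state = queue.popleft()
            for token in vocabulary:
                if not token:
                    continue  # ignore empty tokens
                target = self._walk(transitions, state, token)
                if target is None:
                    continue  # the DFA rejects this token from this state
                self.token_transitions.setdefault(state, {})[token] = target
                if target not in seen:
                    seen.add(target)
                    queue.append(target)

        # Fail early, at construction time, rather than during inference.
        if not (seen & set(finals)):
            raise IncompatibleVocabularyError(
                "No final state is reachable with the given vocabulary; "
                "check that the tokens are decoded correctly."
            )

    @staticmethod
    def _walk(transitions, state, token):
        """Follow `token` character by character; return None if rejected."""
        for ch in token:
            state = transitions.get((state, ch))
            if state is None:
                return None
        return state


# Hand-written DFA for the regex "ab": 0 --a--> 1 --b--> 2 (final).
dfa = {(0, "a"): 1, (1, "b"): 2}

ok_index = ToyIndex(dfa, 0, {2}, ["a", "b", "ab"])

try:
    # Tokens that still carry a tokenizer-specific prefix never match.
    ToyIndex(dfa, 0, {2}, ["Ġa", "Ġb"])
    raised = False
except IncompatibleVocabularyError:
    raised = True
```

With the well-formed vocabulary the constructor succeeds, while the vocabulary that still contains tokenizer-specific prefixes fails immediately, which is the behavior this PR aims for at initialization.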