
Conversation

@eycjur (Contributor) commented on Aug 23, 2025

Title

Fix token encoding inconsistency between encode and token_counter

Relevant issues

No relevant issues. However, this PR resolves an inconsistency in token counting between the encode functions and token_counter, particularly for OpenAI models. The fix ensures that both paths use the correct model-specific tokenizers, eliminating discrepancies and improving accuracy for multilingual text.
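As a rough illustration of the discrepancy (a minimal sketch; litellm.encode and litellm.token_counter are the public helpers, and the exact counts depend on the installed tiktoken version):

```python
import litellm

# Multilingual text is where generic and model-specific tokenizers diverge most.
text = "こんにちは、世界"

# Before this fix, encode() could resolve a generic tokenizer while
# token_counter() used the model-specific one (o200k_base for gpt-4o),
# so these two numbers could disagree. After the fix they should match.
ids = litellm.encode(model="gpt-4o", text=text)
count = litellm.token_counter(model="gpt-4o", text=text)

print(len(ids), count)
```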

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement - see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix

Changes

  • Modified _return_openai_tokenizer in the encode path so that it returns a model-specific tokenizer (see the sketch after this list).
  • In line with the above change, removed the tokenizer-selection logic from _get_count_function, since it should now be unnecessary (internally, _select_tokenizer → _select_tokenizer_helper → _return_openai_tokenizer are called in sequence).
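A minimal sketch of the model-aware selection described above, assuming tiktoken's model map and the tokenizer_json shape used elsewhere in this diff; the helper name _return_openai_tokenizer comes from the PR, but the body below is illustrative, not the actual change:

```python
import tiktoken

def _return_openai_tokenizer(model: str) -> dict:
    """Illustrative sketch: resolve a model-specific tiktoken encoding."""
    model_to_use = model
    try:
        # Special case kept from the diff below: the GPT-4o family uses o200k_base.
        if "gpt-4o" in model_to_use:
            encoding = tiktoken.get_encoding("o200k_base")
        else:
            encoding = tiktoken.encoding_for_model(model_to_use)
    except KeyError:
        # Models unknown to tiktoken fall back to the generic cl100k_base encoding.
        encoding = tiktoken.get_encoding("cl100k_base")
    return {"type": "openai_tokenizer", "tokenizer": encoding}
```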

vercel bot commented Aug 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| litellm | Ready | Preview | Comment | Aug 23, 2025 6:37am |


try:
    # Special case: GPT-4o uses o200k_base
    if "gpt-4o" in model_to_use:
@eycjur (Contributor, Author) commented:

I am not particularly knowledgeable about this area, but I kept the special handling for gpt-4o in line with the original count_tokens logic. If tiktoken already handles this appropriately, the special case may be unnecessary.
https://github.com/openai/tiktoken/blob/eedc856364506a9d4651645a0290eb0ba81e6935/tiktoken/model.py#L36
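For reference, a recent tiktoken release (including the model map linked above) already resolves the gpt-4o family to o200k_base, which is why the special case may be redundant:

```python
import tiktoken

print(tiktoken.encoding_for_model("gpt-4o").name)  # o200k_base
print(tiktoken.encoding_for_model("gpt-4").name)   # cl100k_base
```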

@@ -518,15 +518,7 @@ def count_tokens(text: str) -> int:
        return len(enc.ids)

    elif tokenizer_json["type"] == "openai_tokenizer":
        model_to_use = _fix_model_name(model)  # type: ignore
@eycjur (Contributor, Author) commented:

Since the call chain is _select_tokenizer → _select_tokenizer_helper → _return_openai_tokenizer, I thought it was safe to delete this logic here once it is handled in _return_openai_tokenizer.
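For context, a hedged sketch of the simplified branch after this removal; tokenizer_json is the closure variable visible in the diff above, and the exact body is an assumption, not the merged code:

```python
def count_tokens(text: str) -> int:
    # Sketch: tokenizer_json is assumed to be the dict already resolved upstream by
    # _select_tokenizer -> _select_tokenizer_helper -> _return_openai_tokenizer.
    if tokenizer_json["type"] == "huggingface_tokenizer":
        enc = tokenizer_json["tokenizer"].encode(text)
        return len(enc.ids)
    elif tokenizer_json["type"] == "openai_tokenizer":
        # No _fix_model_name / re-selection here anymore; the encoding is already
        # model-specific, so its output can be counted directly.
        return len(tokenizer_json["tokenizer"].encode(text))
    # Other tokenizer types are handled elsewhere in the real function.
```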
