
Conversation

@eycjur (Contributor) commented on Aug 23, 2025

Title

Fix token encoding inconsistency between encode and token_counter

Relevant issues

No relevant issues. However, this PR resolves an inconsistency in token counting between the encode functions and token_counter, particularly for OpenAI models. The fix ensures that both paths use the correct model-specific tokenizers, eliminating discrepancies and improving accuracy for multilingual text.
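As a rough illustration of the discrepancy (a minimal sketch; litellm.encode and litellm.token_counter are the public helpers, and the exact counts depend on the installed tiktoken version):

```python
import litellm

# Multilingual text is where generic and model-specific tokenizers diverge most.
text = "こんにちは、世界"

# Before this fix, encode() could resolve a generic tokenizer while
# token_counter() used the model-specific one (o200k_base for gpt-4o),
# so these two numbers could disagree. After the fix they should match.
ids = litellm.encode(model="gpt-4o", text=text)
count = litellm.token_counter(model="gpt-4o", text=text)

print(len(ids), count)
```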

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement - see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix

Changes

  • Modified _return_openai_tokenizer in the encode path so that it returns a model-specific tokenizer (see the sketch after this list).
  • In line with the above change, removed the tokenizer-selection logic from _get_count_function, since it should now be unnecessary (internally, _select_tokenizer → _select_tokenizer_helper → _return_openai_tokenizer are called in sequence).
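A minimal sketch of the model-aware selection described above, assuming tiktoken's model map and the tokenizer_json shape used elsewhere in this diff; the helper name _return_openai_tokenizer comes from the PR, but the body below is illustrative, not the actual change:

```python
import tiktoken

def _return_openai_tokenizer(model: str) -> dict:
    """Illustrative sketch: resolve a model-specific tiktoken encoding."""
    model_to_use = model
    try:
        # Special case kept from the diff below: the GPT-4o family uses o200k_base.
        if "gpt-4o" in model_to_use:
            encoding = tiktoken.get_encoding("o200k_base")
        else:
            encoding = tiktoken.encoding_for_model(model_to_use)
    except KeyError:
        # Models unknown to tiktoken fall back to the generic cl100k_base encoding.
        encoding = tiktoken.get_encoding("cl100k_base")
    return {"type": "openai_tokenizer", "tokenizer": encoding}
```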

vercel bot commented Aug 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| litellm | Ready | Preview | Comment | Aug 23, 2025 6:37am |


try:
    # Special case: GPT-4o uses o200k_base
    if "gpt-4o" in model_to_use:
@eycjur (Contributor, Author) commented:

I am not particularly knowledgeable about this area, but I kept the special handling for gpt-4o in line with the original count_tokens logic. If tiktoken already handles this appropriately, the special case may be unnecessary.
https://github.com/openai/tiktoken/blob/eedc856364506a9d4651645a0290eb0ba81e6935/tiktoken/model.py#L36
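For reference, a recent tiktoken release (including the model map linked above) already resolves the gpt-4o family to o200k_base, which is why the special case may be redundant:

```python
import tiktoken

print(tiktoken.encoding_for_model("gpt-4o").name)  # o200k_base
print(tiktoken.encoding_for_model("gpt-4").name)   # cl100k_base
```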

@@ -518,15 +518,7 @@ def count_tokens(text: str) -> int:
        return len(enc.ids)

    elif tokenizer_json["type"] == "openai_tokenizer":
        model_to_use = _fix_model_name(model)  # type: ignore
@eycjur (Contributor, Author) commented:

Since the call chain is _select_tokenizer → _select_tokenizer_helper → _return_openai_tokenizer, I thought it was safe to delete this logic here once it is handled in _return_openai_tokenizer.
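For context, a hedged sketch of the simplified branch after this removal; tokenizer_json is the closure variable visible in the diff above, and the exact body is an assumption, not the merged code:

```python
def count_tokens(text: str) -> int:
    # Sketch: tokenizer_json is assumed to be the dict already resolved upstream by
    # _select_tokenizer -> _select_tokenizer_helper -> _return_openai_tokenizer.
    if tokenizer_json["type"] == "huggingface_tokenizer":
        enc = tokenizer_json["tokenizer"].encode(text)
        return len(enc.ids)
    elif tokenizer_json["type"] == "openai_tokenizer":
        # No _fix_model_name / re-selection here anymore; the encoding is already
        # model-specific, so its output can be counted directly.
        return len(tokenizer_json["tokenizer"].encode(text))
    # Other tokenizer types are handled elsewhere in the real function.
```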
