Skip to content

Conversation

divyanshsinghvi
Copy link
Contributor

@divyanshsinghvi divyanshsinghvi commented Aug 22, 2025

Fixes #40239
The commit moves away old dependency on jieba to a much efficient rjieba which is a python binding for jieba-rs (jieba implemented in rust)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ydshieh @Rocketknight1

@divyanshsinghvi divyanshsinghvi force-pushed the porting_jieba_dependency_to_rjieba branch from 48158ef to 6a85fa7 Compare August 22, 2025 20:12
@divyanshsinghvi divyanshsinghvi force-pushed the porting_jieba_dependency_to_rjieba branch from 45c21cd to 6297bfa Compare August 25, 2025 15:01
@divyanshsinghvi
Copy link
Contributor Author

cc: @ydshieh
Let me know if something else needs to be done for the same?

Copy link

@Ed-uardo Ed-uardo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@@ -223,7 +223,7 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] =

def _batch_encode_plus(self, batch_text_or_text_pairs, *args, **kwargs):
batch_text_or_text_pairs = [
" ".join([x.translate(self.translator) for x in self.jieba.cut(text, cut_all=False)])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to remove cut_all=

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah because I couldn't find the cut function with the cut_all parameter,

rjieba has a separate function :
rjieba.cut_all for cut_all=True as I could infer from the docs

https://github.com/messense/rjieba-py/blob/25690028c59c69df615ec38351f51bb5d77f3286/tests/test_rjieba.py#L5

https://hexdocs.pm/jieba/Jieba.html

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is """Runs pre-tokenization with Jieba segmentation tool. It is used in CPM models.""".

Do you want to change it to Rust Jieba here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with Jieba-RS instead of Rust Jieba to make it consistent with other. Or do you prefer Rust Jieba

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ok, thank you

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you a lot!

The removal of some symbol definitions may not be ideal, but I think we won't get much complain. Let's see.

Copy link
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: cpm, cpmant, xlm

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again. I will merge after building the image and check nothing breaks

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ydshieh
Copy link
Collaborator

ydshieh commented Aug 29, 2025

Unfortunately, there is some serious issue when I built the CI docker images based on this PR

https://app.circleci.com/pipelines/github/huggingface/transformers/144302/workflows/639ba8ce-db91-4343-936c-1b71502fe039/jobs/1907652

It's unclear to me what happens, and when I switch back to the images built on main branch, it doesn't have this kind of error

https://app.circleci.com/pipelines/github/huggingface/transformers/144306/workflows/bf945811-ea05-4457-8e4e-e2e7c2b0bc40/jobs/1907682

(I cancelled the jobs at some point).

We can wait on Monday to see if the newly built images (based on main branch) during the weekend continue to work. In that case, it means this PR could not merged as it, and need a deeper dive.

@divyanshsinghvi
Copy link
Contributor Author

Unfortunately, there is some serious issue when I built the CI docker images based on this PR

https://app.circleci.com/pipelines/github/huggingface/transformers/144302/workflows/639ba8ce-db91-4343-936c-1b71502fe039/jobs/1907652

It's unclear to me what happens, and when I switch back to the images built on main branch, it doesn't have this kind of error

https://app.circleci.com/pipelines/github/huggingface/transformers/144306/workflows/bf945811-ea05-4457-8e4e-e2e7c2b0bc40/jobs/1907682

(I cancelled the jobs at some point).

We can wait on Monday to see if the newly built images (based on main branch) during the weekend continue to work. In that case, it means this PR could not merged as it, and need a deeper dive.

@ydshieh I see. On a superficial look, the issues seems unrelated to the changes in the PR, but I guess, we can wait and see and then I can dive deeper.

@ydshieh
Copy link
Collaborator

ydshieh commented Aug 30, 2025

@divyanshsinghvi Good news, it's not from this PR but from pytest-dev/pytest-rerunfailures#303

I am pinging it to 15.1 and can merge this PR afterward 🎉

@ydshieh ydshieh merged commit 1363fce into huggingface:main Aug 30, 2025
24 checks passed
@divyanshsinghvi
Copy link
Contributor Author

@divyanshsinghvi Good news, it's not from this PR but from pytest-dev/pytest-rerunfailures#303

I am pinging it to 15.1 and can merge this PR afterward 🎉

Ah interesting, how did you figure it out, through searching the error and finding that it was similar to what others mentioned on pytest-dev issues, or were you able to point it out to pytest-dev from the error logs? Such rare bugs are difficult to find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Please remove the redundant dependency jieba: rjieba does the same and has better performance
4 participants