remove the redundant non maintained jieba to rjieba #40383
cc: @ydshieh
Looks good!
@@ -223,7 +223,7 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] =

    def _batch_encode_plus(self, batch_text_or_text_pairs, *args, **kwargs):
        batch_text_or_text_pairs = [
            " ".join([x.translate(self.translator) for x in self.jieba.cut(text, cut_all=False)])
Is it necessary to remove cut_all=?
Yes, because I couldn't find a cut function with the cut_all parameter. rjieba has a separate function, rjieba.cut_all, for the cut_all=True case, as far as I could infer from the docs.
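To illustrate the point above, here is a minimal sketch of how a jieba-style `cut_all` flag could be mapped onto a backend that exposes `cut` and `cut_all` as two separate functions, which is how rjieba is described in this thread. The `segment` helper and the `fake_backend` stand-in are hypothetical and not part of this PR; the stand-in exists only so the snippet runs without rjieba installed.

```python
from types import SimpleNamespace


def segment(backend, text, cut_all=False):
    """Hypothetical adapter: keep a jieba-style `cut_all` flag on top of a
    backend that exposes separate `cut` and `cut_all` functions."""
    words = backend.cut_all(text) if cut_all else backend.cut(text)
    return list(words)


# Stand-in backend (assumption, for illustration only). It splits on
# whitespace for `cut` and lists every substring for `cut_all`; a real
# segmenter behaves very differently.
fake_backend = SimpleNamespace(
    cut=lambda text: text.split(),
    cut_all=lambda text: [
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    ],
)

default_mode = segment(fake_backend, "ab cd")
full_mode = segment(fake_backend, "ab", cut_all=True)
```

With rjieba installed, the module itself could presumably be passed as the backend, e.g. `segment(rjieba, text)`. Since the tokenizer in the diff only ever used `cut_all=False`, calling the plain `cut` function keeps the default (non-full) segmentation mode.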
There is the docstring """Runs pre-tokenization with Jieba segmentation tool. It is used in CPM models.""". Do you want to change it to Rust Jieba here?
Updated to Jieba-RS instead of Rust Jieba, to keep it consistent with the other occurrences. Or do you prefer Rust Jieba?
it's ok, thank you
Very nice, thanks a lot!
The removal of some symbol definitions may not be ideal, but I think we won't get many complaints. Let's see.
[For maintainers] Suggested jobs to run (before merge): run-slow: cpm, cpmant, xlm
Thanks again. I will merge after building the image and checking that nothing breaks.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Unfortunately, there is a serious issue when I build the CI Docker images based on this PR. It's unclear to me what is happening; when I switch back to the images built on the main branch, this kind of error does not occur (I cancelled the jobs at some point). We can wait until Monday to see whether the images newly built from the main branch over the weekend continue to work. If they do, this PR cannot be merged as is and needs a deeper dive.
@ydshieh I see. At a superficial look, the issue seems unrelated to the changes in this PR, but I guess we can wait and see, and then I can dive deeper.
@divyanshsinghvi Good news: it's not from this PR but from pytest-dev/pytest-rerunfailures#303. I am pinning it to 15.1 and can merge this PR afterward 🎉
Ah, interesting. How did you figure it out: by searching for the error and finding that it was similar to what others mentioned in the pytest-dev issues, or were you able to trace it to pytest-dev from the error logs? Such rare bugs are difficult to find.
Fixes #40239
This commit replaces the old dependency on jieba with the much more efficient rjieba, a Python binding for jieba-rs (jieba implemented in Rust).
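As a rough sketch of what swapping an optional dependency like this usually involves, the snippet below imports the segmentation backend lazily and raises an actionable error when it is missing. The `load_segmenter` helper and its error message are hypothetical and are not taken from this PR; only the module name `rjieba` comes from the description above.

```python
import importlib


def load_segmenter(module_name="rjieba"):
    """Hypothetical helper: import the segmentation backend lazily and fail
    with an actionable message if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        # Re-raise with a hint, keeping the original error as the cause.
        raise ImportError(
            f"{module_name} is required for this tokenizer; "
            f"try `pip install {module_name}`."
        ) from error
```

Keeping the import inside a function means the rest of the library can still be imported when the backend is absent; only code paths that actually need segmentation would hit the error.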
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ydshieh @Rocketknight1