zleyyij
Contributor

@zleyyij zleyyij commented Jul 21, 2025

At the time of opening, this isn't quite ready to be merged, but I want you to be able to track progress and provide feedback.

My roadmap looks like this:

  • Design and implement an interface for churning through a lot of files without loading them all into memory at the same time
  • Split text into epochs
  • Score each segment in an epoch, using a reservoir sampler to estimate frequencies within the epoch
  • Determine the best order to concatenate k-mers to form the dictionary (likely simple)
  • Implement zstd format wrappers around the dictionary
  • Create user-facing methods for creating a new dictionary
  • Create rough benchmarks to ensure that created dictionaries are effective and creation is usably fast for enwik9
  • Add dict creation to the CLI
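
For readers unfamiliar with the reservoir sampler mentioned in the roadmap, here is a minimal sketch of the general technique (Algorithm R). It is not the code from this PR: the names `Reservoir`, `offer`, `estimate_count`, and the tiny `XorShift64` generator are all hypothetical, chosen only so the example needs no external crates.

```rust
/// Tiny xorshift64 generator so the sketch has no dependencies.
/// (Hypothetical helper; real code would likely use a proper RNG crate.)
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Keeps a uniform random sample of at most `capacity` items from a
/// stream of unknown length, using bounded memory (Algorithm R).
struct Reservoir<T> {
    capacity: usize,
    seen: u64,
    items: Vec<T>,
    rng: XorShift64,
}

impl<T> Reservoir<T> {
    fn new(capacity: usize, seed: u64) -> Self {
        Reservoir {
            capacity,
            seen: 0,
            items: Vec::with_capacity(capacity),
            // xorshift must not be seeded with zero.
            rng: XorShift64(seed.max(1)),
        }
    }

    /// Offer one item from the stream to the sampler.
    fn offer(&mut self, item: T) {
        self.seen += 1;
        if self.items.len() < self.capacity {
            // Still filling the reservoir: keep everything.
            self.items.push(item);
            return;
        }
        // Pick a slot in 0..seen; the item survives only if the slot
        // falls inside the reservoir (probability capacity / seen).
        // The modulo introduces a negligible bias, ignored in this sketch.
        let j = self.rng.next() % self.seen;
        if j < self.capacity as u64 {
            self.items[j as usize] = item;
        }
    }
}

impl<T: PartialEq> Reservoir<T> {
    /// Estimate how often `needle` occurred in the whole stream by
    /// scaling its count in the sample up to the stream length.
    fn estimate_count(&self, needle: &T) -> f64 {
        if self.items.is_empty() {
            return 0.0;
        }
        let hits = self.items.iter().filter(|x| *x == needle).count();
        hits as f64 * self.seen as f64 / self.items.len() as f64
    }
}
```

While the stream is shorter than the capacity, the estimate is exact. Once the reservoir is full, it becomes an approximation whose memory use is independent of the epoch size.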

Once that's done, I can begin improving dictionary creation in a separate PR.

@zleyyij
Contributor Author

zleyyij commented Jul 21, 2025

Consider it a draft for the time being

@KillingSpark
Owner

That's very cool! Thanks for the update on your progress :)

@zleyyij
Contributor Author

zleyyij commented Aug 11, 2025

Update: the core algorithm is complete. All that's left is cleanup, and possibly performance/efficiency improvements in a different PR.

Benchmarking on the github-users sample set, we achieve compression gains within a percentage point of the original implementation.

uncompressed: 100.00% (7484607 bytes)
no dict: 34.99% of original size (2618872 bytes)
reference dict: 16.16% of no dict size (2195672 bytes smaller)
our dict: 16.28% of no dict size (2192400 bytes smaller)
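
As a cross-check on how to read these figures, the arithmetic can be reproduced from the byte counts alone. The sketch below assumes that "N bytes smaller" means the dictionary-compressed output is N bytes smaller than the no-dict output. The function and constant names are illustrative, not from the PR.

```rust
/// Size of `part` as a percentage of `whole`.
fn percent_of(part: u64, whole: u64) -> f64 {
    part as f64 / whole as f64 * 100.0
}

// Byte counts quoted in the benchmark output above.
const UNCOMPRESSED: u64 = 7_484_607;
const NO_DICT: u64 = 2_618_872;
// Assumed meaning: bytes saved relative to the no-dict output.
const REFERENCE_SAVED: u64 = 2_195_672;
const OURS_SAVED: u64 = 2_192_400;

/// Returns [no dict vs. original, reference dict vs. no dict,
/// our dict vs. no dict], each as a percentage.
fn report() -> [f64; 3] {
    [
        percent_of(NO_DICT, UNCOMPRESSED),
        percent_of(NO_DICT - REFERENCE_SAVED, NO_DICT),
        percent_of(NO_DICT - OURS_SAVED, NO_DICT),
    ]
}
```

Under that reading, the outputs are 423200 bytes with the reference dictionary and 426472 bytes with ours. That is a difference of 3272 bytes, or about 0.12 percentage points of the no-dict size.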

@zleyyij
Contributor Author

zleyyij commented Aug 19, 2025

@KillingSpark this is ready for review :)

@KillingSpark
Owner

Nice! I'll do a complete review in a bit. A few small things I noticed while skimming the change set:

  • Typo introduced in Cargo.toml: the -> theea
  • In lib.rs, VERBOSE should not be set to true
  • The big chunk of commented-out code in zstd_dict.rs should either be used or deleted

It'll probably be a few days until I find the time to do a full review, but I'm excited to have this functionality merged soon :)

- Fix typo in Cargo.toml
- Set VERBOSE to false and add a test to verify it's false
- Remove commented-out bench code from zstd_dict.rs
@KillingSpark
Owner

Hi, finally found the time to review. This looks great! One thing needs fixing before I can merge: the change in lib.rs where you removed the mod tests; declaration and included your own small tests module. Removing mod tests; results in the tests not running. I like the test for verbose_disabled though; you could copy that into tests/mod.rs.

And one more nitpick, which I wouldn't bother you with if it was just that: in Cargo.toml there's still a typo changing the -> thee.

Otherwise great work, I'll gladly merge it with the issue above resolved :)
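
To illustrate the point about mod tests; for other readers: Rust only compiles files that are declared somewhere in the module tree, so a tests/mod.rs without a matching declaration in lib.rs is silently skipped. The single-file sketch below shows the shape of the requested fix. It assumes VERBOSE is a plain `const bool`, and uses an inline module named `tests_sketch` as a stand-in for the real `mod tests;` declaration. Both are assumptions, not the crate's actual code.

```rust
// lib.rs (sketch; the crate's real declarations are assumed, not quoted)

/// Assumed to be a plain boolean constant.
pub const VERBOSE: bool = false;

// In the crate this would be `#[cfg(test)] mod tests;`, which pulls in
// tests/mod.rs. Without that line, rustc never sees the file and none
// of its #[test] functions run. An inline module is used here only so
// the sketch fits in one file.
#[cfg(test)]
mod tests_sketch {
    use super::VERBOSE;

    // The new check, living alongside the existing tests.
    #[test]
    fn verbose_disabled() {
        assert!(!VERBOSE);
    }
}
```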
