AswanthManoj
Contributor

Inspired by https://jina.ai/tokenizer/#chunking, which leverages common structural cues and a set of rules and heuristics to segment text into meaningful chunks. The approach is intended to work well across diverse types of content, including Markdown, HTML, LaTeX, and more.

Reference: https://gist.github.com/JeremiahZhang/2f8ae87dad836b25f40c02b8c43d16ec
Original x post: https://x.com/JinaAI_/status/1823756993108304135
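
For readers unfamiliar with the idea, here is a minimal sketch of structural cue chunking in Python. It is not the code in this PR and not Jina's regex (the referenced gist uses a far larger pattern); it only handles three Markdown cues (fenced code blocks, ATX headings, and blank-line-separated paragraphs). The names `CHUNK_PATTERN` and `chunk_by_structure` are hypothetical and chosen for illustration.

```python
import re

# Hypothetical, heavily simplified cue pattern (an assumption for illustration).
# Alternatives are tried in order, so a fenced code block is kept whole
# before heading or paragraph rules get a chance to split it.
CHUNK_PATTERN = re.compile(
    r"""
      (?P<code>^`{3}[\s\S]*?^`{3}[ \t]*$)      # fenced code block, kept intact
    | (?P<heading>^\#{1,6}[ \t]+.+$)           # Markdown ATX heading line
    | (?P<paragraph>                           # run of consecutive text lines
        (?:^(?!\s*$)(?!\#{1,6}[ \t])(?!`{3}).+\n?)+
      )
    """,
    re.MULTILINE | re.VERBOSE,
)


def chunk_by_structure(text: str) -> list[tuple[str, str]]:
    """Return (kind, chunk) pairs, where kind names the cue that matched."""
    return [
        (match.lastgroup, match.group().strip())
        for match in CHUNK_PATTERN.finditer(text)
    ]


FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example
sample = (
    "# Title\n\n"
    "First paragraph\nstill first.\n\n"
    f"{FENCE}python\n# not a heading\nx = 1\n{FENCE}\n\n"
    "## Section\nSecond paragraph.\n"
)

chunks = chunk_by_structure(sample)
for kind, chunk in chunks:
    print(f"[{kind}] {chunk!r}")
```

Putting the code-block alternative first is the key design choice in this sketch: the `# not a heading` comment inside the fence stays part of the code chunk instead of being treated as a heading. A production pattern would need further cues (lists, tables, HTML tags, LaTeX environments) and length limits per chunk.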

AswanthManoj changed the title from "Added structural cue chunking strategy based on JinaAI's tokenizer ch…" to "Add structural cue chunking based inspired by JinaAI's implementation" on Sep 6, 2024