Performance bottleneck in HTML serializer due to repeated grid cell access #372

@ddoron9

Hi,

I’ve run into a performance issue when serializing large tables in docling_core/transforms/serializer/html.py.

In HTMLTableSerializer.serialize, the code loops over the grid like this:

for i in range(nrows):
    body += "<tr>"
    for j in range(ncols):
        cell: TableCell = item.data.grid[i][j]
        ...

For big tables (thousands of rows × many columns), this nested indexing becomes the main bottleneck: profiling shows that it dominates the conversion time.

Suggestion:
Cache the grid access (row-level or full-grid) to avoid repeated lookups. For example:

for row in item.data.grid:
    body += "<tr>"
    for cell in row:
        ...
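
To illustrate why hoisting the grid lookup out of the loop could matter, here is a minimal sketch. It assumes that `grid` is a property rebuilt on every attribute access; that is an assumption made for illustration, not something confirmed above. `FakeTableData`, `serialize_indexed`, and `serialize_cached` are hypothetical names and are not part of docling_core.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeTableData:
    """Hypothetical stand-in for item.data; NOT the docling_core class."""

    num_rows: int
    num_cols: int
    grid_builds: int = field(default=0, init=False)

    @property
    def grid(self) -> List[List[str]]:
        # Assumption: the whole grid is rebuilt on each access.
        self.grid_builds += 1
        return [
            [f"r{i}c{j}" for j in range(self.num_cols)]
            for i in range(self.num_rows)
        ]


def serialize_indexed(data: FakeTableData) -> str:
    """Indexing pattern from the issue: data.grid is evaluated per cell."""
    parts = []
    for i in range(data.num_rows):
        parts.append("<tr>")
        for j in range(data.num_cols):
            parts.append(f"<td>{data.grid[i][j]}</td>")
        parts.append("</tr>")
    return "".join(parts)


def serialize_cached(data: FakeTableData) -> str:
    """Suggested pattern: evaluate data.grid once, then iterate rows."""
    grid = data.grid  # single evaluation
    parts = []
    for row in grid:
        parts.append("<tr>")
        for cell in row:
            parts.append(f"<td>{cell}</td>")
        parts.append("</tr>")
    return "".join(parts)
```

Under that assumption, the indexed version rebuilds the grid once per cell (`num_rows * num_cols` times), while the cached version rebuilds it once; both produce the same output.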

Alternatively, pre-extract the needed TableCell fields into a lighter structure before looping.
Would you be open to a PR with this refactor? It should speed up large table serialization a lot.
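
The pre-extraction alternative mentioned above could look roughly like the sketch below. `Cell`, its `text` and `is_header` fields, `extract_rows`, and `render` are hypothetical and chosen only for illustration; the real TableCell model may expose different fields.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Tuple


@dataclass
class Cell:
    """Hypothetical cell model; field names are illustrative only."""

    text: str
    is_header: bool = False


def extract_rows(grid: List[List[Cell]]) -> List[List[Tuple[str, str]]]:
    """Reduce each cell to the (tag, escaped text) pair the loop needs."""
    return [
        [("th" if c.is_header else "td", escape(c.text)) for c in row]
        for row in grid
    ]


def render(rows: List[List[Tuple[str, str]]]) -> str:
    """Build the table body from the lightweight tuples."""
    parts = []
    for row in rows:
        parts.append("<tr>")
        parts.extend(f"<{tag}>{text}</{tag}>" for tag, text in row)
        parts.append("</tr>")
    return "".join(parts)
```

The rendering loop then touches only plain tuples, so any per-attribute overhead of the cell model is paid once during extraction rather than inside the hot loop.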

Thanks for your work! 🚀
