I’ve run into a performance issue when serializing large tables in docling_core/transforms/serializer/html.py.
In HTMLTableSerializer.serialize, the code loops over the grid like this:
for i in range(nrows):
    body += "<tr>"
    for j in range(ncols):
        cell: TableCell = item.data.grid[i][j]
        ...
For big tables (thousands of rows × many columns), the repeated item.data.grid[i][j] lookups become the main bottleneck; profiling shows they dominate the conversion time.
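For reference, a minimal reproduction sketch of the kind of profile I mean. The document-level HTMLDocSerializer entry point and the synthetic-table construction are based on my reading of the docling_core API, so treat the exact names as illustrative rather than authoritative:

    import cProfile
    import pstats

    from docling_core.transforms.serializer.html import HTMLDocSerializer
    from docling_core.types.doc.document import DoclingDocument, TableCell, TableData

    # Build a synthetic document containing one large table (2000 x 20 cells).
    NROWS, NCOLS = 2000, 20
    cells = [
        TableCell(
            text=f"r{r}c{c}",
            start_row_offset_idx=r,
            end_row_offset_idx=r + 1,
            start_col_offset_idx=c,
            end_col_offset_idx=c + 1,
        )
        for r in range(NROWS)
        for c in range(NCOLS)
    ]
    doc = DoclingDocument(name="big-table")
    doc.add_table(data=TableData(num_rows=NROWS, num_cols=NCOLS, table_cells=cells))

    # Profile a full HTML serialization of the document.
    with cProfile.Profile() as profiler:
        HTMLDocSerializer(doc=doc).serialize()

    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)

In my runs, the cumulative time is concentrated under HTMLTableSerializer.serialize, consistent with the grid indexing being the hot path.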
Suggestion:
Cache the grid access (row-level or full-grid) to avoid repeated lookups. For example:
for row in item.data.grid:
    body += "<tr>"
    for cell in row:
        ...
Alternatively, pre-extract the needed TableCell fields into a lighter structure before looping.
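A rough sketch of that second variant is below. serialize_table_body is a hypothetical helper, it deliberately skips things the real serializer handles (column/row headers, captions, nested cell content), and it also builds the output with a list and join rather than repeated string concatenation, which is a separate small win:

    from html import escape

    from docling_core.types.doc.document import TableItem

    def serialize_table_body(item: TableItem) -> str:
        # Pull the few fields needed for rendering out of the TableCell models
        # in one pass, so the hot loop only touches plain Python tuples.
        rows = [
            [(cell.text, cell.row_span, cell.col_span) for cell in row]
            for row in item.data.grid
        ]

        parts: list[str] = []
        for row in rows:
            parts.append("<tr>")
            for text, row_span, col_span in row:
                parts.append(
                    f'<td rowspan="{row_span}" colspan="{col_span}">{escape(text)}</td>'
                )
            parts.append("</tr>")
        return "".join(parts)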
Would you be open to a PR with this refactor? It should significantly reduce serialization time for large tables.