Conversation

@dpbutter (Contributor) commented Aug 6, 2025

This is a reworking of a small part of Ex_comparator to make it faster. (I measure ~2.6x speedup on general workflows, up to 4x speedup on very elaborate substitutes with many rules. For example, one typical substitute went from 27 s to 7 s.)

The underlying comparator used a replacement_map, mapping Ex objects to Ex objects. These Ex objects are created and destroyed many times in involved computations, yet they only differ from existing objects by a small number of operations (modifying parent relations or eliminating children). To save on clock cycles, I've introduced Lazy_Ex, which is a glorified wrapper around an Ex::iterator and a flag indicating what modification to perform. The idea is to avoid Ex creation as much as possible.

The Lazy_Ex provides a resolve() routine which will apply the operation and return a new Ex only when needed.

However, most of the objects in replacement_map are just passed to subtree_compare, so I modified subtree_compare to natively accept Lazy_Ex objects and apply the flagged operation during the tree comparison. In practice, creation of such new Ex objects is therefore kept to a minimum.
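
Roughly, the idea looks like the following sketch (simplified; the operation names and the exact subtree_compare hookup are illustrative rather than the literal code in the PR):

```cpp
#include "Storage.hh"   // Ex, str_node
using namespace cadabra;

// Sketch only: operation names and details are illustrative.
class Lazy_Ex {
	public:
		enum class Op { none, strip_parent_rel, erase_children };

		Lazy_Ex(Ex::iterator it_, Op op_=Op::none)
			: it(it_), op(op_) {}

		// Materialise a real Ex only when a caller genuinely needs one.
		Ex resolve() const
			{
			Ex ret(it);                                   // copies the subtree
			switch(op) {
				case Op::strip_parent_rel:
					ret.begin()->fl.parent_rel = str_node::p_none;
					break;
				case Op::erase_children:
					ret.erase_children(ret.begin());
					break;
				case Op::none:
					break;
				}
			return ret;
			}

		Ex::iterator it;
		Op           op;
};

// subtree_compare can then take a Lazy_Ex directly and apply the pending
// operation while walking the tree: for erase_children it simply does not
// descend into the children of the wrapped node, and for strip_parent_rel
// it treats the top node's parent relation as p_none during the comparison,
// so no temporary Ex has to be allocated on the common path.
```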

@dpbutter (Contributor, Author) commented Aug 7, 2025

P.S. The penultimate commit includes an entire testing apparatus where I compared existing runs of compare with the old and new replacement maps being used side-by-side, to make sure nothing different happened.
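
In spirit, the check is just a differential wrapper of this shape (simplified; `old_path` and `new_path` stand in for compare() driven by the old and the new replacement map):

```cpp
#include <cassert>

// Run both code paths on the same inputs and insist they agree.
template<class OldPath, class NewPath, class... Args>
auto checked_compare(OldPath&& old_path, NewPath&& new_path, Args&&... args)
	{
	auto old_result = old_path(args...);
	auto new_result = new_path(args...);
	assert(old_result == new_result);   // any divergence means the rework changed behaviour
	return new_result;
	}

// e.g.  auto res = checked_compare(run_with_old_map, run_with_new_map, pattern_it, object_it);
// (the callable names here are hypothetical)
```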

@kpeeters (Owner) commented Aug 7, 2025

What does the typical Ex look like for which you are trying to optimise? Is it one head node with a few children? Or just single nodes without children at all?

@dpbutter (Contributor, Author) commented Aug 7, 2025

I'll take a deeper look at this later and try to profile them. (I could also send you the scripts I'm using if it would be useful.)

I admit I was quite surprised by the cost of Ex creation and by how much of a benefit revising replacement_map brought. My solution does seem like a kludge; there is probably a better/smarter way of circumventing it, but that would require a greater rewrite of your existing code logic, which I wanted to avoid.

One possibility, which I didn't explore, is that the real time issue is with multiplier insertion into rat_set. I ran perf report on a long workflow and a sizeable amount of time was spent in Multiplier::operator<. I suspect this may be due to calls to the str_node constructor, which then tries to insert an already existing multiplier into rat_set over and over again.
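
To illustrate the kind of pattern I suspect (toy types only, not the actual str_node / rat_set code):

```cpp
#include <set>

struct Multiplier {
	long num, den;
	bool operator<(const Multiplier& other) const
		{
		return num*other.den < other.num*den;   // this is where perf sees the time
		}
};

std::set<Multiplier> rat_set;                  // shared storage for multipliers

struct toy_node {
	std::set<Multiplier>::iterator multiplier;
	toy_node()
		{
		// Even though the value 1 is almost always in the set already,
		// every construction walks the set again just to find it.
		multiplier = rat_set.insert(Multiplier{1,1}).first;
		}
};
```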

@kpeeters (Owner) commented Aug 7, 2025

I made some changes recently to Multiplier which may have removed this bottleneck (it's in 2.5.14).

@dpbutter (Contributor, Author) commented Aug 7, 2025

The original data I compared against was indeed from 2.5.12. Repeating it with 2.5.14 helps a bit -- that takes the 27 s operation I mentioned above down to 23 s. So I guess Lazy_Ex circumvents more than just Multiplier, because it still brings that down to 7 s. I should benchmark how many Ex creations happen in my operation, but I haven't done so yet.

@dpbutter (Contributor, Author) commented Aug 15, 2025

I collected some statistics for typical Ex objects in a long run. Below I give a dictionary where the key is the size of the Ex object that would have been created and the value is the count of such objects. Unsurprisingly, it is highly dominated by objects with one node (presumably indices or objects with erased children).

1: 115746460
2: 2076
3: 43964
4: 34188
5: 64196
7: 7842
8: 16
9: 212
10: 82
11: 380
13: 248
14: 192
15: 764
16: 144
17: 620
18: 580
19: 1280
20: 544
21: 512
22: 644
23: 180
24: 250
25: 324
26: 112
27: 272
28: 188
29: 4
31: 216
32: 252
34: 72
85: 32
90: 32
96: 94
101: 94
625: 36
630: 36
932: 24
937: 24
1484: 4
1738: 4
3619: 12
3624: 12

@kpeeters (Owner) commented

That suggests that there is potentially a lot to be gained if we can make single-node Ex creation and destruction faster. The Ex copy constructor is pretty expensive right now for single nodes. First of all, it allocates not just 1 node but 3, as there are always the head and feet nodes of the tree. And then the copy_ function is not particularly clever, because it first sets up the top-level node and then replaces that node with a copy of itself (I think I wrote replace first and then realised that this was a quick way to get a full copy, but it's pretty stupid). That is another alloc/dealloc, so it does 5 alloc/dealloc operations in total, instead of just 1. I'll take a shot at removing that overhead, because it's idiotic (for what I should have realised will always be the most relevant case by a large margin).
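
Schematically, with a toy tree<std::string> standing in for Ex (illustrative only, not the actual copy_ code):

```cpp
#include <string>
#include "tree.hh"

// Roughly what the current path does for a single node: the tree constructor
// allocates the head/feet sentinels, set_head allocates a placeholder node,
// and replace() then allocates the copy and frees the placeholder -- five
// alloc/dealloc operations in total.
tree<std::string> copy_single_node_now(tree<std::string>::iterator src)
	{
	tree<std::string> ret;
	auto top = ret.set_head(std::string());
	ret.replace(top, src);
	return ret;
	}

// What the cheap path for a single node could look like: copy just the value,
// with only the unavoidable sentinel allocations on top.
tree<std::string> copy_single_node_direct(tree<std::string>::iterator src)
	{
	tree<std::string> ret;
	ret.set_head(*src);
	return ret;
	}
```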

Of course your PR avoids more overhead, in particular the overhead of the slow operator< for str_node, which needs to compare strings and multipliers, while you just compare pointers. So I think it's still useful to merge this. I just need to make sure that this is not going to introduce complexity which will make life harder further down the road.
