Fix backup retry behavior #543

tstirrat15 · 2025-09-02T21:54:57Z

Fixes #542

Description

Currently, ExportBulk is on the list of retryable methods, so when an export stream errors for any reason, the request is re-issued. This is a problem because the re-issued request starts at the beginning of the stream and its relationships are appended naively to the output, producing duplicates. Ideally we'd resume from the place where the error occurred, and that's what this PR implements.

TODO

Test
Figure out if the stream needs to be manually closed

Changes

Add ExportBulk to the list of methods that don't get retried automatically
Add manual retryable error handling that overwrites the existing stream

Testing

Review. See that tests pass.

tstirrat15

See comments

tstirrat15 · 2025-09-08T22:56:11Z

internal/client/client.go

+		// retrying the bulk import in backup/restore logic is handled manually.
+		// retrying bulk export is also handled manually, because the default behavior is
+		// to start at the beginning of the stream, which produces duplicate values.
+		selector.StreamClientInterceptor(retry.StreamClientInterceptor(retryOpts...), selector.MatchFunc(isNoneOf(importBulkRoute, exportBulkRoute))),


This is half of the fix - we don't want to automatically retry ExportBulk requests, because an automatic retry restarts the stream from the beginning instead of resuming it.
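
For readers unfamiliar with the pattern, here is a minimal, stdlib-only sketch of how a predicate like isNoneOf could exclude specific routes from automatic retries. The route strings and the predicate's signature are assumptions for illustration only; the PR's real helper and the go-grpc-middleware types it plugs into are not shown in this excerpt.

```go
package main

import "fmt"

// Placeholder route names (assumptions); the real constants are defined
// in internal/client/client.go and are not shown in this excerpt.
const (
	importBulkRoute = "/example.v1.Service/ImportBulk"
	exportBulkRoute = "/example.v1.Service/ExportBulk"
)

// isNoneOf returns a predicate that reports true only when the called
// method is not one of the excluded routes.
func isNoneOf(routes ...string) func(fullMethod string) bool {
	excluded := make(map[string]struct{}, len(routes))
	for _, r := range routes {
		excluded[r] = struct{}{}
	}
	return func(fullMethod string) bool {
		_, found := excluded[fullMethod]
		return !found
	}
}

func main() {
	autoRetry := isNoneOf(importBulkRoute, exportBulkRoute)
	// The bulk routes are excluded from automatic retries; other routes are not.
	fmt.Println(autoRetry(exportBulkRoute))
	fmt.Println(autoRetry("/example.v1.Service/CheckPermission"))
}
```

With a predicate of this shape, the retry interceptor only wraps calls for which it returns true, so the bulk routes fall through to whatever manual handling the caller implements.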

tstirrat15 · 2025-09-08T22:57:25Z

internal/cmd/backup.go

+				// Clone the request to ensure that we are keeping all other fields the same
+				newReq := req.CloneVT()
+				newReq.OptionalCursor = lastResponse.AfterResultCursor
+
+				relationshipStream, err = spiceClient.ExportBulkRelationships(ctx, newReq)
+				log.Info().Err(err).Str("cursor token", lastResponse.AfterResultCursor.Token).Msg("encountered retryable error, resuming stream after token")


This is the other half of the fix - manually catching retryable errors and ensuring that we resume the stream in the correct spot.

tstirrat15 · 2025-09-08T22:59:19Z

internal/cmd/backup_test.go

+func (m *mockClientForBackup) Recv() (*v1.ExportBulkRelationshipsResponse, error) {
+	// If we've run through all our calls, return an EOF
+	if m.recvCallIndex == len(m.recvCalls) {
+		return nil, io.EOF
+	}
+	recvCall := m.recvCalls[m.recvCallIndex]
+	m.recvCallIndex++
+	return recvCall()
+}
+
+func (m *mockClientForBackup) ExportBulkRelationships(_ context.Context, req *v1.ExportBulkRelationshipsRequest, _ ...grpc.CallOption) (grpc.ServerStreamingClient[v1.ExportBulkRelationshipsResponse], error) {
+	if m.exportCalls == nil {
+		// If the caller doesn't supply exportCalls, pass through
+		return m, nil
+	}
+	if m.exportCallsIndex == len(m.exportCalls) {
+		// If invoked too many times, fail the test
+		m.t.FailNow()
+		return m, nil
+	}
+	exportCall := m.exportCalls[m.exportCallsIndex]
+	m.exportCallsIndex++
+	exportCall(m.t, req)
+	return m, nil
+}


This feels a little awkward/verbose, but it felt like the best way to represent multiple calls to each of these functions.

tstirrat15 · 2025-09-08T22:59:32Z

internal/cmd/restorer_test.go

@@ -183,9 +183,9 @@ func TestRestorer(t *testing.T) {
 	}
 }

-type mockClient struct {
+type mockClientForRestore struct {


Renaming because I created a different one for Backup

vroldanbet

Took the liberty to push some changes for an edge case where zed could SIGSEGV if the export call fails on the first attempt, since at that point we would not have received any response and we wouldn't have a token. Also made sure to verify the tests failed if I removed the usage of the last known token, which led me to improve the assertions, as the tests were SIGSEGV'ing too. Also inverted the !errors.Is(err, io.EOF) check, which is more idiomatic. Everything else looks good to me!
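
The first-attempt edge case can be sketched as a nil guard around the last received response. The cursor and response types below are hypothetical stand-ins for the generated API types, and resumeCursor is an illustrative helper name, not a function from the PR.

```go
package main

import "fmt"

// cursor and response are simplified stand-ins for the generated API types.
type cursor struct{ Token string }

type response struct{ AfterResultCursor *cursor }

// resumeCursor returns the cursor to resume from, or nil when no response
// has been received yet, in which case the export restarts from the beginning.
func resumeCursor(last *response) *cursor {
	if last == nil {
		// Reading last.AfterResultCursor here would panic with a nil
		// pointer dereference, which is the crash described above.
		return nil
	}
	return last.AfterResultCursor
}

func main() {
	fmt.Println(resumeCursor(nil) == nil)
	fmt.Println(resumeCursor(&response{AfterResultCursor: &cursor{Token: "abc"}}).Token)
}
```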

miparnisari · 2025-09-10T18:13:58Z

internal/cmd/backup.go

+				// last received response.
+
+				// Clone the request to ensure that we are keeping all other fields the same
+				newReq := req.CloneVT()


Why do you need to clone the request? Does req.OptionalCursor = lastResponse.AfterResultCursor not work?

tstirrat15 · 2025-09-11

I don't strictly need to, but it's not expensive and I don't like mutating function parameters.
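
A small sketch of the trade-off being discussed: setting the cursor on a copy leaves the caller's request untouched, whereas assigning through the pointer would change it for the caller too. The request type and its hand-written clone method are assumptions standing in for the generated type and its CloneVT method.

```go
package main

import "fmt"

// cursor and request are simplified stand-ins for the generated API types.
type cursor struct{ Token string }

type request struct {
	Limit          uint32
	OptionalCursor *cursor
}

// clone is a hand-written stand-in for the generated CloneVT method.
func (r *request) clone() *request {
	if r == nil {
		return nil
	}
	c := *r
	if r.OptionalCursor != nil {
		cc := *r.OptionalCursor
		c.OptionalCursor = &cc
	}
	return &c
}

// withCursor returns a copy of req that resumes after the given token,
// leaving the caller's request unmodified.
func withCursor(req *request, token string) *request {
	newReq := req.clone()
	newReq.OptionalCursor = &cursor{Token: token}
	return newReq
}

func main() {
	orig := &request{Limit: 1000}
	resumed := withCursor(orig, "abc")
	// The original still has no cursor; the copy keeps Limit and gains one.
	fmt.Println(orig.OptionalCursor == nil, resumed.Limit, resumed.OptionalCursor.Token)
}
```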

Fix backup retry behavior

cea55c7

tstirrat15 force-pushed the 542-fix-backup-behavior branch 2 times, most recently from fad38c0 to f5033bb Compare September 8, 2025 22:55

tstirrat15 marked this pull request as ready for review September 8, 2025 22:55

tstirrat15 force-pushed the 542-fix-backup-behavior branch 2 times, most recently from c1a2031 to c9afd03 Compare September 8, 2025 22:57

tstirrat15 commented Sep 8, 2025

tstirrat15 force-pushed the 542-fix-backup-behavior branch 2 times, most recently from 9d3cef7 to 9a58bfc Compare September 9, 2025 00:25

Add tests

314ec3a

tstirrat15 force-pushed the 542-fix-backup-behavior branch from 9a58bfc to 314ec3a Compare September 9, 2025 00:37

vroldanbet approved these changes Sep 9, 2025

tstirrat15 merged commit 5faad1c into main Sep 9, 2025
11 checks passed

tstirrat15 deleted the 542-fix-backup-behavior branch September 9, 2025 15:02

github-actions bot locked and limited conversation to collaborators Sep 9, 2025

miparnisari reviewed Sep 10, 2025

tstirrat15 commented Sep 11, 2025