Conversation

tstirrat15
Contributor

@tstirrat15 tstirrat15 commented Sep 2, 2025

Fixes #542

Description

Currently, ExportBulk is on the list of retryable methods, so when an export stream errors for any reason, the request is re-issued. This is a problem because the retried request starts from the beginning of the stream and its relationships are naively appended to the output, producing duplicates. Ideally we'd resume from the place where the error occurred, and that's what this PR implements.

TODO

  • Test
  • Figure out if the stream needs to be manually closed

Changes

  • Add ExportBulk to the list of methods that don't get retried automatically
  • Add manual retryable error handling that overwrites the existing stream

Testing

Review. See that tests pass.

@tstirrat15 tstirrat15 force-pushed the 542-fix-backup-behavior branch 2 times, most recently from fad38c0 to f5033bb Compare September 8, 2025 22:55
@tstirrat15 tstirrat15 marked this pull request as ready for review September 8, 2025 22:55
@tstirrat15 tstirrat15 force-pushed the 542-fix-backup-behavior branch 2 times, most recently from c1a2031 to c9afd03 Compare September 8, 2025 22:57
Contributor Author

@tstirrat15 tstirrat15 left a comment

See comments

// retrying the bulk import in backup/restore logic is handled manually.
// retrying bulk export is also handled manually, because the default behavior is
// to start at the beginning of the stream, which produces duplicate values.
selector.StreamClientInterceptor(retry.StreamClientInterceptor(retryOpts...), selector.MatchFunc(isNoneOf(importBulkRoute, exportBulkRoute))),
Contributor Author

This is half of the fix - we don't want to automatically retry ExportBulk requests, because then we're not properly handling restarting the stream.
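As a rough illustration of the exclusion idea, here is a minimal sketch of an `isNoneOf`-style predicate. It is an assumption-laden simplification: the route strings are hypothetical placeholders, and this version matches on plain method-name strings, whereas the predicate in the PR is passed to `selector.MatchFunc` and its actual definition is not shown here.

```go
package main

import "fmt"

// Hypothetical full method names, for illustration only. The real
// importBulkRoute and exportBulkRoute values are not shown in this PR excerpt.
const (
	importBulkRoute = "/example.Service/ImportBulk"
	exportBulkRoute = "/example.Service/ExportBulk"
)

// isNoneOf returns a predicate that reports true only when the given
// full method name matches none of the excluded routes.
// Assumption: this simplified version works on plain strings rather than
// on the call metadata that a real interceptor selector would receive.
func isNoneOf(routes ...string) func(fullMethod string) bool {
	excluded := make(map[string]struct{}, len(routes))
	for _, r := range routes {
		excluded[r] = struct{}{}
	}
	return func(fullMethod string) bool {
		_, found := excluded[fullMethod]
		return !found
	}
}

func main() {
	shouldAutoRetry := isNoneOf(importBulkRoute, exportBulkRoute)
	fmt.Println(shouldAutoRetry(exportBulkRoute))          // false
	fmt.Println(shouldAutoRetry("/example.Service/Check")) // true
}
```

With a predicate like this guarding the retry interceptor, every stream except the two bulk routes keeps the default retry behavior.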

Comment on lines 435 to 440
// Clone the request to ensure that we are keeping all other fields the same
newReq := req.CloneVT()
newReq.OptionalCursor = lastResponse.AfterResultCursor

relationshipStream, err = spiceClient.ExportBulkRelationships(ctx, newReq)
log.Info().Err(err).Str("cursor token", lastResponse.AfterResultCursor.Token).Msg("encountered retryable error, resuming stream after token")
Contributor Author

This is the other half of the fix - manually catching retryable errors and ensuring that we resume the stream in the correct spot.
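To make the resume behavior concrete, here is a self-contained sketch using only the standard library. Everything in it is a stand-in: `stream`, `response`, `errUnavailable`, and `export` are hypothetical names, the cursor is a plain integer rather than a token, and there is no retry cap. It is not the code in this PR, only the same shape: remember the last response, and on a retryable error reopen the stream after its cursor.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errUnavailable stands in for a retryable gRPC error (hypothetical).
var errUnavailable = errors.New("unavailable")

// response is a stand-in for one streamed item: a value plus the
// cursor that resumes the stream after it.
type response struct {
	value  string
	cursor int
}

// stream is a fake server stream over data, starting at position pos.
// It fails once, the first time it is asked for the item at index *failAt.
type stream struct {
	data   []string
	pos    int
	failAt *int // shared across streams so the failure happens only once
}

func (s *stream) Recv() (*response, error) {
	if s.pos == *s.failAt {
		*s.failAt = -1 // do not fail again
		return nil, errUnavailable
	}
	if s.pos >= len(s.data) {
		return nil, io.EOF
	}
	r := &response{value: s.data[s.pos], cursor: s.pos + 1}
	s.pos++
	return r, nil
}

// export drains the stream, reopening it after the last received cursor
// whenever a retryable error occurs, so no value is written twice.
func export(open func(cursor int) *stream) ([]string, error) {
	var out []string
	var last *response
	s := open(0)
	for {
		resp, err := s.Recv()
		if errors.Is(err, io.EOF) {
			return out, nil
		}
		if errors.Is(err, errUnavailable) {
			if last == nil {
				// Assumption: with no response received yet there is no
				// cursor to resume from, so give up instead of
				// dereferencing a nil response.
				return nil, err
			}
			s = open(last.cursor)
			continue
		}
		if err != nil {
			return nil, err
		}
		out = append(out, resp.value)
		last = resp
	}
}

func main() {
	data := []string{"a", "b", "c", "d"}
	failAt := 2
	open := func(cursor int) *stream {
		return &stream{data: data, pos: cursor, failAt: &failAt}
	}
	out, err := export(open)
	fmt.Println(out, err) // [a b c d] <nil>
}
```

Had the loop reopened the stream from the beginning instead, the output would contain "a" and "b" twice, which is the duplication described above.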

Comment on lines +744 to +768
func (m *mockClientForBackup) Recv() (*v1.ExportBulkRelationshipsResponse, error) {
	// If we've run through all our calls, return an EOF
	if m.recvCallIndex == len(m.recvCalls) {
		return nil, io.EOF
	}
	recvCall := m.recvCalls[m.recvCallIndex]
	m.recvCallIndex++
	return recvCall()
}

func (m *mockClientForBackup) ExportBulkRelationships(_ context.Context, req *v1.ExportBulkRelationshipsRequest, _ ...grpc.CallOption) (grpc.ServerStreamingClient[v1.ExportBulkRelationshipsResponse], error) {
	if m.exportCalls == nil {
		// If the caller doesn't supply exportCalls, pass through
		return m, nil
	}
	if m.exportCallsIndex == len(m.exportCalls) {
		// If invoked too many times, fail the test
		m.t.FailNow()
		return m, nil
	}
	exportCall := m.exportCalls[m.exportCallsIndex]
	m.exportCallsIndex++
	exportCall(m.t, req)
	return m, nil
}
Contributor Author

This feels a little awkward/verbose, but it felt like the best way to represent multiple calls to each of these functions.
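For readers unfamiliar with the pattern, the scripted-call approach can be shown in isolation. The sketch below is a stripped-down stand-in (the `scriptedStream` type and its string payload are hypothetical, not the mock from this PR): each call to `Recv` consumes the next closure in a slice, and an exhausted script reports `io.EOF`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// scriptedStream replays a fixed sequence of Recv results, one closure
// per call, and reports io.EOF once the script is exhausted.
// The string payload is a stand-in for the real response type.
type scriptedStream struct {
	recvCalls     []func() (string, error)
	recvCallIndex int
}

func (s *scriptedStream) Recv() (string, error) {
	if s.recvCallIndex >= len(s.recvCalls) {
		return "", io.EOF
	}
	call := s.recvCalls[s.recvCallIndex]
	s.recvCallIndex++
	return call()
}

func main() {
	errBoom := errors.New("boom") // hypothetical transient failure
	s := &scriptedStream{recvCalls: []func() (string, error){
		func() (string, error) { return "first", nil },
		func() (string, error) { return "", errBoom },
		func() (string, error) { return "second", nil },
	}}
	for {
		v, err := s.Recv()
		if errors.Is(err, io.EOF) {
			break
		}
		fmt.Printf("%q %v\n", v, err)
	}
}
```

Using closures rather than a slice of fixed values lets each scripted step run its own assertions or return a different error, at the cost of the verbosity noted above.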

@@ -183,9 +183,9 @@ func TestRestorer(t *testing.T) {
}
}

-type mockClient struct {
+type mockClientForRestore struct {
Contributor Author

Renaming because I created a different one for Backup

@tstirrat15 tstirrat15 force-pushed the 542-fix-backup-behavior branch 2 times, most recently from 9d3cef7 to 9a58bfc Compare September 9, 2025 00:25
@tstirrat15 tstirrat15 force-pushed the 542-fix-backup-behavior branch from 9a58bfc to 314ec3a Compare September 9, 2025 00:37
since at that point we would have not received any response
and we wouldn't have a token.

Also made sure to verify the tests failed if I removed
the usage of the last known token, which led me to
improve the assertions, as the tests were SIGSEGV'ing too.

Also inverted the !errors.Is(err, io.EOF) check, which
is more idiomatic
Contributor

@vroldanbet vroldanbet left a comment

Took the liberty of pushing some changes for an edge case where zed could SIGSEGV if the export call fails on the first attempt. Everything else looks good to me!

@tstirrat15 tstirrat15 merged commit 5faad1c into main Sep 9, 2025
11 checks passed
@tstirrat15 tstirrat15 deleted the 542-fix-backup-behavior branch September 9, 2025 15:02
@github-actions github-actions bot locked and limited conversation to collaborators Sep 9, 2025
// last received response.

// Clone the request to ensure that we are keeping all other fields the same
newReq := req.CloneVT()
Contributor

Why do you need to clone the request? Does req.OptionalCursor = lastResponse.AfterResultCursor not work?

Contributor Author

I don't strictly need to, but it's not expensive and I don't like mutating function parameters.
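A small sketch of the trade-off, using hand-written stand-ins rather than the generated protobuf types (the `exportRequest` and `cursor` structs, the `clone` method, and `resumeFrom` are all hypothetical; `clone` only plays the role that `CloneVT` plays in the PR):

```go
package main

import "fmt"

// cursor and exportRequest are simplified stand-ins for the generated
// protobuf types; only the fields needed for the illustration are kept.
type cursor struct{ Token string }

type exportRequest struct {
	OptionalLimit  uint32
	OptionalCursor *cursor
}

// clone returns a deep copy of the request.
func (r *exportRequest) clone() *exportRequest {
	if r == nil {
		return nil
	}
	out := &exportRequest{OptionalLimit: r.OptionalLimit}
	if r.OptionalCursor != nil {
		out.OptionalCursor = &cursor{Token: r.OptionalCursor.Token}
	}
	return out
}

// resumeFrom builds the request for a resumed stream without touching
// the caller's request. req must be non-nil.
func resumeFrom(req *exportRequest, after *cursor) *exportRequest {
	newReq := req.clone()
	newReq.OptionalCursor = after
	return newReq
}

func main() {
	orig := &exportRequest{OptionalLimit: 1000}
	resumed := resumeFrom(orig, &cursor{Token: "abc"})
	fmt.Println(orig.OptionalCursor == nil)   // true: the caller's request is unchanged
	fmt.Println(resumed.OptionalCursor.Token) // abc
	fmt.Println(resumed.OptionalLimit)        // 1000
}
```

Assigning to `req.OptionalCursor` directly would also work, but the caller would then observe a cursor on a request it passed in without one.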

Development

Successfully merging this pull request may close these issues.

zed backup doesn't resume where it left off
3 participants