sql/importer: fix duplicated row count and lost update for progress #152397
Conversation
Force-pushed from 52fc6f2 to 9ab713e
sql/importer: initialize bulk summary with empty state to prevent duplication

Previously, the accumulated bulk summary was initialized using `getLastImportSummary(job)`, which could carry over stats from previous import attempts. This caused duplication of bulk operation statistics across different attempts of the same import statement execution. Now `accumulatedBulkSummary` is initialized with an empty `kvpb.BulkOpSummary{}` so that each import attempt starts with a clean state, preventing duplicate entry counts in the job progress summary.

Release note (bug fix): deduplicate import job bulk op summaries in case of import retries.
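To make the double counting concrete, here is a toy Go model (all type, field, and table-ID names are made up for illustration; this is not the importer code): seeding a retried attempt's running summary from the last persisted summary counts the re-ingested rows twice, while starting from an empty summary does not.

```go
package main

import "fmt"

// summary stands in for kvpb.BulkOpSummary; the field and the table ID 42
// below are illustrative only.
type summary struct{ EntryCounts map[uint64]int64 }

func emptySummary() summary { return summary{EntryCounts: map[uint64]int64{}} }

// runAttempt models one import attempt that ingests `rows` rows and adds them
// to an accumulated summary seeded with `seed`.
func runAttempt(seed summary, rows int64) summary {
	acc := emptySummary()
	for id, c := range seed.EntryCounts {
		acc.EntryCounts[id] = c
	}
	acc.EntryCounts[42] += rows // the attempt ingests the rows (again, on retry)
	return acc
}

func main() {
	first := runAttempt(emptySummary(), 100)

	// Buggy: seed the retry from the last persisted summary (getLastImportSummary).
	buggy := runAttempt(first, 100)
	// Fixed: every attempt starts from an empty summary.
	fixed := runAttempt(emptySummary(), 100)

	fmt.Println("buggy:", buggy.EntryCounts[42], "fixed:", fixed.EntryCounts[42]) // buggy: 200 fixed: 100
}
```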
jobs: add SELECT FOR UPDATE to prevent race conditions in job updates

Previously, concurrent job updates (e.g., progress updates and pause/cancel requests) could result in lost updates due to read-then-write race conditions. Both operations would SELECT job data, modify it in memory, then UPDATE it back to the database in separate transactions, causing the later transaction to overwrite changes from the earlier one.

This was particularly problematic for import jobs, where:

1. a progress update routine periodically updates job progress every 10 seconds, and
2. pause/cancel requests can occur concurrently during job execution.

Both use `job.NoTxn()`, which creates independent transactions that can interfere. An example timeline:

```
-- Transaction 1 (Progress)              -- Transaction 2 (Pause)
BEGIN;
SELECT progress WHERE job_id=123;
  -- reads ResumePos[0] = 10
                                         BEGIN;
                                         SELECT progress WHERE job_id=123;
                                           -- reads ResumePos[0] = 10
UPDATE progress SET ResumePos[0] = 20
  WHERE job_id=123;
COMMIT;
                                         UPDATE jobs SET status='pause',
                                           progress=old_progress
                                           WHERE job_id=123;
                                           -- overwrites with ResumePos[0] = 10
                                         COMMIT;

-- Result: Progress update (20) LOST!
```

The race condition was reproducible by stress testing `TestImportDefaultWithResume`.

This commit fixes it by changing the job selection query to a `SELECT FOR UPDATE` statement in `Updater.update()`, ensuring that concurrent transactions serialize access to job records. While this introduces some performance overhead due to row-level locking, it prevents silent data corruption and ensures job progress accuracy.

Release note (bug fix): change the job selection query to a SELECT FOR UPDATE statement in Updater.update(), ensuring that concurrent transactions serialize access to job records. While this introduces some performance overhead due to row-level locking, it prevents silent data corruption and ensures job progress accuracy.
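A minimal sketch of the locking-read pattern, assuming a generic `database/sql` client and illustrative table and column names (this is not the actual `Updater.update()` implementation): with a plain SELECT, two such transactions can both read the stale row and the later commit clobbers the earlier one; with FOR UPDATE the second transaction blocks on the row lock until the first commits, then sees its write.

```go
package jobsdemo

import (
	"context"
	"database/sql"
)

// updateJobProgress sketches a read-modify-write job update that takes a row
// lock up front. Table and column names are illustrative.
func updateJobProgress(ctx context.Context, db *sql.DB, jobID int64,
	modify func(progress []byte) []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer func() { _ = tx.Rollback() }() // no-op after a successful Commit

	var progress []byte
	// The locking read: concurrent updaters of the same job row serialize here
	// instead of both reading the old value and overwriting each other.
	if err := tx.QueryRowContext(ctx,
		`SELECT progress FROM jobs WHERE id = $1 FOR UPDATE`, jobID,
	).Scan(&progress); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE jobs SET progress = $2 WHERE id = $1`, jobID, modify(progress),
	); err != nil {
		return err
	}
	return tx.Commit()
}
```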
…rations

The test was hanging when schema change jobs used FOR UPDATE locks because of a circular dependency: the intercepted schema job held database locks while the test tried to cancel it through the beforeUpdate hook mechanism.

This change reorders the operations to resume the schema job (releasing its FOR UPDATE locks) before attempting cancellation, preventing the deadlock where cancelJob's database operations would block waiting for locks held by the job update query.

Fixes the timeout in `TestMigrationWithFailures/fail_adding_columns` and other test cases that use `cancelSchemaJob: true`.

Release note: None
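A toy Go model of the circular wait and why the reordering resolves it (a mutex and channels stand in for the FOR UPDATE row lock and the beforeUpdate hook; this is not the actual test code):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var rowLock sync.Mutex        // stands in for the FOR UPDATE lock on the job row
	resume := make(chan struct{}) // stands in for resuming the intercepted schema job
	jobDone := make(chan struct{})

	// The intercepted schema-change job: acquires the row lock as part of its
	// update query, then parks in the beforeUpdate hook until it is resumed.
	go func() {
		rowLock.Lock()
		<-resume
		rowLock.Unlock()
		close(jobDone)
	}()

	// Deadlock-free order: resume the job first so it releases the lock...
	close(resume)
	<-jobDone
	// ...then cancel it; the cancellation's own jobs-table update needs the lock.
	rowLock.Lock()
	rowLock.Unlock()
	fmt.Println("job cancelled without deadlocking")
}
```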
…ting

When FOR UPDATE locks are used in job update queries, the timing of context cancellation changes, which can cause statement timeouts to be reported as the generic "query execution canceled" error instead of the expected "query execution canceled due to statement timeout" message.

The issue occurs because FOR UPDATE locks can block the polling queries in WaitForJobs for longer, changing when and how context cancellation is detected. This results in `cancelchecker.QueryCanceledError` being returned instead of the timeout-specific error path being taken.

Release note: None
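A hypothetical sketch of what this means for a test assertion (not the actual test change): since either message can surface depending on when the cancellation is detected, a check on the statement-timeout error has to accept both.

```go
package jobstest // hypothetical package name

import "strings"

// isStatementTimeoutError accepts both of the messages mentioned above: the
// generic "query execution canceled" text is a substring of the more specific
// "query execution canceled due to statement timeout", so one check covers both.
func isStatementTimeoutError(err error) bool {
	if err == nil {
		return false
	}
	return strings.Contains(err.Error(), "query execution canceled")
}
```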
Force-pushed from 1cb018e to f39dc36
I feel like I'm missing something wrt the timelines in the SFU commit; can you help me understand how the second transaction doesn't just get a txn restart? I would expect it to hit the first txn's intent if it hasn't committed, or a write-too-old if it has, and then fail to refresh its (now stale) read when it tries to move its ts past the first txn, and then do a full restart?
Ah, nevermind, I see there is more detail over on #152519. I'll review its thread and comment there if I'm still not following.
Do these transactions updating the jobs progress ever run under Read Committed isolation? Or always under Serializable?
I believe the internal executor only supports serializable.
We met on Friday to dig into what was going on here together; there is a summary on slack but TL;DR is we don't have an SSI violation as originally suggested, just some buggy IMPORT code that has side-effects (zeroing a buffer) in a txn that only become visible on retry.
Closing in favour of #152745
sql/importer: initialize bulk summary with empty state to prevent duplication
Fixes #152543
Previously, the accumulated bulk summary was initialized using `getLastImportSummary(job)`, which could carry over stats from previous import attempts (i.e. resuming the same import job). This caused duplication of bulk operation statistics across different attempts of the same import statement execution.

Now `accumulatedBulkSummary` is initialized with an empty `kvpb.BulkOpSummary{}` to ensure each import attempt starts with a clean state, preventing duplicate entry counts in the job progress summary.

Release note (bug fix): deduplicate import job bulk op summaries in case of import retries.
jobs: add SELECT FOR UPDATE to prevent race conditions in job updates
Fixes #152519
Previously, concurrent job updates (e.g., progress updates and pause/cancel requests) could result in lost updates due to read-then-write race conditions. Both operations would SELECT job data, modify it in memory, then UPDATE it back to the database in separate transactions, causing the later transaction to overwrite changes from the earlier one.

This was particularly problematic for import jobs, where:

1. a progress update routine periodically updates job progress every 10 seconds, and
2. pause/cancel requests can occur concurrently during job execution.

Both use `job.NoTxn()`, which creates independent transactions that can interfere. An example timeline:
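```
-- Transaction 1 (Progress)              -- Transaction 2 (Pause)
BEGIN;
SELECT progress WHERE job_id=123;
  -- reads ResumePos[0] = 10
                                         BEGIN;
                                         SELECT progress WHERE job_id=123;
                                           -- reads ResumePos[0] = 10
UPDATE progress SET ResumePos[0] = 20
  WHERE job_id=123;
COMMIT;
                                         UPDATE jobs SET status='pause',
                                           progress=old_progress
                                           WHERE job_id=123;
                                           -- overwrites with ResumePos[0] = 10
                                         COMMIT;

-- Result: Progress update (20) LOST!
```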
The race condition was reproducible by stress testing `TestImportDefaultWithResume`.

This commit fixes it by changing the job selection query to a `SELECT FOR UPDATE` statement in `Updater.update()`, ensuring that concurrent transactions serialize access to job records. While this introduces some performance overhead due to row-level locking, it prevents silent data corruption and ensures job progress accuracy.

Release note (bug fix): change the job selection query to a SELECT FOR UPDATE statement in Updater.update(), ensuring that concurrent transactions serialize access to job records. While this introduces some performance overhead due to row-level locking, it prevents silent data corruption and ensures job progress accuracy.