Conversation

@ZhouXing19 (Collaborator) commented Aug 23, 2025

sql/importer: initialize bulk summary with empty state to prevent duplication

Fixes #152543

Previously, the accumulated bulk summary was initialized using `getLastImportSummary(job)`, which could carry over stats from previous import attempts (i.e., when resuming the same import job). This caused bulk operation statistics to be duplicated across attempts of the same import statement execution.

Now initialize `accumulatedBulkSummary` with an empty `kvpb.BulkOpSummary{}` so that each import attempt starts with a clean state, preventing duplicate entry counts in the job progress summary.
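
A minimal sketch of the change (the surrounding code and variable shape are assumptions, not the actual diff):

```
// Before: seeding the summary from the last checkpoint double-counts
// everything a resumed attempt re-ingests on top of the carried-over totals.
//   accumulatedBulkSummary := getLastImportSummary(job)

// After: each attempt starts from a zero-valued summary, so job progress
// reflects only the work done by the current attempt.
accumulatedBulkSummary := kvpb.BulkOpSummary{}
```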

Release note (bug fix): fixed duplicated import job bulk op summaries in the case of import retries.


jobs: add SELECT FOR UPDATE to prevent race conditions in job updates

Fixes #152519

Previously, concurrent job updates (e.g., progress updates and pause/cancel requests) could result in lost updates due to read-then-write race conditions. Both operations would SELECT job data, modify it in memory, then UPDATE it back to the database in separate transactions, causing the later transaction to overwrite changes from the earlier one.

This was particularly problematic for import jobs, where:

1. a progress update routine updates job progress every 10 seconds; and
2. pause/cancel requests can occur concurrently during job execution.

Both paths use `job.NoTxn()`, which creates independent transactions that can interfere.

An example timeline would be:

```
-- Transaction 1 (Progress)        -- Transaction 2 (Pause)
BEGIN;
SELECT progress WHERE job_id=123;
-- reads ResumePos[0] = 10
                                   BEGIN;
                                   SELECT progress WHERE job_id=123;
                                   -- reads ResumePos[0] = 10
UPDATE progress
SET ResumePos[0] = 20
WHERE job_id=123;
COMMIT;
                                   UPDATE jobs
                                   SET status='pause', progress=old_progress
                                   WHERE job_id=123;
                                   -- overwrites with ResumePos[0] = 10
                                   COMMIT;

-- Result: Progress update (20) LOST!
```

The race condition was reproducible by stress-testing `TestImportDefaultWithResume`.

This commit fixes it by changing the job selection query to a `SELECT FOR UPDATE` statement in `Updater.update()`, ensuring that concurrent transactions serialize access to job records. While this introduces some performance overhead due to row-level locking, it prevents silent data corruption and ensures job progress accuracy.
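
As a sketch in Go, the query change looks roughly like this (the column list and exact statement are illustrative assumptions; see `Updater.update()` for the real query):

```
// Before: a plain read. Two concurrent updaters can both read the same row,
// and the later COMMIT silently overwrites the earlier write.
const selectJob = `SELECT status, payload, progress FROM system.jobs WHERE id = $1`

// After: FOR UPDATE acquires a row-level lock on the read, so a concurrent
// updater blocks until this transaction commits and then observes its write
// instead of clobbering it with stale in-memory state.
const selectJobForUpdate = `SELECT status, payload, progress FROM system.jobs WHERE id = $1 FOR UPDATE`
```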

Release note (bug fix): changed the job selection query to a `SELECT FOR UPDATE` statement in `Updater.update()`, ensuring that concurrent transactions serialize access to job records. While this introduces some performance overhead due to row-level locking, it prevents silent data corruption and ensures job progress accuracy.

@cockroach-teamcity (Member) commented

This change is Reviewable

@ZhouXing19 force-pushed the import-job-summary-entrycounts branch 14 times, most recently from 52fc6f2 to 9ab713e on August 26, 2025 at 18:39
…rations

The test was hanging when schema change jobs used `FOR UPDATE` locks because
of a circular dependency: the intercepted schema job held database locks
while the test tried to cancel it through the `beforeUpdate` hook mechanism.

This change reorders the operations to resume the schema job (releasing its
`FOR UPDATE` locks) before attempting cancellation, preventing the deadlock
where `cancelJob`'s database operations would block waiting for locks held
by the job update query.
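
A rough sketch of the reordering (the helper and channel names are hypothetical, not the actual test code):

```
// Before: cancellation ran while the intercepted schema-change job still held
// its FOR UPDATE row lock, so cancelJob's own queries blocked on that lock
// and the test deadlocked.
//   cancelJob(ctx, t, db, schemaJobID)
//   close(resumeSchemaJob) // never reached

// After: unblock the schema job first so its update transaction commits and
// releases the row lock, then issue the cancellation.
close(resumeSchemaJob)             // hypothetical channel the beforeUpdate hook waits on
cancelJob(ctx, t, db, schemaJobID) // hypothetical helper issuing CANCEL JOB
```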

Fixes the timeout in `TestMigrationWithFailures/fail_adding_columns` and
other test cases that use `cancelSchemaJob: true`.

Release note: None
…ting

When `FOR UPDATE` locks are used in job update queries, the timing of context
cancellation changes, which can cause statement timeouts to be reported as
generic "query execution canceled" errors instead of the expected "query
execution canceled due to statement timeout" message.

The issue occurs because `FOR UPDATE` locks can block the polling queries in
`WaitForJobs` for longer, changing when and how context cancellation is
detected. This results in `cancelchecker.QueryCanceledError` being returned
instead of the timeout-specific error path being taken.
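
In a test, one way to tolerate this is to accept either message (a sketch using `testutils.IsError`; the actual change may differ):

```
// Either message indicates the statement was canceled; which one surfaces
// depends on where the context cancellation is first observed.
if !testutils.IsError(err, "statement timeout|query execution canceled") {
    t.Fatalf("expected a cancellation error, got: %v", err)
}
```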

Release note: None
@ZhouXing19 force-pushed the import-job-summary-entrycounts branch from 1cb018e to f39dc36 on August 26, 2025 at 20:45
@ZhouXing19 changed the title from "sql/importer: initialize bulk summary with empty state to prevent duplication" to "sql/importer: fix duplicated row count and lost update for progress" on Aug 26, 2025
@ZhouXing19 marked this pull request as ready for review on August 26, 2025 at 20:59
@ZhouXing19 requested review from a team as code owners on August 26, 2025 at 20:59
@ZhouXing19 requested review from jeffswenson and removed the request for a team on August 26, 2025 at 20:59
@ZhouXing19 requested reviews from mw5h, yuzefovich, and michae2, then removed the request for mw5h on August 26, 2025 at 20:59
@msbutler requested a review from dt on August 26, 2025 at 21:54
@dt (Member) commented Aug 27, 2025

I feel like I'm missing something wrt the timelines in the SFU commit; can you help me understand how the second transaction doesn't just get a txn restart? I would expect it to hit the intent for the first one if it hasn't committed, or a write-too-old if it has, and then fail to refresh its (now stale) read when it tries to move its ts past the first txn, and then do a full restart?

@dt (Member) commented Aug 27, 2025

Ah, nevermind, I see there is more detail over on #152519. I'll review its thread and comment there if I'm still not following.

@michae2 (Collaborator) commented Aug 29, 2025

Do these transactions updating the job progress ever run under Read Committed isolation, or always under Serializable?

@jeffswenson (Collaborator) commented

I believe the internal executor only supports serializable.

@dt (Member) commented Sep 2, 2025

We met on Friday to dig into what was going on here together; there is a summary on Slack, but the TL;DR is that we don't have an SSI violation as originally suggested, just some buggy IMPORT code with a side effect (zeroing a buffer) in a txn that only becomes visible on retry.

@ZhouXing19 (Collaborator, Author) commented

Closing in favour of #152745

@ZhouXing19 closed this on Sep 2, 2025
Development

Successfully merging this pull request may close these issues:

- jobs/IMPORT: job progress double counts rows if IMPORT job pauses and resumes
- jobs: Race Condition in Job Progress Updates Leading to Lost Updates