Upgrade actions can be ignored due to stale entries in bkgActions #9629

@pkoutsovasilis

Background

Actions dequeued from the dispatcher action store and dispatched to the coordinator upgrader:

  1. Are not retried on failure.
  2. Are not persisted across an agent restart (although, presumably, fleet-server should re-send them on the first check-in after the restart?).

Note: this is probably worth tracking as a separate investigation issue.

Issue

If Elastic Defend is running and agent tamper protection is enabled, an upgrade action can be left stale in bkgActions. Specifically, the handler adds the action to bkgActions when it calls getAsyncContext, but if Elastic Defend cannot acknowledge the upgrade action, the handler returns the error without removing the action from bkgActions:

asyncCtx, runAsync := h.getAsyncContext(ctx, a, ack)
if !runAsync {
	return nil
}
if h.tamperProtectionFn() {
	// Find inputs that want to receive UPGRADE action
	// Endpoint needs to receive a signed UPGRADE action in order to be able to uncontain itself
	state := h.coord.State()
	ucs := findMatchingUnitsByActionType(state, a.Type())
	if len(ucs) > 0 {
		h.log.Debugf("handlerUpgrade: proxy/dispatch action '%+v'", a)
		err := notifyUnitsOfProxiedAction(ctx, h.log, action, ucs, h.coord.PerformAction)
		h.log.Debugf("handlerUpgrade: after action dispatched '%+v', err: %v", a, err)
		if err != nil {
			return err
		}
	} else {
		// Log and continue
		h.log.Debugf("No components running for %v action type", a.Type())
	}
}
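
To make the failure mode concrete, here is a minimal, self-contained sketch of the bookkeeping described above. It is not the agent's code: `upgradeAction`, `handler`, `clear`, `handle`, and `simulate` are hypothetical names, `notify` stands in for `notifyUnitsOfProxiedAction`, and the duplicate check in `getAsyncContext` is an assumed simplification (same version and source as the tracked action means "already in flight"). The `cleanup` flag toggles one possible remedy, clearing `bkgActions` on the early error return.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// upgradeAction is a hypothetical stand-in for the real upgrade action type.
type upgradeAction struct {
	ID, Version, SourceURI string
}

// handler is a simplified model of the upgrade handler; only the
// bkgActions bookkeeping relevant to this issue is reproduced.
type handler struct {
	mu         sync.Mutex
	bkgActions []upgradeAction
	notify     func(upgradeAction) error // stands in for notifyUnitsOfProxiedAction
}

// getAsyncContext tracks the action and reports whether it should run.
// Assumption: an action with the same version and source as the tracked
// one is treated as a duplicate of an in-flight upgrade and is not run.
func (h *handler) getAsyncContext(a upgradeAction) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if len(h.bkgActions) > 0 {
		cur := h.bkgActions[0]
		if cur.Version == a.Version && cur.SourceURI == a.SourceURI {
			return false
		}
	}
	h.bkgActions = []upgradeAction{a}
	return true
}

// clear empties bkgActions under the lock.
func (h *handler) clear() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.bkgActions = nil
}

// handle mirrors the flow in the issue. With cleanup=false it models the
// reported behaviour: a notify error returns early and leaves the action
// in bkgActions. With cleanup=true the entry is removed before returning.
func (h *handler) handle(a upgradeAction, cleanup bool) (accepted bool, err error) {
	if !h.getAsyncContext(a) {
		return false, nil
	}
	if err := h.notify(a); err != nil {
		if cleanup {
			h.clear()
		}
		return true, err
	}
	// The real handler would start the upgrade in the background here;
	// this model simply treats it as finished and clears the entry.
	h.clear()
	return true, nil
}

// simulate sends an upgrade whose notification fails, then an identical
// upgrade whose notification would succeed, and reports what happened.
func simulate(cleanup bool) (firstErr error, retryAccepted bool) {
	fail := true
	h := &handler{notify: func(upgradeAction) error {
		if fail {
			return errors.New("endpoint did not acknowledge the action")
		}
		return nil
	}}
	_, firstErr = h.handle(upgradeAction{ID: "a1", Version: "9.9.9", SourceURI: "https://example.invalid"}, cleanup)
	fail = false
	retryAccepted, _ = h.handle(upgradeAction{ID: "a2", Version: "9.9.9", SourceURI: "https://example.invalid"}, cleanup)
	return firstErr, retryAccepted
}

func main() {
	for _, cleanup := range []bool{false, true} {
		err, accepted := simulate(cleanup)
		fmt.Printf("cleanup=%v firstErr=%v retryAccepted=%v\n", cleanup, err != nil, accepted)
	}
}
```

In this model, the retry is silently dropped when the early return skips cleanup (`retryAccepted=false`) and goes through once the stale entry is removed (`retryAccepted=true`). Whether the actual fix should clear the entry at the return site, use a deferred cleanup, or move the `getAsyncContext` call after the proxied notification is a design choice the sketch does not settle.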

Impact

  • Upgrade actions may remain permanently stuck in bkgActions.
  • Subsequent upgrade attempts with the same version and source are ignored.
  • Likely the cause of multiple recent internal error reports.

For confirmed bugs, please report:

  • Version: All active releases
  • Operating System: All
