Commit 40feda1

fix: retry all SQS messages on unhandled errors instead of silently dropping them
When the scale-up lambda encountered an unhandled error (e.g. an SSM ThrottlingException during registration token creation), the catch block returned an empty batchItemFailures array. With ReportBatchItemFailures enabled, an empty array tells SQS that all messages were processed successfully, so it permanently deletes them from the queue. As a result, queued GitHub Actions jobs were silently lost: they never got a runner and remained stuck in the 'queued' state indefinitely. The fix returns all message IDs as batch item failures on unhandled errors, so SQS retries them after the visibility timeout.
1 parent 1d57199 commit 40feda1

1 file changed: 5 additions, 3 deletions
lambdas/functions/control-plane/src/lambda.ts

@@ -55,9 +55,11 @@ export async function scaleUpHandler(event: SQSEvent, context: Context): Promise
       batchItemFailures.push(...e.toBatchItemFailures(sqsMessages));
       logger.warn(`${e.detailedMessage} A retry will be attempted via SQS.`, { error: e });
     } else {
-      logger.error(`Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, ignoring batch`, {
-        error: e,
-      });
+      batchItemFailures.push(...sqsMessages.map(({ messageId }) => ({ itemIdentifier: messageId })));
+      logger.error(
+        `Error processing batch (size: ${sqsMessages.length}): ${(e as Error).message}, all messages will be retried via SQS.`,
+        { error: e },
+      );
     }
 
     return { batchItemFailures };
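
To illustrate the behavior this commit relies on, here is a minimal, self-contained sketch of the pattern, not the repository's actual handler. The types are local stand-ins for the `aws-lambda` ones, and `failWholeBatch`, `handleBatch`, and `processBatch` are hypothetical names introduced only for this example.

```typescript
// Local stand-ins for the aws-lambda SQS types (assumption: only these fields matter here).
interface SqsRecord {
  messageId: string;
  body: string;
}

interface BatchItemFailure {
  itemIdentifier: string;
}

interface BatchResponse {
  batchItemFailures: BatchItemFailure[];
}

// Hypothetical helper: report every record in the batch as failed,
// so SQS makes each one visible again after the visibility timeout.
function failWholeBatch(records: SqsRecord[]): BatchItemFailure[] {
  return records.map(({ messageId }) => ({ itemIdentifier: messageId }));
}

// Hypothetical handler shape; `processBatch` stands in for the real scale-up logic.
async function handleBatch(
  records: SqsRecord[],
  processBatch: (records: SqsRecord[]) => Promise<void>,
): Promise<BatchResponse> {
  try {
    await processBatch(records);
    // An empty list means "everything succeeded": SQS deletes all messages.
    return { batchItemFailures: [] };
  } catch {
    // Unhandled error: list every message ID so none is deleted.
    return { batchItemFailures: failWholeBatch(records) };
  }
}
```

Retrying the whole batch can reprocess messages that had already succeeded before the error, so this approach assumes the processing is safe to repeat.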
