test(spanner): make samples instance and backup cleanup robust against FailedPrecondition and quota exhaustion#17141
Conversation
…t FailedPrecondition and quota exhaustion
There was a problem hiding this comment.
Code Review
This pull request improves the robustness of the Spanner sample cleanup process by centralizing instance scrubbing logic into a shared helper function and implementing a more aggressive retry strategy for resource exhaustion and failed preconditions. The updates also ensure that databases are explicitly dropped during the cleanup of an instance. A review comment suggests further improving the error handling in the cleanup helper by catching a broader range of API errors and logging them as warnings to prevent silent failures while maintaining visibility for debugging.
|
|
||
| retry_429(to_scrub.delete)() | ||
| retry_cleanup(to_scrub.delete)() | ||
| except exceptions.NotFound: |
There was a problem hiding this comment.
To prevent the cleanup process from aborting due to a failure on a single instance, catch the broader exceptions.GoogleAPICallError instead of just NotFound. As this is a background task designed for retries, ensure the exception is logged as a warning to provide debugging information without creating excessive error noise.
| except exceptions.NotFound: | |
| except exceptions.GoogleAPICallError as e: | |
| logger.warning(f"Error during cleanup: {e}") |
References
- For exceptions in background tasks that are designed to be retried, log them as warnings rather than errors to reduce noise from transient, recoverable failures.
- Avoid broad except Exception: blocks that silently return None. Instead, log the exception to aid in debugging and prevent masking underlying issues.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Root Cause
During Kokoro continuous test runs for Cloud Spanner samples, tests restore databases from backups. The restored databases hold active backend references to the backups (
backup.referencing_databases) while optimizing in the background.During module teardown,
conftest.pycallsdatabase.drop()and immediately iterates tobackup.delete(). Becausedatabase.drop()is asynchronous on the Spanner backend,backup.delete()frequently fails with aFailedPreconditionerror (backup in use). This unhandled exception crashes the fixture teardown generator, skippinginstance.delete()and leaking instances and backups.As these leaked resources accumulate across frequent Kokoro runs in the shared test project, subsequent runs fail due to
ResourceExhausted(quota exceeded) errors.Solution
retry_cleanup) that retries on bothResourceExhaustedandFailedPrecondition.try...except exceptions.GoogleAPICallErrorblocks so that an error on a single resource never aborts the scrubbing generator.cleanup_old_instances,sample_instance, andmulti_region_instanceteardown fixtures.