The monitor_submissions command is a file monitoring and ingestion service that continuously processes KernelCI submission data from a spool directory. It monitors for new JSON files containing any database object data (tests, builds, ...), processes them in parallel, and ingests them into the database.
This command implements a robust ingestion pipeline that:
- Monitors a spool directory for new JSON files
- Processes files in parallel using configurable worker threads
- Validates and transforms submission data
- Handles log excerpts by uploading large ones to external storage
- Inserts data into the database using Django models and ORM
- Archives successfully processed files to prevent reprocessing
- Maintains data integrity with proper error handling and file management
--spool-dir: Path to the spool directory containing JSON files and required subfolders (failed/andarchive/)
--max-workers: Maximum number of worker threads for parallel processing (default: 5)--interval: Check interval in seconds between directory scans (default: 5)--trees-file: Path to YAML file mapping tree names to their URLs (overrides default path "/app/trees.yaml")
The command uses the following environment variables:
STORAGE_TOKEN: Bearer token for uploading log excerpts to external storageSTORAGE_BASE_URL: Base URL for storage service (default: "https://files-staging.kernelci.org")
If STORAGE_TOKEN is not set, log_excerpts will not be uploaded and the original log_excerpt will be inserted in the database.
It is possible to run the ingester within docker, as visible in the docker-compose file. If you want to test it there, it is suggested to change the volume mount from ../spool to just ./backend/spool, which will allow you to interact with it from inside the project files.
If you have permission errors with the directory, try fixing it with sudo chown -R $USER:$USER ./backend/spool as this will change the ownership from docker to your user.
The spool directory must have the following structure:
spool-dir/
├── *.json # Input files to be processed
├── failed/ # Files that failed processing
└── archive/ # Successfully processed files
The command will automatically create missing subdirectories if they don't exist.
- Scans the spool directory for
.jsonfiles everyintervalseconds - Ignores empty files (deletes them automatically)
- Uses ThreadPoolExecutor with configurable
max-workers - Each file is processed in a separate thread for I/O operations
- Database operations are serialized through a single worker thread
- Tree Name Standardization: Maps git repository URLs to standardized tree names using trees.yaml
- Log Excerpt Processing: Large log excerpts (>256 bytes) are uploaded to external storage and replaced with URLs
- Schema Validation: Validates data against KernelCI schema using library
kcidb-io - Schema Upgrade: Upgrades data to the latest schema version if not already up to date
- Success: Files are moved to
archive/subdirectory - Failure: Files are moved to
failed/subdirectory - Empty: Files are deleted immediately
- Maintains an in-memory cache of uploaded log excerpts to avoid duplicate uploads
- Automatically clears cache when it exceeds 100,000 entries (arbitrary limit set by
CACHE_LOGS_SIZE_LIMITvariable)
python manage.py monitor_submissions --spool-dir /path/to/spoolpython manage.py monitor_submissions \
--spool-dir /path/to/spool \
--max-workers 10 \
--interval 2python manage.py monitor_submissions \
--spool-dir /path/to/spool \
--trees-file /custom/path/trees.yaml- JSON Parse Errors: Invalid JSON files are moved to
failed/ - Schema Validation Errors: Files that don't match KernelCI schema are moved to
failed/ - Database Errors: Ignores data and logs error to console, without stopping the execution
- Logexcerpt Storage Upload Errors: Falls back to storing original log excerpt in database if upload fails.
- Failed files remain in
failed/directory for manual inspection - The command can be restarted safely - it will resume processing new files
- Archived files are preserved and won't be reprocessed
- Part of the KernelCI dashboard ingestion pipeline
- Designed to work with KernelCI data submission format (schema v5.3)
- Integrates with external storage service for log excerpt management
- Uses Django ORM for database operations through
insert_submission_data - Command ported from the kcidb-ng repository on PR 1372