I am planning significant updates to the lab operations pipeline to address current visibility and reliability issues.
Weekly Lab Statistics
We will generate public reports (viewable on Maestro) for lab owners. Alternatively, we could keep them private and deliver them by email only; I need feedback from lab owners on this. These reports will include:
- Availability: Early failures for labs (online checks) and specific device submissions.
- Volume: Total submitted jobs and jobs per device type.
- Outcomes: Result breakdown per device type (OK, Fail, Infrastructure error, Early failure).
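As a rough illustration, the per-device-type outcome breakdown could be computed like this. The `JobResult` record and its field names are hypothetical placeholders, not the real submission schema:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical job record; field names are assumptions, not the real schema.
@dataclass
class JobResult:
    device_type: str
    outcome: str  # one of: "OK", "Fail", "Infrastructure error", "Early failure"

def outcome_breakdown(jobs):
    """Count result outcomes per device type for the weekly report."""
    breakdown = {}
    for job in jobs:
        breakdown.setdefault(job.device_type, Counter())[job.outcome] += 1
    return breakdown

jobs = [
    JobResult("phone-a", "OK"),
    JobResult("phone-a", "Fail"),
    JobResult("tablet-b", "OK"),
]
# Yields one Counter of outcomes per device type.
print(outcome_breakdown(jobs))
```

The same aggregation covers the volume metrics too: total jobs is just the sum of all counts, and jobs per device type is the per-Counter total.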
Availability Alerts
We will implement notifications for lab owners when a lab is unavailable for more than N hours. We could also alert at the per-device level, but I am not sure it is worth it.
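A minimal sketch of the alert condition, assuming we track the timestamp of each lab's last successful online check (the 4-hour threshold and the `last_seen` mapping are placeholders for illustration):

```python
from datetime import datetime, timedelta

# Placeholder for the "N hours" threshold; the real value is still to be decided.
ALERT_THRESHOLD = timedelta(hours=4)

def labs_to_alert(last_seen, now):
    """Return labs whose last successful online check is older than the threshold."""
    return [lab for lab, ts in last_seen.items() if now - ts > ALERT_THRESHOLD]

now = datetime(2024, 1, 1, 12, 0)
last_seen = {
    "lab-east": datetime(2024, 1, 1, 11, 0),  # seen 1h ago -> no alert
    "lab-west": datetime(2024, 1, 1, 5, 0),   # seen 7h ago -> alert
}
print(labs_to_alert(last_seen, now))  # ['lab-west']
```

Extending this to per-device alerts would only require keying `last_seen` by (lab, device) instead of by lab.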
Optimization Plan (Based on Collected Data)
Prioritize High-Value Tests: Ensure sufficient device coverage for tests with active stakeholders.
Load Balancing: Reduce load on overloaded devices and improve tests distribution.
Reliability: Prioritize successful completion over broad coverage (i.e., fewer devices but higher reliability). As with build numbers, many developers expect to see N tests with X failing and Y passing, so they can follow the trend. When the number of test results is unstable, it is hard to see the trend and hard to trust the results.
Job/Test Scoring: Estimate runtime based on history. Labs may block non-whitelisted tests that exceed duration thresholds (e.g., >3 hours).
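The scoring step above could be sketched as follows. Using the median of historical runtimes as the estimate, the whitelist contents, and the exact cutoff logic are all assumptions for illustration; only the 3-hour threshold comes from the plan:

```python
from statistics import median

MAX_RUNTIME_HOURS = 3.0           # duration threshold from the plan
WHITELIST = {"long-soak-test"}    # hypothetical tests allowed to exceed the cap

def estimate_runtime(history_hours):
    """Estimate a test's runtime as the median of historical runs (robust to outliers)."""
    return median(history_hours)

def should_block(test_name, history_hours):
    """Block non-whitelisted tests whose estimated runtime exceeds the cap."""
    if test_name in WHITELIST:
        return False
    return estimate_runtime(history_hours) > MAX_RUNTIME_HOURS

print(should_block("ui-smoke", [0.5, 0.6, 0.4]))         # False
print(should_block("full-regression", [3.5, 4.0, 3.8]))  # True
print(should_block("long-soak-test", [6.0, 7.0]))        # False (whitelisted)
```

The median is a deliberate choice here: a single hung run should not push an otherwise fast test over the threshold.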
Extended Observability
A Maestro dashboard where lab owners can see current errors, latest submission details, "stale" answers, and other potential problems. This will help lab owners quickly identify and address issues.