Definitions of SLOs and SLIs for Packit
If you are not familiar with SLOs and SLIs, please check the parent document first.
How do others do it?
The CKI team has a pretty comprehensive set of SLOs. Instead of taking the same route, let's start small and slowly add more and more objectives and indicators over time.
The Testing Farm team has service level expectations and a pretty slick dashboard.
Packit SLOs
SLOs are based on discussions with our stakeholders.
Let's define job statuses first to get a clear understanding of these objectives:
- successful run: the job finished and everything is awesome
- failure: the job finished but wasn't successful; the PR needs to be fixed
- error: infrastructure problems prevented the job from completing
SLO1: Changes to GitHub PRs receive a status update within 15 seconds in 99% of cases
It can be frustrating (for us and our users) when we push changes to our PRs while no statuses are being set. Let's make a deadline for packit to set these.
SLO2: 98% of builds have status set to success or failure within 12 hours
Our core functionality is running COPR builds for PRs. We want to be sure those builds either pass or fail and that no error interrupts the build process.
The problem is that some builds take minutes and some take hours, so it's hard to design this objective in a generic way that fits everyone.
SLO3: 95% of test runs have status set to success or failure within 12 hours
Similar to builds, but since Testing Farm is outside of our control, let's lower the percentage.
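To make SLO2 and SLO3 concrete, here is a minimal Python sketch of how compliance could be evaluated from a list of finished and unfinished jobs. The `Job` record, its field names, and the `slo_compliance` and `slo_met` helpers are hypothetical and not part of Packit; the sketch only assumes we know each job's final status and how long it took from being queued to finishing.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional, Sequence


class Status(Enum):
    SUCCESS = "success"  # the job finished and everything is awesome
    FAILURE = "failure"  # the job finished, the PR needs to be fixed
    ERROR = "error"      # infrastructure problems prevented completion


@dataclass
class Job:
    # Hypothetical record; None means the job has not finished yet.
    status: Optional[Status]
    duration: Optional[timedelta]  # time from queued to finished


DEADLINE = timedelta(hours=12)


def slo_compliance(jobs: Sequence[Job], deadline: timedelta = DEADLINE) -> float:
    """Fraction of jobs that ended in success or failure within the deadline."""
    if not jobs:
        return 1.0  # assumption: with no jobs, nothing violated the objective
    good = sum(
        1
        for job in jobs
        if job.status in (Status.SUCCESS, Status.FAILURE)
        and job.duration is not None
        and job.duration <= deadline
    )
    return good / len(jobs)


def slo_met(jobs: Sequence[Job], target: float) -> bool:
    """True if the measured compliance reaches the target (0.98 or 0.95 here)."""
    return slo_compliance(jobs) >= target


# 100 made-up builds: 98 good ones, one error, one failure that came too late.
builds = [Job(Status.SUCCESS, timedelta(minutes=20))] * 98 + [
    Job(Status.ERROR, timedelta(minutes=3)),
    Job(Status.FAILURE, timedelta(hours=13)),
]
print(slo_compliance(builds), slo_met(builds, target=0.98))
```

The same helper would work for test runs with `target=0.95`. Note that a job with status error counts against the objective even if it finished quickly, which matches the wording "status set to success or failure".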
Packit SLIs
If we want to track our SLOs, we need to start measuring different aspects of our workflow.
Number of builds queued
Number of tests queued
Number of builds started
Number of test runs started
Number of builds finished
Number of test runs finished
Time it took for a build to go from queued to finished
Time it took for a test run to go from queued to finished
Number of unfinished builds that are in progress for more than 12 hours
Number of unfinished test runs that are in progress for more than 12 hours
Number of PRs handled by packit with no commit status from packit for more than 15s
Time it takes packit to set the initial status
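As a rough illustration of how several of these indicators could be derived from the same data, here is a small Python sketch. The `JobRecord` type and its timestamp fields are hypothetical, not Packit's actual data model. It also assumes that "in progress for more than 12 hours" is measured from the moment the job was queued; measuring from the start time would be an equally valid reading.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Sequence


@dataclass
class JobRecord:
    # Hypothetical timestamps for one build or test run.
    queued_at: datetime
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None


def counts(jobs: Sequence[JobRecord]) -> Dict[str, int]:
    """Number of jobs queued, started, and finished."""
    return {
        "queued": len(jobs),
        "started": sum(1 for j in jobs if j.started_at is not None),
        "finished": sum(1 for j in jobs if j.finished_at is not None),
    }


def queued_to_finished(jobs: Sequence[JobRecord]) -> List[timedelta]:
    """Time each finished job took to go from queued to finished."""
    return [j.finished_at - j.queued_at for j in jobs if j.finished_at is not None]


def overdue_unfinished(
    jobs: Sequence[JobRecord],
    now: datetime,
    limit: timedelta = timedelta(hours=12),
) -> int:
    """Number of unfinished jobs that were queued more than `limit` ago."""
    return sum(
        1 for j in jobs if j.finished_at is None and now - j.queued_at > limit
    )


# Made-up sample data.
now = datetime(2024, 1, 2, 12, 0)
jobs = [
    JobRecord(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 11, 0)),
    JobRecord(datetime(2024, 1, 1, 20, 0), datetime(2024, 1, 1, 20, 10)),  # still running
    JobRecord(datetime(2024, 1, 2, 9, 0)),  # queued, not started yet
]
print(counts(jobs), queued_to_finished(jobs), overdue_unfinished(jobs, now))
```

The same shape would apply to the 15-second indicator for commit statuses by passing `limit=timedelta(seconds=15)` and treating "status set" as the finishing event.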
Note on the implementation
All the other teams use the Prometheus + Grafana combo, so this will be our choice as well.
Prometheus uses histograms and summaries to measure durations. The recorded values can then be aggregated, for example as averages, sums, maximums, and minimums, or as counts of observations above or below a certain threshold.
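To show why a histogram fits our objectives, here is a toy Python sketch of the idea behind it: observations are counted into buckets with upper bounds, and a running sum and count are kept alongside. This is plain Python for illustration only and does not use the Prometheus client library; the `DurationHistogram` class and the bucket boundaries are made up for the example.

```python
import bisect
from typing import Sequence


class DurationHistogram:
    """Toy histogram: per-bucket counts plus a running sum and count."""

    def __init__(self, upper_bounds: Sequence[float]):
        # The last bucket catches everything above the largest bound.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, seconds: float) -> None:
        # Find the first bucket whose upper bound is >= the observed value.
        index = bisect.bisect_left(self.upper_bounds, seconds)
        self.bucket_counts[index] += 1
        self.total += seconds
        self.count += 1

    def cumulative(self, upper_bound: float) -> int:
        """Observations less than or equal to one of the configured bounds."""
        index = self.upper_bounds.index(upper_bound)
        return sum(self.bucket_counts[: index + 1])

    def fraction_within(self, upper_bound: float) -> float:
        """Share of observations at or below the bound; 1.0 when empty."""
        return self.cumulative(upper_bound) / self.count if self.count else 1.0

    def average(self) -> float:
        return self.total / self.count


# Made-up delays (in seconds) before the initial status was set on a PR.
histogram = DurationHistogram(upper_bounds=[5, 15, 60])
for delay in (2, 7, 12, 14, 40):
    histogram.observe(delay)

print(histogram.fraction_within(15), histogram.average())
```

Choosing a bucket boundary that equals the SLO deadline (15 seconds here) means the share of requests within the deadline can be read directly from the bucket counts, without storing every single observation.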
Ideas
- It would be awesome to know whether our GitHub app correctly accepts all webhook events. Sadly, there is no API to list them, though GitHub has plans for this functionality.