
Definitions of SLOs and SLIs for Packit

If you are not familiar with SLOs and SLIs, please check the parent page first.

How do others do it?

The CKI team has a pretty comprehensive set of SLOs. Instead of taking the same route, let's start small and slowly add more and more objectives and indicators over time.

The Testing Farm team has service level expectations and a pretty slick dashboard.

Packit SLOs

SLOs are based on discussions with our stakeholders.

Let's define job statuses first to get a clear understanding of these objectives:

  • successful run: a job finished and everything is awesome
  • failure: the job finished and wasn't successful: the PR needs to be fixed
  • error: infrastructure problems prevented the job from completing

SLO1: Changes to GitHub PRs receive a status update within 15 seconds in 99% of cases

It can be frustrating (for us and our users) when we push changes to our PRs while no statuses are being set. Let's make a deadline for packit to set these.
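As a minimal sketch of how SLO1 could be evaluated, assume we already have, for each PR change, the delay in seconds until packit set the first status. The function name and the sample data below are hypothetical and not part of packit.

```python
def slo1_met(delays_s, deadline_s=15.0, target=0.99):
    """Return (fraction_within_deadline, slo_met) for a list of delays.

    delays_s: seconds between a PR change and the first status update;
              None means no status was ever set and counts as a miss.
    """
    if not delays_s:
        # No events means nothing violated the objective.
        return 1.0, True
    on_time = sum(1 for d in delays_s if d is not None and d <= deadline_s)
    fraction = on_time / len(delays_s)
    return fraction, fraction >= target


# Hypothetical sample: 200 PR changes, 3 of them late or never updated.
sample = [2.5] * 197 + [16.0, 40.0, None]
fraction, ok = slo1_met(sample)
print(f"{fraction:.3f} {ok}")
```

With this sample, 197 of 200 changes are on time, which is below the 99% target, so the objective is reported as missed.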

SLO2: 98% of builds have status set to success or failure within 12 hours

Our core functionality is to run COPR builds for PRs. We want to be sure those builds either pass or fail and that no error interrupts the build process.

The problem is that some builds take minutes and some take hours, so it's hard to design this objective in a generic way for everyone.

SLO3: 95% of test runs have status set to success or failure within 12 hours

Similar to builds, but since Testing Farm is outside of our control, let's lower the percentage.
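SLO2 and SLO3 share the same shape: the share of runs that reach success or failure within 12 hours. A rough sketch of that calculation follows. The JobRun record, the RUNNING state, and the sample data are assumptions made for illustration; they do not describe packit's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"
    RUNNING = "running"  # hypothetical extra state for unfinished jobs


@dataclass
class JobRun:
    queued_at: datetime
    status: JobStatus
    finished_at: Optional[datetime] = None


def compliance(runs, limit=timedelta(hours=12)):
    """Fraction of runs that ended in success or failure within `limit`."""
    if not runs:
        return 1.0
    good = 0
    for run in runs:
        conclusive = run.status in (JobStatus.SUCCESS, JobStatus.FAILURE)
        in_time = (
            run.finished_at is not None
            and run.finished_at - run.queued_at <= limit
        )
        if conclusive and in_time:
            good += 1
    return good / len(runs)


# Hypothetical data: 96 passes, 1 failure, 2 errors, 1 run still in progress.
t0 = datetime(2024, 1, 1)
runs = (
    [JobRun(t0, JobStatus.SUCCESS, t0 + timedelta(minutes=20))] * 96
    + [JobRun(t0, JobStatus.FAILURE, t0 + timedelta(hours=3))]
    + [JobRun(t0, JobStatus.ERROR, t0 + timedelta(minutes=5))] * 2
    + [JobRun(t0, JobStatus.RUNNING)]
)
ratio = compliance(runs)
print(ratio >= 0.98, ratio >= 0.95)
```

The same data would miss the 98% target used for builds but meet the 95% target used for test runs, which shows how much the chosen percentage matters.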

Packit SLIs

If we want to track our SLOs, we need to start measuring different aspects of our workflow.

  • Number of builds queued

  • Number of tests queued

  • Number of builds started

  • Number of test runs started

  • Number of builds finished

  • Number of test runs finished

  • Time it took for a build to go from queued to finished

  • Time it took for a test run to go from queued to finished

  • Number of unfinished builds that are in progress for more than 12 hours

  • Number of unfinished test runs that are in progress for more than 12 hours

  • Number of PRs handled by packit with no commit status from packit for more than 15s

  • Time it takes packit to set the initial status

Note on the implementation

All the other teams use the Prometheus + Grafana combo, so this will be our choice as well.

Prometheus provides histogram and summary metric types for measuring durations. The recorded values can then be aggregated, for example into averages, sums, maximums, minimums, or the number of observations above or below a certain value.
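To illustrate why a histogram fits duration-based SLIs, here is a simplified, standard-library-only model of a bucketed histogram that tracks bucket counts, a running sum, and an observation count. It is not the Prometheus client API; the class name, method names, and bucket bounds are assumptions chosen for this sketch.

```python
import bisect
import math


class DurationHistogram:
    """Simplified model of a bucketed histogram (not the Prometheus client)."""

    def __init__(self, upper_bounds):
        # Bucket upper bounds in seconds; +Inf catches everything else.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.total = 0.0  # sum of all observed values
        self.n = 0        # number of observations

    def observe(self, seconds):
        # Pick the first bucket whose upper bound is >= the observed value.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total += seconds
        self.n += 1

    def fraction_le(self, bound):
        """Share of observations <= bound; bound must be a bucket boundary."""
        idx = self.bounds.index(bound)
        return sum(self.counts[: idx + 1]) / self.n if self.n else 1.0

    def mean(self):
        return self.total / self.n if self.n else 0.0


# Hypothetical delays (seconds) until the initial status was set.
hist = DurationHistogram([1, 5, 15, 60])
for s in [0.4, 2.0, 3.5, 14.0, 15.0, 42.0]:
    hist.observe(s)
print(hist.fraction_le(15), hist.mean())
```

Choosing a bucket boundary at exactly 15 seconds makes the SLO1 question ("what share of updates arrived within 15 seconds?") answerable directly from the bucket counts, without storing every individual measurement.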

Ideas