How to manage outages

Who is responsible for resolving outages and what should be done

rotating role - new / incorporate to an existing one (Sentry master, community master)
identify an outage
inform team
inform users
be responsible for fixing the issue = try to fix it / ask in the team
inform when the outage is resolved

Sentry
- mostly info about new exceptions
- currently we are informed about each new Sentry issue by mail
- alerts
  - alert rules can be set up - threshold, notification, example
  - alerting is triggered when a threshold is breached
Prometheus can provide alerts as well - we would need to collect some metric about failed tasks and create alert rules for that
analyze DB info we use in dashboard jobs - periodically check the state of the SRPM builds/ Copr builds/ tests in DB from e.g. the last hour
- cron job to check this can be set up

our status page in dashboard
- currently we have there only info about API which only checks https://prod.packit.dev/api/healthz
- we could provide more info:
  - manually added info about current outage, have some template: what and when happened, when it will be fixed
  - history of outages
- if Openshift is down, we cannot communicate here
pinned issue in Github about current outage
- link to dashboard?
IRC - also provide links
status page hosted some other place then the actual project, so that it's possible to communicate the status even if OpenShift is down:
- Cachet - self-hosted status page system
- Statusfy - Static Generated or Server Rendered, variety of hosting services
- cState - built with Hugo, statically generated, can be hosted for free on Github Pages
- Corestatus - turns GitHub issues into a status page, hosted on GitHub
for certain known failures of Testing Farm or Copr "neutral" status on PRs (instead of an error)
- the checks need to be implemented (opened issue for this)
- could be configurable
- better UX for users - PRs are not marked as failed

scope:
- what functionality is broken
- what/who is affected
provide alternatives if there are some
when can users expect a fix
if the outage is caused by other tool/service - provide links (e.g. Copr outage, Testing farm)