Who is responsible for resolving outages and what should be done
- rotating role - new / incorporate to an existing one (Sentry master, community master)
- identify an outage
- inform team
- inform users
- be responsible for fixing the issue = try to fix it / ask in the team
- inform when the outage is resolved
How to identify an outage
- Sentry
- mostly info about new exceptions
- currently we are informed about each new Sentry issue by mail
- alerts
- alert rules can be set up - threshold, notification,
example
- alerting is triggered when a threshold is breached
- Prometheus can provide alerts as well - we would need to collect some metric about failed tasks and create alert rules
for that
- analyze DB info we use in dashboard jobs - periodically check the state of the SRPM builds/ Copr builds/ tests in DB from e.g. the last hour
- cron job to check this can be set up
- IRC - the messages are mirrored to Telegram, this should be enough
- our status page in dashboard
- currently we have there only info about API which only checks https://prod.packit.dev/api/healthz
- we could provide more info:
- manually added info about current outage, have some template: what and when happened, when it will be fixed
- history of outages
- if Openshift is down, we cannot communicate here
- pinned issue in Github about current outage
- IRC - also provide links
- status page hosted some other place then the actual project, so that it's possible to communicate the status even if OpenShift is down:
- Cachet - self-hosted status page system
- Statusfy - Static Generated or Server Rendered, variety of hosting
services
- cState - built with Hugo, statically generated, can be hosted for free on Github Pages
- Corestatus - turns GitHub issues into a status page, hosted on GitHub
- for certain known failures of Testing Farm or Copr "neutral" status on PRs (instead of an error)
- the checks need to be implemented (opened issue for this)
- could be configurable
- better UX for users - PRs are not marked as failed
What should be included
- scope:
- what functionality is broken
- what/who is affected
- provide alternatives if there are some
- when can users expect a fix
- if the outage is caused by other tool/service - provide links (e.g. Copr outage, Testing farm)