Splitting source-git and upstream

Ideas, options, and pros and cons of moving the source-git related work out of packit-service.

What does the source-git workflow mean?

Must have:

If a user creates a merge-request on the source-git repository:
- Create a matching merge-request to the dist-git repository.
- Sync the CI results from the dist-git merge-request to the source-git merge-request.
If the dist-git repository is updated, update the source-git repository by opening a PR.
The user is able to convert a source-git change to a dist-git change locally via the CLI.

Should have:

If the source-git merge-request is updated, update the dist-git merge-request.
If the source-git merge-request is closed, close the dist-git merge-request.

Could have:

The user is able to re-trigger the dist-git CI from the source-git merge-request.
The user is able to re-create the dist-git MR from the source-git merge-request.

Key questions:

Should we start developing a new service or modify the existing packit-service to be able to deploy only with a gitlab-endpoint?
How to link merge requests in the src namespace to the ones in the rpms namespace using the GitLab API? This should be bidirectional.
Check the GitLab API to learn more about working with merge-trains and pipelines, in order to support the UX for merging source-git MRs (see the doc linked below).
How are CI results going to be displayed in dist-git MRs? We want to know this so that we can think about ways to take those results and display them for contributors on the source-git MRs.

Split

We have multiple options:

0. no split

No extra cost of two deployments and two codebases.
New jobs will be implemented as new handlers.
We don't have support for multiple identities for one forge.
- Not hard to do but requires some work.
Fedora-source-git friendly: easy combination of events (different for Fedora and Stream) and handlers (=implementation, can be shared).

1. same codebase, new deployment

No extra cost of maintenance of two codebases.
New jobs are implemented as new handlers.
Different identities can be used in one git forge (=gitlab.com).
Resources can be tweaked separately.
Fedora-source-git friendly: easy combination of events (different for Fedora and Stream) and handlers (=implementation, can be shared).
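
The "new jobs as new handlers" idea from options 0 and 1 can be sketched as a small registry that maps events to handlers. This is only an illustration: the event names, the `reacts_to` decorator and the handler classes are hypothetical and are not taken from the packit-service code. It shows how different events (Fedora vs. Stream) can share one implementation.

```python
from collections import defaultdict

# Hypothetical registry: event name -> handler classes reacting to it.
HANDLERS = defaultdict(list)


def reacts_to(*events):
    """Class decorator that registers a handler for the given event names."""

    def register(cls):
        for event in events:
            HANDLERS[event].append(cls)
        return cls

    return register


# One implementation shared by two different events (Stream and Fedora).
@reacts_to("stream.mr_opened", "fedora.mr_opened")
class CreateDistGitMR:
    def run(self, event):
        return f"create dist-git MR for {event['project']}!{event['iid']}"


# A handler that only the Stream deployment needs in this sketch.
@reacts_to("stream.mr_closed")
class CloseDistGitMR:
    def run(self, event):
        return f"close dist-git MR for {event['project']}!{event['iid']}"


def dispatch(event_name, event):
    """Run every handler registered for the event and collect the results."""
    return [handler().run(event) for handler in HANDLERS.get(event_name, [])]
```

A deployment then only decides which events it listens to; the handler code stays in one place.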

2. separate workers

New jobs are implemented as new handlers in a separate repository.
The centos-stream related code is in one place, based on the packit-service code.
We can use the same or a separate deployment. (Fedora-source-git can be separated or with Stream.)

3. split the packit-service repo and build upstream/centos-stream workers

One repository with the scheduler. Two repositories with the worker (=handlers) definitions: one for the upstream, one for the stream.
Requires more work.
Can lead to a cleaner architecture. Something we have been discussing for some time.
- The centos-stream related code is in one place, upstream code is in one place and the shared code is in one place.
Another dependency in the chain.
- It's sometimes hard to work on the functionality that goes across multiple git projects.
We can use the same or a separate deployment. (Fedora-source-git can be separated or with Stream.)

4. fork and improve

The benefits of the current service code can be preserved.
The non-relevant/bad code can be removed.
More time needed for development.
Improvements relevant for both are hard to sync.
More time needed for maintenance.

5. separate project from scratch

The new service can be more lightweight and efficient.
We can iterate on the prototype more quickly.
We can get rid of the old bad stuff in the packit-service.
We may go through the same pain we've already gone through.
We need to maintain two separate projects. (We need more people, or productivity drops.)
We are not motivated to improve the current projects.
To share the code between the upstream and source project, we need to create some shared libraries.
- Can lead to 3. and/or having another project on our dependency chain.

Dashboard

For the current goals, there is no need for having a dashboard.
- GitLab is our interface, and we can link the same result links as the CI in dist-git.

Database

What are the differences in the schema?

Event related models can be shared (GitProject, JobTriggerModel, RunModel, PullRequestModel, ...).
Result models are probably not necessary (SRPMBuildModel, CoprBuildModel, ...).
Allow/deny list can be preserved.
Models for the setup procedure aren't necessary for stream (InstallationModel, ProjectAuthenticationIssueModel).
Connection between source-git and dist-git MRs can be done by creating a new join table.
If we need to track the results, we need to create a new model for that.
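
The join table mentioned above could look roughly like the following. This is a minimal sketch using the standard-library sqlite3 module instead of the service's real models; the table names, column names and example values are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: a simplified pull-request table plus a join table
# that pairs a source-git MR with its dist-git counterpart(s).
SCHEMA = """
CREATE TABLE pull_requests (
    id INTEGER PRIMARY KEY,
    namespace TEXT NOT NULL,
    repo TEXT NOT NULL,
    pr_id INTEGER NOT NULL,
    UNIQUE (namespace, repo, pr_id)
);
CREATE TABLE source_git_dist_git_link (
    source_git_pr INTEGER NOT NULL REFERENCES pull_requests (id),
    dist_git_pr INTEGER NOT NULL REFERENCES pull_requests (id),
    PRIMARY KEY (source_git_pr, dist_git_pr)
);
"""


def add_pr(conn, namespace, repo, pr_id):
    """Insert a merge request and return its row ID."""
    cur = conn.execute(
        "INSERT INTO pull_requests (namespace, repo, pr_id) VALUES (?, ?, ?)",
        (namespace, repo, pr_id),
    )
    return cur.lastrowid


def dist_git_prs_for(conn, source_git_row_id):
    """Return (namespace, repo, pr_id) of dist-git MRs linked to a source-git MR."""
    return conn.execute(
        "SELECT p.namespace, p.repo, p.pr_id "
        "FROM source_git_dist_git_link AS l "
        "JOIN pull_requests AS p ON p.id = l.dist_git_pr "
        "WHERE l.source_git_pr = ?",
        (source_git_row_id,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
src = add_pr(conn, "redhat/centos-stream/src", "hdparm", 7)
rpm = add_pr(conn, "redhat/centos-stream/rpms", "hdparm", 1)
conn.execute("INSERT INTO source_git_dist_git_link VALUES (?, ?)", (src, rpm))
```

The composite primary key allows more than one dist-git MR per source-git MR, which would be needed if the dist-git MR can be re-created.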

There are multiple ways to manage the database schema after the split:

One schema for all

One schema that works for all use-cases.
The schema is defined in one place.
Databases can contain unused tables.
Schema can be more complicated.

Independent schemas

Each schema fits the use-case.
Common changes are harder to share.

Multiple alembic branches

https://alembic.sqlalchemy.org/en/latest/branches.html
Each use-case has its own alembic branch.
The definition is in one place.
Does not help with sharing of changes. (There is only merge and no rebase available.)

Multiple alembic bases

https://alembic.sqlalchemy.org/en/latest/branches.html#working-with-multiple-bases
Multiple independent alembic migration paths. (We work on multiple branches in parallel.)
Each base can be stored in a different location.
- There can be one shared schema and each use-case (upstream/CentOS-stream/?Fedora) can have a second one.
Ideally, bases are independent, but it is possible to specify dependencies between revisions.
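
As an illustration of a second base, the first migration script of a separate "stream" migration path could look like the sketch below. The revision IDs, the branch label and the table are hypothetical. The module-level identifiers are the ones Alembic reads from a migration script, and `depends_on` shows how a dependency on a revision from the shared base could be declared.

```python
"""Add the table linking source-git and dist-git merge requests.

Hypothetical first revision of a separate "stream" base.
"""

# Alembic reads these module-level identifiers from every migration script.
revision = "1a2b3c4d5e6f"  # hypothetical ID of this revision
down_revision = None  # no parent: this revision starts a new base
branch_labels = ("stream",)  # the path can be addressed as "stream@head"
depends_on = "0f9e8d7c6b5a"  # hypothetical revision in the shared base


def upgrade():
    # Imported here so the sketch can be loaded without Alembic installed.
    import sqlalchemy as sa
    from alembic import op

    op.create_table(
        "source_git_dist_git_link",
        sa.Column("source_git_pr", sa.Integer, primary_key=True),
        sa.Column("dist_git_pr", sa.Integer, primary_key=True),
    )


def downgrade():
    from alembic import op

    op.drop_table("source_git_dist_git_link")
```

The directories holding the individual bases would be listed in the `version_locations` option of the Alembic configuration.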

Linking of the merge-requests

We have two goals:

The user can easily find the related merge request in the web UI (in both directions).
We can get the related merge request from the service.

We can set the dependent merge-requests across multiple projects (e.g. https://gitlab.com/lachmanfrantisek/kernel-ark/-/merge_requests/1)
- The API is not implemented: https://gitlab.com/gitlab-org/gitlab/-/issues/12551
Mentioning the other merge-request is always a possibility. (Mapping would be problematic.)
For the purpose of the service, we can save the pairs to the database.
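
A minimal sketch of the "mention the other merge-request" approach, combined with returning the pair for the database. The function name and the comment texts are made up. The two arguments are assumed to behave like python-gitlab merge request objects, which expose `web_url` and `notes.create()`.

```python
def link_merge_requests(source_mr, dist_mr):
    """Cross-reference two merge requests by commenting on each of them.

    Both arguments are expected to behave like python-gitlab merge request
    objects: they expose `web_url` and `notes.create({"body": ...})`.
    """
    source_mr.notes.create({"body": f"Dist-git merge request: {dist_mr.web_url}"})
    dist_mr.notes.create({"body": f"Source-git merge request: {source_mr.web_url}"})
    # The pair the service would save to its database.
    return (source_mr.web_url, dist_mr.web_url)
```

With python-gitlab, such objects would typically come from `gl.projects.get(path).mergerequests.get(iid)`.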

Some related GitLab issues:

Merge trains

Allow merging multiple MRs in one target branch safely:

We put MRs to the queue.
For each MR, we run a pipeline on the code containing this MR and all the MRs queued before it.
Pipelines are run in parallel to save time.
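
The queueing described above can be illustrated with a few lines of code. This is a toy model of the idea, not GitLab's implementation; the function name is made up.

```python
def merge_train_pipelines(target, queue):
    """Return, for every queued MR, the list of changes its pipeline tests.

    The pipeline of the n-th MR runs on the target branch plus the first
    n MRs in the queue, so all pipelines can start at the same time.
    """
    return {mr: [target] + queue[: i + 1] for i, mr in enumerate(queue)}


trains = merge_train_pipelines("main", ["MR-A", "MR-B", "MR-C"])
```

Here the pipeline for `MR-C` tests `main` together with `MR-A`, `MR-B` and `MR-C`.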

Conclusion:

This feature is something like zuul's gating pipeline with auto-rebase done in parallel.
It's not meant for cross-project MRs => not useful for us.

Sources:

Multi-project pipelines

https://docs.gitlab.com/ee/ci/multi_project_pipelines.html
Pipeline to trigger a pipeline in a different project (e.g. generate documentation in a different repo once code change is merged).

Conclusion:

Does not give us much benefit. (We need to work with a dynamic reference to the second repository.)

Parent-child pipelines

https://docs.gitlab.com/ee/ci/parent_child_pipelines.html
Pipeline can trigger a set of concurrently running child pipelines within the same project.
- Child jobs are not dependent on the state of unrelated jobs in the parent pipeline.
- Configuration can be split into multiple smaller easy-to-understand parts.
- Avoids name collisions. (Comparing to pure import.)

Pipelines API

Looks like pipelines need to be defined beforehand. We can manipulate only defined ones.

REST API: https://docs.gitlab.com/ee/api/pipelines.html
Python API: https://python-gitlab.readthedocs.io/en/stable/gl_objects/pipelines_and_jobs.html

Pipelines for merge requests are configured. A detached pipeline runs in the context of the merge request, and not against the merged result. Learn more in the documentation for Pipelines for Merged Results.

To support this, we can define the pipelines, then wait for and fetch the results from the Packit API. (Goes against the current Packit workflow.)

We can also have a custom gitlab runner running our implementation in our infrastructure.

Results are shown as pipelines.
Completely independent of our celery-based workflow.
A funny example: https://about.gitlab.com/blog/2018/06/29/introducing-auto-breakfast-from-gitlab/

Potentially, we can use "only: - external" when defining the pipeline and combine it with commit statuses:

https://gitlab.com/gitlab-org/gitlab/-/issues/20907#note_300399873
But I can't make this approach work.

Some related GitLab issues:

Commit status

Commit statuses are converted to external jobs of a detached pipeline.
Only one stage (=column) is possible in the pipeline chart: https://gitlab.com/gitlab-org/gitlab/-/issues/19177
https://docs.gitlab.com/ee/api/commits.html#commit-status
Python API: https://python-gitlab.readthedocs.io/en/stable/gl_objects/commits.html#commit-status
OGR: https://packit.github.io/ogr/services/gitlab/flag.html

CI results

Currently, the CI results are shown as pipelines.
Example: https://gitlab.com/redhat/centos-stream/rpms/hdparm/-/merge_requests/1/pipelines
There is a Python API for pipelines: https://python-gitlab.readthedocs.io/en/stable/gl_objects/pipelines_and_jobs.html
- (No OGR support yet: https://github.com/packit/ogr/issues/420)
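
Syncing the CI results could be sketched by combining the two APIs above: read the pipelines of the dist-git MR and report the latest one as a commit status on the source-git commit. The function name, the status name, the mapping table and the choice of the highest pipeline ID as "latest" are assumptions. The objects are expected to behave like python-gitlab ones (`mr.pipelines.list()`, `project.commits.get()`, `commit.statuses.create()`).

```python
# Pipeline statuses mapped to states accepted by the commit status API.
# Treating "skipped" as "canceled" is an assumption of this sketch.
PIPELINE_TO_COMMIT_STATE = {
    "created": "pending",
    "waiting_for_resource": "pending",
    "preparing": "pending",
    "pending": "pending",
    "scheduled": "pending",
    "manual": "pending",
    "running": "running",
    "success": "success",
    "failed": "failed",
    "canceled": "canceled",
    "skipped": "canceled",
}


def sync_ci_result(dist_mr, source_project, source_sha, name="dist-git CI"):
    """Mirror the latest dist-git MR pipeline as a source-git commit status."""
    pipelines = dist_mr.pipelines.list()
    if not pipelines:
        return None
    # Assumption: the pipeline with the highest ID is the most recent one.
    latest = max(pipelines, key=lambda pipeline: pipeline.id)
    state = PIPELINE_TO_COMMIT_STATE.get(latest.status, "pending")
    commit = source_project.commits.get(source_sha)
    commit.statuses.create(
        {
            "state": state,
            "name": name,
            # Link to the pipelines tab of the dist-git MR.
            "target_url": f"{dist_mr.web_url}/pipelines",
            "description": f"dist-git pipeline {latest.id}: {latest.status}",
        }
    )
    return state
```

Because commit statuses show up as external jobs, contributors would see the dist-git result directly on the source-git MR.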

Answers to Key questions

Should we start developing a new service or modify the existing packit-service to be able to deploy only with a gitlab-endpoint?
- If we want to have a clean architecture, we can use option 3 with separate deployments. Option 2 can be done as an intermediate step.
How to link merge requests in the src namespace to the ones in the rpms namespace using the GitLab API? This should be bidirectional.
- There is only a GUI to do that. We can use comments via API.
Check the GitLab API to learn more about working with merge-trains and pipelines, in order to support the UX for merging source-git MRs (see the doc linked below).
- Looks like we can't use any GitLab structure to make this automatic, but can provide the UX independently.
- We can define pipelines or use commit statuses (=detached pipelines).
How are CI results going to be displayed in dist-git MRs? We want to know this so that we can think about ways to take those results and display them for contributors on the source-git MRs.
- Displayed as pipelines.

The plan

Set up a new repository for the stream worker.
- stream-worker/packit-stream-worker/source-git-worker/...
- Build the image in quay.
- Set up zuul and pre-commit.ci.
- Create a stable branch.
Set up a new deployment repository for stream.
- Create the new playbooks and share as much as possible with the current workflow.
- We will have only one deployment for stream to start with.
- Increase (=buy) the resources in openshift online.
- Deploy the stream service to openshift online.
- Update the script for moving stable branches.
Implement the stream worker.
- New celery tasks are defined.
- Implementation is done as new handlers.
- Start with the really basic version so we can work on deployment ASAP:
  - If user creates a merge-request on the source-git repository, create a matching merge-request to the dist-git repository.
Put the current worker away from the service.
- packit-worker/packit-upstream-worker/...
- What about the process_message task? (SPIKE card for that has been created.)
  - Do we want to share it? What about a dedicated worker just for this?
- Move the build process.
- Set up zuul and pre-commit.ci.
- Update deployment if necessary.
- Update the script for moving stable branches.
- (Can be done in parallel with other steps.)
Implement "Sync the CI results from the dist-git merge-request to the source-git merge-request."
Implement "If the dist-git is updated, update the source-git repository by opening a PR."
Implement "If the source-git merge-request is updated, update the dist-git merge-request."
Implement "If the source-git merge-request is closed, close the dist-git merge-request."
