Refactor RemoteRepository object
This document describes the current usage of RemoteRepository objects and proposes a new normalized modeling.
Goals
* De-duplicate data stored in our database.
* Save only one RemoteRepository per GitHub repository.
* Use an intermediate table between RemoteRepository and User to store associated remote data for the specific user.
* Make this model usable from our SSO implementation (adding remote_id field in Remote objects).
* Use Postgres JSONField to store associated json remote data.
* Make Project connect directly to RemoteRepository without being linked to a specific User.
* Do not disconnect Project and RemoteRepository when a user deletes/disconnects their account.
Non-goals
* Keep RemoteRepository in sync with GitHub repositories.
* Delete RemoteRepository objects deleted from GitHub.
* Listen to GitHub events to detect full_name changes and update our objects.
Note
We may need/want some of these non-goals in the future. They are just outside the scope of this document.
Current implementation
When a user connects their account to a social account, we create:
* allauth.socialaccount.models.SocialAccount
  * basic information (provider, last login, etc)
  * provider’s specific data saved in a JSON under extra_data
* allauth.socialaccount.models.SocialToken
  * token to hit the API on behalf of the user
We don’t create any RemoteRepository at this point.
They are created when the user opens the “Import Project” page and clicks the circled arrows.
This triggers the sync_remote_repositories task in the background, which updates or creates RemoteRepository objects
but does not delete them (once #7183 and #7310 are merged, they will be deleted).
One RemoteRepository is created per repository the User has access to.
Note
In corporate, we automatically sync RemoteRepository and RemoteOrganization
at signup (foreground) and login (background) via a signal. We should eventually move these to community.
Where is RemoteRepository used?
* List of available repositories to import under “Import Project”
* Show a “+”, “External Arrow” or a “Lock” sign next to the element in the list
  * +: it’s available to be imported
  * External Arrow: the repository is already imported (see RemoteRepository.matches method)
  * Lock: user doesn’t have (admin) permissions to import this repository (uses RemoteRepository.private and RemoteRepository.admin)
* Avatar URL in the list of projects available to import
* Update webhook when user clicks “Resync webhook” from the Admin > Integrations tab
* Send build status when building Pull Requests
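The choice of sign in the import list can be sketched as a small function. This is only an illustration: the function name, the returned labels, the precedence of the checks, and the exact way private and admin are combined for the lock are assumptions, not the actual template logic.

```python
def import_list_sign(already_imported: bool, private: bool, admin: bool) -> str:
    """Pick the sign shown next to a repository in the import list.

    Hypothetical helper: the real code may combine these flags differently.
    """
    if already_imported:
        # Corresponds to a positive RemoteRepository.matches result.
        return "external-arrow"
    if private and not admin:
        # Assumption: a private repository without admin rights is locked.
        return "lock"
    return "+"


signs = [
    import_list_sign(already_imported=True, private=False, admin=True),
    import_list_sign(already_imported=False, private=True, admin=False),
    import_list_sign(already_imported=False, private=True, admin=True),
]
```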
New normalized implementation
The ManyToMany relation RemoteRepository.users will be changed to ManyToMany(through='RemoteRelation')
so the relation can carry extra fields that are specific to each User.
This allows us to have only one RemoteRepository per GitHub repository, with multiple relationships to User.
With this modeling, Project and RemoteRepository no longer have to be disconnected when a user disconnects their account: only that user’s RemoteRelation is removed.
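To make the shape of the data concrete, here is a minimal sketch in plain Python (dataclasses instead of Django models, so it runs on its own). The Store class, its sync and disconnect methods, and the field names project_slug and username are hypothetical; only RemoteRepository, RemoteRelation, remote_id, full_name, admin and json come from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class RemoteRepository:
    """One object per provider repository, shared by every user."""
    remote_id: str
    full_name: str
    project_slug: Optional[str] = None  # the Project link is not tied to a user


@dataclass
class RemoteRelation:
    """Per-user data that would otherwise be duplicated per repository."""
    username: str
    remote_id: str
    admin: bool = False
    json: dict = field(default_factory=dict)


class Store:
    """Hypothetical in-memory stand-in for the two tables."""

    def __init__(self) -> None:
        self.repositories: Dict[str, RemoteRepository] = {}
        self.relations: Dict[Tuple[str, str], RemoteRelation] = {}

    def sync(self, username: str, remote_id: str, full_name: str, admin: bool) -> RemoteRepository:
        # Reuse the repository if another user already synced it.
        repo = self.repositories.setdefault(
            remote_id, RemoteRepository(remote_id, full_name)
        )
        self.relations[(username, remote_id)] = RemoteRelation(username, remote_id, admin)
        return repo

    def disconnect(self, username: str) -> None:
        # Only the relations of this user are removed; repositories stay.
        self.relations = {
            key: rel for key, rel in self.relations.items() if key[0] != username
        }


store = Store()
repo = store.sync("alice", "gh-1", "org/docs", admin=True)
store.sync("bob", "gh-1", "org/docs", admin=False)
repo.project_slug = "docs"
store.disconnect("alice")
```

After alice disconnects, there is still a single repository linked to the docs project, and only the relation for bob remains.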
Note
All the points mentioned in the previous section may need to be adapted to use the new normalized modeling. However, it may be only field renaming or small query changes over new fields.
Use this modeling for SSO
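We can get the list of Project objects a user has access to:
# Assumes RemoteRelation has ``user`` and ``admin`` fields and uses the
# default reverse name ``remoterelation``; both conditions must match the
# same relation row.
admin_remote_repositories = RemoteRepository.objects.filter(
    remoterelation__user=request.user,
    remoterelation__admin=True,  # False for read-only access
)
Project.objects.filter(remote_repository__in=admin_remote_repositories)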
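The lookup above can also be read as a join between Project and the per-user relation table, where the user and admin conditions apply to the same relation row. The following sketch uses sqlite3 with hypothetical table and column names (remote_relation, user_id, remote_repository_id); it is not the actual schema, only an illustration of the query shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE remote_repository (id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE remote_relation (
        user_id INTEGER,
        remote_repository_id INTEGER,
        admin INTEGER,
        UNIQUE (user_id, remote_repository_id)
    );
    CREATE TABLE project (slug TEXT, remote_repository_id INTEGER);

    INSERT INTO remote_repository VALUES (1, 'org/docs'), (2, 'org/api');
    -- user 10 is admin on org/docs and read-only on org/api
    INSERT INTO remote_relation VALUES (10, 1, 1), (10, 2, 0), (11, 2, 1);
    INSERT INTO project VALUES ('docs', 1), ('api', 2);
    """
)


def projects_for(user_id: int, admin: bool) -> list:
    """Return project slugs where the user has the given access level."""
    rows = conn.execute(
        """
        SELECT project.slug
        FROM project
        JOIN remote_relation AS rel
          ON rel.remote_repository_id = project.remote_repository_id
        WHERE rel.user_id = ? AND rel.admin = ?
        ORDER BY project.slug
        """,
        (user_id, int(admin)),
    )
    return [slug for (slug,) in rows]


admin_projects = projects_for(10, admin=True)
read_only_projects = projects_for(10, admin=False)
```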
Rollout plan
Due to the constraints we have on the RemoteRepository table and its size,
we can’t just do the data migration at the same time as the deploy.
Because of this, we need to be more creative and find a way to re-sync the data from VCS providers
while the site continues working.
To achieve this, we plan to follow these steps:
1. Modify all the Python code to use the new modeling in .org and .com (this will make it easier to find bugs locally)
2. QA this locally with test data
3. Enable the Django signal to re-sync RemoteRepository on login asynchronously (we already have this in .com). New active users will have updated data immediately
4. Spin up a new instance with the refactored code
5. Run migrations to create a new table for RemoteRepository
6. Re-sync everything from VCS providers into the new table for a week or so
7. Dump-n-load Project - RemoteRepository relations
8. Create a migration to use the new table with synced data
9. Deploy the new code once the sync is finished
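The dump-n-load step for Project - RemoteRepository relations could look roughly like the sketch below. It assumes old and new rows are matched on full_name, which is an assumption for illustration only (full_name can change on the provider side); the function name and the dictionary shapes are hypothetical.

```python
def remap_project_links(old_links, old_repos, new_repos_by_name):
    """Translate Project links from old repository IDs to new ones.

    old_links:         {project_slug: old_repository_id}
    old_repos:         {old_repository_id: full_name}
    new_repos_by_name: {full_name: new_repository_id}
    Returns (remapped, unmatched): the new links, plus the project slugs
    whose repository was not found in the new table.
    """
    remapped, unmatched = {}, []
    for slug, old_id in sorted(old_links.items()):
        full_name = old_repos.get(old_id)
        new_id = new_repos_by_name.get(full_name)
        if new_id is None:
            unmatched.append(slug)  # needs manual review or another re-sync
        else:
            remapped[slug] = new_id
    return remapped, unmatched


# The old table holds one row per user, so org/docs appears twice.
old_repos = {101: "org/docs", 102: "org/docs", 103: "org/gone"}
old_links = {"docs": 102, "legacy": 103}
new_repos_by_name = {"org/docs": 1}

remapped, unmatched = remap_project_links(old_links, old_repos, new_repos_by_name)
```

Projects that end up in the unmatched list are the ones whose repository has not been re-synced into the new table yet.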
See these issues for more context:
* https://github.com/readthedocs/readthedocs.org/pull/7536#issuecomment-724102640
* https://github.com/readthedocs/readthedocs.org/pull/7675#issuecomment-732756118