feat: add the archived repositories checker [IN-551] #3320


Open · wants to merge 21 commits into base: main
Conversation

@borfast borfast (Collaborator) commented Aug 11, 2025

Implements the recurring archived repository checker, for IN-551.

Since this was requested as a cron job, it doesn't use the usual Temporal structure. It is meant to be executed periodically via a Kubernetes CronJob resource (added in the crowd-kube repository) but can also be executed independently, if necessary.

Instead of Temporal, it uses BullMQ to manage the task queue, which is much simpler and runs locally, requiring only Redis.

This can also serve as an example/template for other simple recurring tasks that don't require the complexity of Temporal, and it may allow us to slowly start moving away from it to reduce costs and simplify things.
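
For reference, the BullMQ side looks roughly like the sketch below. The queue name, Redis settings, and job payload are illustrative, not necessarily what the service uses:

```ts
import { Queue } from 'bullmq';

// Redis is the only infrastructure BullMQ needs; host/port are illustrative.
const connection = { host: 'localhost', port: 6379 };

const queue = new Queue('archived-repositories', { connection });

// One job per repository URL; a separate worker checks the archive status.
async function enqueueBatch(urls: string[]): Promise<void> {
  await queue.addBulk(
    urls.map((url) => ({ name: 'check-archived', data: { url } })),
  );
}
```

Because everything is a plain Node process plus Redis, it can run as a one-shot Kubernetes CronJob or be invoked manually.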

@borfast borfast requested review from ulemons, mbani01 and Copilot August 11, 2025 14:45
@borfast borfast self-assigned this Aug 11, 2025

gitguardian bot commented Aug 11, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 19935597 | Triggered | Generic Password | f0f0841 | services/cronjobs/archived_repositories/docker-compose.yaml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn the best practices here.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces a new cronjob service for checking the archived status of GitHub and GitLab repositories. The service uses BullMQ instead of Temporal for task queue management, providing a simpler alternative for recurring tasks that don't require Temporal's complexity.

Key changes:

  • Creates a new cronjob service architecture using BullMQ and Redis for task queuing
  • Implements rate-limited API clients for GitHub and GitLab to check repository archive status
  • Adds database migration to track last archived check timestamps

Reviewed Changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| services/cronjobs/archived_repositories/src/main.ts | Main orchestrator that batches repository URLs and enqueues jobs |
| services/cronjobs/archived_repositories/src/workers.ts | BullMQ workers with platform-specific rate limiting for API calls |
| services/cronjobs/archived_repositories/src/database.ts | Database operations for fetching URLs and updating repository status |
| services/cronjobs/archived_repositories/src/config.ts | Configuration management with environment variable validation |
| services/cronjobs/archived_repositories/src/clients/*.ts | API clients for GitHub and GitLab archive status checks |
| services/cronjobs/archived_repositories/src/utils.ts | URL parsing utility for extracting platform/owner/repo information |
| services/cronjobs/archived_repositories/src/types.ts | Type definitions and constants for the service |
| services/cronjobs/archived_repositories/package.json | Dependencies and build scripts for the cronjob service |
| backend/src/database/migrations/*.sql | Database migration adding last_archived_check column and index |
| pnpm-workspace.yaml | Excludes cronjobs from workspace to allow independent dependency management |

Files not reviewed (1)
  • services/cronjobs/archived_repositories/pnpm-lock.yaml: Language not supported
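
As a rough illustration of the platform-specific rate limiting described for workers.ts, BullMQ workers accept a limiter option. The queue name, token variable, and limits below are assumptions rather than the actual values in this PR:

```ts
import { Worker } from 'bullmq';
import { ofetch } from 'ofetch';

// One worker per platform so each can respect its own API rate limit.
// limiter: process at most `max` jobs per `duration` milliseconds.
const githubWorker = new Worker(
  'github-archived-checks', // illustrative queue name
  async (job) => {
    const { owner, repo } = job.data as { owner: string; repo: string };
    const data = await ofetch(`https://api.github.com/repos/${owner}/${repo}`, {
      headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
    });
    return data.archived === true;
  },
  {
    connection: { host: 'localhost', port: 6379 },
    limiter: { max: 100, duration: 60_000 }, // illustrative: ~100 requests/minute
  },
);
```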

Comment on lines +5 to +8

```ts
const data = await ofetch(`https://api.github.com/repos/${owner}/${repo}`, {
  headers: { Authorization: `Bearer ${config.GithubToken}` },
});
```
Contributor

How are we handling unsuccessful responses here? I see there is existing logic to limit the number of requests to avoid hitting rate limits, but what will happen if we receive any response other than 200?

Collaborator Author

Yesterday I asked Joana about the same thing, more specifically about what happens when we get an error from GitHub (for example) saying the repository does not exist. We decided to add a new deleted column to track that, but we didn't talk about the general error scenario.

BullMQ can automatically retry failed jobs depending on the settings, so I would say we could use that as a first measure. After that, BullMQ puts all failed jobs in a special failed queue, to which we can later add some monitoring if we want to.
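
For example, something like this when enqueueing jobs (the attempt count and backoff values are just placeholders, and the queue mirrors the sketch in the description):

```ts
import { Queue } from 'bullmq';

const queue = new Queue('archived-repositories', {
  connection: { host: 'localhost', port: 6379 },
});

// Retry a few times with exponential backoff; anything that still fails
// ends up in the failed set, which we can monitor separately.
await queue.add(
  'check-archived',
  { url: 'https://github.com/some-org/some-repo' }, // illustrative payload
  {
    attempts: 3,
    backoff: { type: 'exponential', delay: 30_000 },
    removeOnComplete: true,
    removeOnFail: false, // keep failed jobs around for inspection
  },
);
```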

Contributor

Alright, I see 👍
My only concern was silent failures, but since we're tracking them, that should be fine

Collaborator Author

Yeah, I agree. I don't like silent failures either, but I thought it wasn't super critical because we are at least tracking the failures.

I'd still prefer to be warned, though, rather than only becoming aware of an issue after a user complains that their archived repository is still being used for metrics.

We can use BullMQ's events to warn us about failures by sending a Slack message or email, for example. I'll make a note of this to speak with Joana about it and implement it next.
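
Something along these lines, for example (the webhook URL and message format are placeholders, and the queue name matches the earlier sketch):

```ts
import { QueueEvents } from 'bullmq';

const queueEvents = new QueueEvents('archived-repositories', {
  connection: { host: 'localhost', port: 6379 },
});

// Fired when a job has exhausted its retries and lands in the failed set.
queueEvents.on('failed', async ({ jobId, failedReason }) => {
  // SLACK_WEBHOOK_URL is a placeholder; any alerting channel would work.
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `Archived repository check failed (job ${jobId}): ${failedReason}`,
    }),
  });
});
```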

@borfast borfast (Collaborator Author) commented Aug 12, 2025

Oh, and if we want to get fancy, we can also install a dashboard for BullMQ. There are a few of them, like this, this, and this.

@mbani01 mbani01 (Contributor) commented Aug 12, 2025

That would be great! Otherwise, if you're already storing them in the db, we can easily add a Metaplane monitor since it's already configured to trigger Slack alerts (can be done in less than 5 minutes), but it's not fancy 😅

Collaborator Author

Oh, that would be awesome! Let's talk about that in Slack. It seems like a very good idea and something we could do without much work. 👍

@borfast borfast requested a review from themarolt as a code owner August 14, 2025 10:46
@@ -0,0 +1,87 @@

```ts
import dotenv from 'dotenv';
```
Contributor

Not really sure if having a separate .env file just for this service and using dotenv is the correct way, now that all other services just use the backend/.env... files and the backend-config ConfigMaps on Kubernetes clusters.

Collaborator Author

If you think it's better to have everything in the backend-config ConfigMap, I can change it.

The idea for the .env file is mainly for someone to be able to run the service locally. For production it is not meant to be used, as the shared configuration details (e.g. database credentials) are still loaded from the backend-config ConfigMap, and there would be a separate ConfigMap with the specific things that only this service uses (e.g. batch settings, or the GitHub and GitLab tokens created just for this).

I figured I should not pollute the global ConfigMap for everyone by adding the specific things there if no other service will use them. Also, the principles of least privilege and data segregation suggest that each service should only have access to the minimum amount of data and permissions, and putting everything in the same ConfigMap grants every service access to everything.

But I actually forgot to include this, so I was just working on that now. I pushed a couple of commits that clarify this in this repo, and another commit in the crowd-kube repo with a ConfigMap for this service.

That said, if you think it's better to put everything in the same ConfigMap, I can easily change it.
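
To make the intent concrete, the local-only .env would work roughly like this, with production values coming from the ConfigMaps instead. The variable names below are illustrative, not necessarily the ones the service actually uses:

```ts
import dotenv from 'dotenv';

// Loads .env if present (local development). In the cluster the variables
// come from the backend-config ConfigMap plus a service-specific ConfigMap;
// dotenv never overrides variables that are already set.
dotenv.config();

function required(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

// Illustrative names only.
export const config = {
  GithubToken: required('GITHUB_TOKEN'),
  GitlabToken: required('GITLAB_TOKEN'),
  BatchSize: Number(process.env.BATCH_SIZE ?? 100),
};
```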

Contributor

I think it's gonna be clearer if everything is in one place as is the case now. cc @anilb0stanci what do you think?

Collaborator Author

OK, no worries, I'll merge them.

@borfast borfast requested a review from themarolt August 14, 2025 11:38