feat: add the archived repositories checker [IN-551] #3320


Open · wants to merge 21 commits into base: main
Conversation

@borfast borfast (Collaborator) commented Aug 11, 2025

Implements the recurring archived repository checker, for IN-551.

Since this was requested as a cron job, it doesn't use the usual Temporal structure. It is meant to be executed periodically via a Kubernetes CronJob resource (added in the crowd-kube repository) but can also be executed independently, if necessary.

Instead of Temporal, it uses BullMQ to manage the task queue, which is much simpler and runs locally, requiring only Redis.

This can also serve as an example/template for other simple recurring tasks that don't require the complexity of Temporal, and it may allow us to slowly start moving away from it to reduce costs and simplify things.
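
For reference, the BullMQ side looks roughly like the sketch below. The queue name, Redis settings, and job payload are illustrative, not necessarily what the service uses:

```ts
import { Queue } from 'bullmq';

// Redis is the only infrastructure BullMQ needs; host/port are illustrative.
const connection = { host: 'localhost', port: 6379 };

const queue = new Queue('archived-repositories', { connection });

// One job per repository URL; a separate worker checks the archive status.
async function enqueueBatch(urls: string[]): Promise<void> {
  await queue.addBulk(
    urls.map((url) => ({ name: 'check-archived', data: { url } })),
  );
}
```

Because everything is a plain Node process plus Redis, it can run as a one-shot Kubernetes CronJob or be invoked manually.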

@borfast borfast requested review from ulemons, mbani01 and Copilot August 11, 2025 14:45
@borfast borfast self-assigned this Aug 11, 2025

gitguardian bot commented Aug 11, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 19935597 | Triggered | Generic Password | f0f0841 | services/cronjobs/archived_repositories/docker-compose.yaml |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn the best practices here.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@Copilot Copilot AI (Contributor) left a comment


Pull Request Overview

This PR introduces a new cronjob service for checking the archived status of GitHub and GitLab repositories. The service uses BullMQ instead of Temporal for task queue management, providing a simpler alternative for recurring tasks that don't require Temporal's complexity.

Key changes:

  • Creates a new cronjob service architecture using BullMQ and Redis for task queuing
  • Implements rate-limited API clients for GitHub and GitLab to check repository archive status
  • Adds database migration to track last archived check timestamps

Reviewed Changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| services/cronjobs/archived_repositories/src/main.ts | Main orchestrator that batches repository URLs and enqueues jobs |
| services/cronjobs/archived_repositories/src/workers.ts | BullMQ workers with platform-specific rate limiting for API calls |
| services/cronjobs/archived_repositories/src/database.ts | Database operations for fetching URLs and updating repository status |
| services/cronjobs/archived_repositories/src/config.ts | Configuration management with environment variable validation |
| services/cronjobs/archived_repositories/src/clients/*.ts | API clients for GitHub and GitLab archive status checks |
| services/cronjobs/archived_repositories/src/utils.ts | URL parsing utility for extracting platform/owner/repo information |
| services/cronjobs/archived_repositories/src/types.ts | Type definitions and constants for the service |
| services/cronjobs/archived_repositories/package.json | Dependencies and build scripts for the cronjob service |
| backend/src/database/migrations/*.sql | Database migration adding last_archived_check column and index |
| pnpm-workspace.yaml | Excludes cronjobs from workspace to allow independent dependency management |

Files not reviewed (1)
  • services/cronjobs/archived_repositories/pnpm-lock.yaml: Language not supported
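
As a rough illustration of the platform-specific rate limiting described for workers.ts, BullMQ workers accept a limiter option. The queue name, token variable, and limits below are assumptions rather than the actual values in this PR:

```ts
import { Worker } from 'bullmq';
import { ofetch } from 'ofetch';

// One worker per platform so each can respect its own API rate limit.
// limiter: process at most `max` jobs per `duration` milliseconds.
const githubWorker = new Worker(
  'github-archived-checks', // illustrative queue name
  async (job) => {
    const { owner, repo } = job.data as { owner: string; repo: string };
    const data = await ofetch(`https://api.github.com/repos/${owner}/${repo}`, {
      headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` },
    });
    return data.archived === true;
  },
  {
    connection: { host: 'localhost', port: 6379 },
    limiter: { max: 100, duration: 60_000 }, // illustrative: ~100 requests/minute
  },
);
```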

Comment on lines +5 to +8

```ts
const data = await ofetch(`https://api.github.com/repos/${owner}/${repo}`, {
  headers: { Authorization: `Bearer ${config.GithubToken}` },
});
```
Contributor

How are we handling unsuccessful responses here? I see there is existing logic to limit the number of requests to avoid hitting rate limits, but what will happen if we receive any response other than 200?

Collaborator Author

Yesterday I asked Joana about the same thing, more specifically about what happens when we get an error from GitHub (for example) saying the repository does not exist. We decided to add a new deleted column to track that, but we didn't talk about the general error scenario.

BullMQ can automatically retry failed jobs depending on the settings, so I would say we could use that as a first measure. After that, BullMQ puts all failed jobs in a special failed queue, to which we can later add some monitoring if we want to.
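
For example, something like this when enqueueing jobs (the attempt count and backoff values are just placeholders, and the queue mirrors the sketch in the description):

```ts
import { Queue } from 'bullmq';

const queue = new Queue('archived-repositories', {
  connection: { host: 'localhost', port: 6379 },
});

// Retry a few times with exponential backoff; anything that still fails
// ends up in the failed set, which we can monitor separately.
await queue.add(
  'check-archived',
  { url: 'https://github.com/some-org/some-repo' }, // illustrative payload
  {
    attempts: 3,
    backoff: { type: 'exponential', delay: 30_000 },
    removeOnComplete: true,
    removeOnFail: false, // keep failed jobs around for inspection
  },
);
```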

Contributor

Alright, I see 👍
My only concern was silent failures, but since we're tracking them, that should be fine

Collaborator Author

Yeah, I agree. I don't like silent failures either, but I thought it wasn't super critical because we are at least tracking the failures.

I'd still prefer to be warned, though, rather than only becoming aware of an issue after a user complains that their archived repository is still being used for metrics.

We can use BullMQ's events to warn us about failures by sending a Slack message or email, for example. I'll make a note of this to speak with Joana about it and implement it next.
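
Something along these lines, for example (the webhook URL and message format are placeholders, and the queue name matches the earlier sketch):

```ts
import { QueueEvents } from 'bullmq';

const queueEvents = new QueueEvents('archived-repositories', {
  connection: { host: 'localhost', port: 6379 },
});

// Fired when a job has exhausted its retries and lands in the failed set.
queueEvents.on('failed', async ({ jobId, failedReason }) => {
  // SLACK_WEBHOOK_URL is a placeholder; any alerting channel would work.
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `Archived repository check failed (job ${jobId}): ${failedReason}`,
    }),
  });
});
```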

@borfast borfast (Collaborator Author) commented Aug 12, 2025

Oh, and if we want to get fancy, we can also install a dashboard for BullMQ. There are a few of them, like this, this, and this.

@mbani01 mbani01 (Contributor) commented Aug 12, 2025

That would be great! Otherwise, if you're already storing them in the db, we can easily add a Metaplane monitor since it's already configured to trigger Slack alerts (can be done in less than 5 minutes), but it's not fancy 😅

Collaborator Author

Oh, that would be awesome! Let's talk about that in Slack. It seems like a very good idea and something we could do without much work. 👍

@borfast borfast requested a review from themarolt as a code owner August 14, 2025 10:46
@@ -0,0 +1,87 @@

```ts
import dotenv from 'dotenv';
```
Contributor

Not really sure if having a separate .env file just for this service and using dotenv is the correct way, now that all other services just use the backend/.env... files and the backend-config ConfigMaps on Kubernetes clusters.

Collaborator Author

If you think it's better to have everything in the backend-config ConfigMap, I can change it.

The idea for the .env file is mainly for someone to be able to run the service locally. For production it is not meant to be used, as the shared configuration details (e.g. database credentials) are still loaded from the backend-config ConfigMap, and there would be a separate ConfigMap with the specific things that only this service uses (e.g. batch settings, or the GitHub and GitLab tokens created just for this).

I figured I should not pollute the global ConfigMap for everyone by adding the specific things there if no other service will use them. Also, the principles of least privilege and data segregation suggest that each service should only have access to the minimum amount of data and permissions, and putting everything in the same ConfigMap grants every service access to everything.

But I actually forgot to include this, so I was just working on that now. I pushed a couple of commits that clarify this in this repo, and another commit in the crowd-kube repo with a ConfigMap for this service.

That said, if you think it's better to put everything in the same ConfigMap, I can easily change it.
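
To make the intent concrete, the local-only .env would work roughly like this, with production values coming from the ConfigMaps instead. The variable names below are illustrative, not necessarily the ones the service actually uses:

```ts
import dotenv from 'dotenv';

// Loads .env if present (local development). In the cluster the variables
// come from the backend-config ConfigMap plus a service-specific ConfigMap;
// dotenv never overrides variables that are already set.
dotenv.config();

function required(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

// Illustrative names only.
export const config = {
  GithubToken: required('GITHUB_TOKEN'),
  GitlabToken: required('GITLAB_TOKEN'),
  BatchSize: Number(process.env.BATCH_SIZE ?? 100),
};
```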

Contributor

I think it's gonna be clearer if everything is in one place as is the case now. cc @anilb0stanci what do you think?

Collaborator Author

OK, no worries, I'll merge them.

@borfast borfast requested a review from themarolt August 14, 2025 11:38