Skip to content

feat: data pipeline changes for segmentRepositories [IN-552] [IN-554] #3326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

borfast
Copy link
Collaborator

@borfast borfast commented Aug 12, 2025

In the context of IN-524, we want to have the archived and excluded repositories for each project returned along with the project data.

This means adding the repository data to TinyBird, which already includes the archived and excluded flags in the segmentRepositories table, and then merge that data into the existing pipes that return project data, namely the projects_list and the search_collections_projects_repos pipes.

This PR:

  • Adds the database migration to ensure we get data from the segmentRepositories table into Sequin.
  • Adds a new segmentRepositories data source in TinyBird to receive data from the aforementioned table.
  • Modifies the insights_projects_populated_copy pipe to include the list of archived and excluded repositories, which is then propagated to the insights_projects_populated_ds data source, from there it feeds the insightsProjects_filtered pipe, and finally the projects_list pipe, which is the end goal.
    • The search_collections_projects_repos contains archived and excluded columns for the repositories results. This required adding a JOIN to the query, and with all the optimisations that were done not too long ago, I'm not sure if this

Specific tickets in Jira:

Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
@borfast borfast requested a review from Copilot August 12, 2025 22:29
@borfast borfast self-assigned this Aug 12, 2025
@borfast borfast requested a review from epipav as a code owner August 12, 2025 22:29
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conventional Commits FTW!

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements the infrastructure to propagate archived and excluded repository data from the segmentRepositories table through the TinyBird pipeline to support enhanced project filtering and search capabilities.

  • Adds database migration to enable replication of the segmentRepositories table to TinyBird
  • Creates a new segmentRepositories data source in TinyBird and updates pipeline to aggregate archived/excluded repository lists
  • Propagates archived repositories data through the insights project pipeline to enable repository-aware search functionality

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
backend/src/database/migrations/V1753798345__segmentRepositories_replication.sql Enables replication of segmentRepositories table to TinyBird
services/libs/tinybird/datasources/segmentRepositories.datasource Defines TinyBird data source schema for segmentRepositories table
services/libs/tinybird/datasources/insights_projects_populated_ds.datasource Adds archived/excluded repositories fields to project data source
services/libs/tinybird/pipes/insights_projects_populated_copy.pipe Aggregates archived/excluded repositories and joins to project data
services/libs/tinybird/pipes/insightsProjects_filtered.pipe Passes through archived/excluded repositories to filtered results
services/libs/tinybird/pipes/search_collections_projects_repos.pipe Exposes archived repositories in search results

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
@borfast borfast changed the title Sata pipeline changes for segmentRepositories [IN-552] feat: data pipeline changes for segmentRepositories [IN-552] [IN-554] Aug 12, 2025
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
@@ -0,0 +1,3 @@
ALTER PUBLICATION sequin_pub ADD TABLE "segmentRepositories";
ALTER TABLE "segmentRepositories" REPLICA IDENTITY FULL;
GRANT SELECT ON "segmentRepositories" to sequin;
Copy link
Collaborator

@epipav epipav Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this grant - sequin user only works on the sequin schema

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it need access to do the backfills? I confess I never checked how it does them, but I assumed it was just querying the database normally.

`createdAt` DateTime64(3) `json:$.record.createdAt`,
`archived` Bool `json:$.record.archived`,
`excluded` Bool `json:$.record.excluded`,
`last_archived_check` Nullable(DateTime64(3)) `json:$.record.last_archived_check`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using Nullable fields, it's better to use Non-nullable fields with default values for performance reasons

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The column is a datetime and nullable on Postgres, and I think we might have had similar situations before - how are we converting null rows to non-null for Tinybird? Do we do a transformation in Sequin, for example?

ENGINE ReplacingMergeTree
ENGINE_PARTITION_KEY toYear(createdAt)
ENGINE_SORTING_KEY repository
ENGINE_VER createdAt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need an updatedAt field to keep track of different versions of the same row. createdAt won't change on consecutive updates to the same row (when it's archived or excluded, for instance). So it can even pick the old row when an update happens

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, I forgot about that. The original table Umberto created doesn't have it and I wondered if it was necessary, but forgot to check it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the column and also updated the worker code to update its timestamp whenever it updates a repository archived / excluded status.

Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
…olumns

Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
@borfast borfast requested a review from themarolt as a code owner August 16, 2025 11:31
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants