-
Notifications
You must be signed in to change notification settings - Fork 734
feat: data pipeline changes for segmentRepositories [IN-552] [IN-554] #3326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: data pipeline changes for segmentRepositories [IN-552] [IN-554] #3326
Conversation
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conventional Commits FTW!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements the infrastructure to propagate archived and excluded repository data from the segmentRepositories
table through the TinyBird pipeline to support enhanced project filtering and search capabilities.
- Adds database migration to enable replication of the
segmentRepositories
table to TinyBird - Creates a new
segmentRepositories
data source in TinyBird and updates pipeline to aggregate archived/excluded repository lists - Propagates archived repositories data through the insights project pipeline to enable repository-aware search functionality
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
backend/src/database/migrations/V1753798345__segmentRepositories_replication.sql | Enables replication of segmentRepositories table to TinyBird |
services/libs/tinybird/datasources/segmentRepositories.datasource | Defines TinyBird data source schema for segmentRepositories table |
services/libs/tinybird/datasources/insights_projects_populated_ds.datasource | Adds archived/excluded repositories fields to project data source |
services/libs/tinybird/pipes/insights_projects_populated_copy.pipe | Aggregates archived/excluded repositories and joins to project data |
services/libs/tinybird/pipes/insightsProjects_filtered.pipe | Passes through archived/excluded repositories to filtered results |
services/libs/tinybird/pipes/search_collections_projects_repos.pipe | Exposes archived repositories in search results |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
services/libs/tinybird/pipes/insights_projects_populated_copy.pipe
Outdated
Show resolved
Hide resolved
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
@@ -0,0 +1,3 @@ | |||
ALTER PUBLICATION sequin_pub ADD TABLE "segmentRepositories"; | |||
ALTER TABLE "segmentRepositories" REPLICA IDENTITY FULL; | |||
GRANT SELECT ON "segmentRepositories" to sequin; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need this grant - sequin user only works on the sequin schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't it need access to do the backfills? I confess I never checked how it does them, but I assumed it was just querying the database normally.
services/libs/tinybird/datasources/insights_projects_populated_ds.datasource
Outdated
Show resolved
Hide resolved
`createdAt` DateTime64(3) `json:$.record.createdAt`, | ||
`archived` Bool `json:$.record.archived`, | ||
`excluded` Bool `json:$.record.excluded`, | ||
`last_archived_check` Nullable(DateTime64(3)) `json:$.record.last_archived_check` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using Nullable fields, it's better to use Non-nullable fields with default values for performance reasons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The column is a datetime and nullable on Postgres, and I think we might have had similar situations before - how are we converting null rows to non-null for Tinybird? Do we do a transformation in Sequin, for example?
ENGINE ReplacingMergeTree | ||
ENGINE_PARTITION_KEY toYear(createdAt) | ||
ENGINE_SORTING_KEY repository | ||
ENGINE_VER createdAt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need an updatedAt
field to keep track of different versions of the same row. createdAt
won't change on consecutive updates to the same row (when it's archived or excluded, for instance). So it can even pick the old row when an update happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah, I forgot about that. The original table Umberto created doesn't have it and I wondered if it was necessary, but forgot to check it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the column and also updated the worker code to update its timestamp whenever it updates a repository archived / excluded status.
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
…olumns Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
Signed-off-by: Raúl Santos <4837+borfast@users.noreply.github.com>
In the context of IN-524, we want to have the archived and excluded repositories for each project returned along with the project data.
This means adding the repository data to TinyBird, which already includes the archived and excluded flags in the
segmentRepositories
table, and then merge that data into the existing pipes that return project data, namely theprojects_list
and thesearch_collections_projects_repos
pipes.This PR:
segmentRepositories
table into Sequin.segmentRepositories
data source in TinyBird to receive data from the aforementioned table.insights_projects_populated_copy
pipe to include the list of archived and excluded repositories, which is then propagated to theinsights_projects_populated_ds
data source, from there it feeds theinsightsProjects_filtered
pipe, and finally theprojects_list
pipe, which is the end goal.search_collections_projects_repos
containsarchived
andexcluded
columns for the repositories results. This required adding aJOIN
to the query, and with all the optimisations that were done not too long ago, I'm not sure if thisSpecific tickets in Jira: