feat: add bot detection in data ingestion #3347

Open: wants to merge 1 commit into base: main

@skwowet (Member) commented Aug 22, 2025

Changes proposed ✍️

Added bot detection logic to the data-sink-worker to automatically identify bot accounts during member creation. The detection system uses a three-tier approach:

  • Strong patterns for immediate bot confirmation, covering clear indicators such as [bot] notation and well-known bot services (e.g., dependabot, renovate, coderabbit), among other unambiguous patterns.
  • Known bots list for specific accounts that cannot be consistently detected with regex alone.
  • Common patterns for broader automation keywords and platform prefixes, which are used to flag potential bots for LLM validation.
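
As a rough illustration of the three tiers above, here is a minimal sketch. The regexes, the known-bots entry, the `BotVerdict` type, and the `classifyUsername` function are all hypothetical, since the actual patterns and names in the data-sink-worker are not shown in this description. It also assumes that a known-bots match counts as a confirmed bot, like a strong pattern.

```typescript
// Hypothetical verdict type; the PR does not name these values.
type BotVerdict = "bot" | "suspected" | "not-bot";

// Tier 1: strong patterns that confirm a bot immediately (illustrative only).
const STRONG_PATTERNS: RegExp[] = [
  /\[bot\]/i,
  /^(dependabot|renovate|coderabbit)/i,
];

// Tier 2: specific accounts that regexes cannot catch reliably (placeholder entry).
const KNOWN_BOTS = new Set<string>(["example-ci-account"]);

// Tier 3: broader automation keywords that only raise suspicion (illustrative only).
const COMMON_PATTERNS: RegExp[] = [/(^|[-_])(bot|automation|ci)([-_]|$)/i];

function classifyUsername(username: string): BotVerdict {
  const name = username.trim().toLowerCase();
  if (STRONG_PATTERNS.some((p) => p.test(name))) return "bot";
  if (KNOWN_BOTS.has(name)) return "bot";
  // A common-pattern hit is not a confirmation; it is handed to LLM validation.
  if (COMMON_PATTERNS.some((p) => p.test(name))) return "suspected";
  return "not-bot";
}
```

Checking the tiers in order of certainty means a cheap, high-confidence match short-circuits before the broader patterns run, and only the ambiguous remainder needs LLM validation.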

Bot flags provided by source integrations always take precedence over our detection logic, and suspected bots are explicitly flagged for LLM validation. Existing isBot values are preserved during member updates to ensure we do not overwrite integration-provided information.
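
The precedence rules could look roughly like the sketch below. The `BotResolution` shape, the `needsLlmValidation` field, and the `resolveBotFlag` function are hypothetical names. The sketch also makes two assumptions: an integration-provided flag on the incoming payload wins over a stored value, and a suspected bot stays `isBot: false` until validation confirms it. The description does not state either point.

```typescript
// Hypothetical types; field names other than isBot are assumptions.
type BotVerdict = "bot" | "suspected" | "not-bot";

interface BotResolution {
  isBot: boolean;
  needsLlmValidation: boolean;
}

function resolveBotFlag(
  detected: BotVerdict,
  integrationFlag?: boolean, // isBot supplied by the source integration, if any
  existingFlag?: boolean, // isBot already stored on the member (updates only)
): BotResolution {
  // 1. A flag from the source integration always takes precedence over detection.
  if (integrationFlag !== undefined) {
    return { isBot: integrationFlag, needsLlmValidation: false };
  }
  // 2. On updates, keep the stored value instead of overwriting it.
  if (existingFlag !== undefined) {
    return { isBot: existingFlag, needsLlmValidation: false };
  }
  // 3. Otherwise fall back to our own detection result.
  return {
    isBot: detected === "bot",
    needsLlmValidation: detected === "suspected",
  };
}
```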

Also, blacklistedDomains was converted from an array to a Set so that each domain lookup is an average-case O(1) membership check instead of an O(n) array scan.
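
For illustration, the array-to-Set change amounts to something like the following. The domain values and the `isBlacklisted` helper are placeholders, not the project's actual list or function, and lowercasing the input is an assumption.

```typescript
// Placeholder entries; the real blacklist contents are not shown in this PR description.
const blacklistedDomains = new Set<string>(["example.com", "example.org"]);

// Set.prototype.has is a hash lookup, whereas Array.prototype.includes scans every element.
function isBlacklisted(domain: string): boolean {
  return blacklistedDomains.has(domain.toLowerCase());
}
```

The difference matters most on the ingestion path, where the check runs once per incoming record and not once per deployment.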

@skwowet skwowet self-assigned this Aug 22, 2025