Skip to content

Add database backup functionality with GCS integration #297

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

domdomegg
Copy link
Member

@domdomegg domdomegg commented Aug 22, 2025

Implements automated database backup functionality with Google Cloud Storage integration, including retention policies and MinIO support for local development.

Motivation and Context

This change adds critical database backup capabilities to ensure data durability and disaster recovery. The solution provides automated backups with configurable retention policies and supports both production (GCS) and development (MinIO) environments.

How Has This Been Tested?

  • Tested backup creation and restoration with MinIO in local development
  • Verified GCS bucket lifecycle policies for automatic deletion after 60 days
  • Tested backup retention and cleanup logic

Breaking Changes

No breaking changes. This is an additive feature that doesn't affect existing functionality.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

  • Implements 60-day retention period for backups as a safety net
  • Uses GCS lifecycle rules for automatic cleanup
  • Includes MinIO setup instructions for local development testing
  • Port-forwarding commands updated for consistency across documentation
  • Fixes Backup cadence implemented for database #184

domdomegg and others added 5 commits August 22, 2025 14:02
🏠 Remote-Dev: homespace
🏠 Remote-Dev: homespace
@domdomegg
Copy link
Member Author

@claude can you review?

Copy link
Contributor

claude bot commented Aug 22, 2025

Claude finished @domdomegg's task —— View job


PR Review: Database Backup Functionality

Comprehensive review of the database backup implementation with GCS integration and K8up operator.

✅ Overall Assessment

This is a well-implemented backup solution that adds critical data durability to the MCP Registry. The architecture is sound, using industry-standard tools (K8up/Restic) with proper cloud integration.

Key Strengths:

  • Clean provider pattern supporting both GCP production and local development
  • Proper retention policies with safety nets
  • Good separation of concerns in the codebase
  • Comprehensive documentation updates
  • Environment-specific configurations
🔧 Technical Implementation

Backup Architecture

  • K8up operator: Kubernetes-native backup solution using Restic
  • Dual storage: GCS for production, MinIO for local development
  • S3-compatible interface: Unified approach across environments
  • Scheduled operations: Daily backups at 4:46 AM, pruning at 5:46 AM

Code Quality

  • Well-structured provider pattern in deploy/pkg/providers/
  • Proper dependency management with Pulumi resources
  • Good error handling and resource cleanup
  • Environment-specific configurations (frequent backups for testing)
⚠️ Security & Reliability Concerns

Issues Found:

  1. Hardcoded Credentials in Local Provider (deploy/pkg/providers/local/provider.go:63):

    StringData: pulumi.StringMap{
        "password": pulumi.String("password"), // In production we use GCS, which is already encrypted
    • Uses weak password "password" for repository encryption
    • Comment acknowledges GCS encryption but local backups still need protection
  2. Default MinIO Credentials (deploy/pkg/providers/local/provider.go:69-70, 260-261):

    "accesskey": pulumi.String("minioadmin"),
    "secretkey": pulumi.String("minioadmin"),
    • Uses default credentials that are well-known
  3. Root Container Execution (deploy/pkg/k8s/backup.go:115):

    podSecurityContext: map[string]any{
        "runAsUser": 0, // Run as root to access all files
    • While necessary for backup access, consider if more restrictive permissions are possible

Security Recommendations:

  • Generate random password for local Restic repository
  • Use configurable MinIO credentials with secure defaults
  • Document security implications of root execution requirement
🏗️ Design & Architecture

Well-Designed Aspects:

  • Provider abstraction: Clean interface allowing easy addition of new cloud providers
  • Dependency management: Proper resource dependencies ensure correct deployment order
  • Configuration flexibility: Environment-specific settings for development vs production
  • Retention strategy: 28-day K8up pruning + 60-day GCS safety net

Areas for Enhancement:

  • Consider making backup schedules configurable via Pulumi config
  • Add monitoring/alerting configuration for backup failures
  • Document disaster recovery procedures beyond the current restore steps
📚 Documentation

Excellent Documentation Updates:

  • Clear setup instructions for both local and production environments
  • Step-by-step backup access and restoration procedures
  • Troubleshooting section with helpful kubectl commands
  • Updated port-forwarding commands for consistency

Minor Suggestions:

  • Add example of monitoring backup job status
  • Consider adding a backup verification procedure
  • Document retention policy rationale (why 60 vs 28 days)
🔍 Code-Specific Issues

File: deploy/pkg/k8s/backup.go

  • Line 26: Hardcoded K8up version 4.8.4 - consider making configurable
  • Line 71-73: Schedule times use specific minutes (:46) - document rationale
  • Lines 75-79: Test schedules very aggressive (every minute) - might overwhelm test environments

File: deploy/pkg/providers/gcp/provider.go

  • Line 208: 60-day retention hardcoded - should align with configuration management approach
  • Line 268-269: HMAC credentials stored as StringMap - verify this is secure for production

File: deploy/pkg/providers/local/provider.go

  • Line 269: MinIO endpoint uses internal DNS name - ensure this works in all local environments
  • Lines 230-236: Bucket initialization script could benefit from better error handling
✨ Recommendations

Priority 1 (Security):

  1. Generate secure random password for Restic repository in local environments
  2. Make MinIO credentials configurable rather than using defaults
  3. Add warning documentation about security implications

Priority 2 (Robustness):

  1. Make K8up version and schedules configurable
  2. Add backup monitoring/alerting configuration
  3. Improve error handling in MinIO initialization

Priority 3 (Enhancement):

  1. Add backup verification procedures
  2. Consider backup encryption at rest documentation
  3. Add example disaster recovery runbook

Recommendation: Approve with minor security improvements

This is a solid implementation that significantly improves the registry's data durability. The security concerns are primarily related to local development environments and don't affect production deployment security. The architecture is well-designed and the documentation is comprehensive.


@domdomegg
Copy link
Member Author

Re Claude's review:

  1. Using default passwords in local dev environment seems fine, and simplifies setup + contributor flow for almost no security benefit IMO.
  2. This is configuration as code, it is already configurable! Monitoring+alerting maybe a good idea, but going to punt that as out of scope for now (can use kubectl in mean time). Minio spins up fine locally with current code, Pulumi already handles errors.
  3. Think these are out of scope for getting something basic working now.

🏠 Remote-Dev: homespace
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Backup cadence implemented for database
1 participant