Simplify permission checks for creating namespaces #1214


Open
wants to merge 17 commits into base: main

Conversation


@ilongin ilongin commented Jul 8, 2025

Trying to simplify the logic around permissions like "is creation of a namespace / project allowed or not" by lifting awareness of whether the process is run in the CLI or in Studio up to the Catalog class.

This allowed removing a couple of methods from the metastore:

  • is_studio
  • is_local_dataset
  • namespace_allowed_to_create
  • project_allowed_to_create
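Conceptually, the change can be sketched like this (a simplified illustration, not the exact DataChain code; the constructor signature and loader wiring shown here are approximations):

```python
class SQLiteMetastore:
    """Local metastore used by the CLI; no is_studio override needed anymore."""


class Catalog:
    """Now carries the CLI-vs-Studio awareness itself via is_cli."""

    def __init__(self, metastore, is_cli: bool = True):
        # the metastore is no longer asked about environment or permissions
        self.metastore = metastore
        self._is_cli = is_cli

    @property
    def is_cli(self) -> bool:
        """True when the process was started from the CLI, False in Studio."""
        return self._is_cli


# the loader decides the flag once, when wiring the catalog together
metastore = SQLiteMetastore()
catalog = Catalog(metastore, is_cli=isinstance(metastore, SQLiteMetastore))
```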

Summary by Sourcery

Simplify permission checks by replacing metastore-based flags with a centralized Catalog.is_cli indicator, update CLI commands to use it for routing between local and Studio operations, refactor loader and save logic accordingly, and align tests with the new mechanism.

Enhancements:

  • Introduce Catalog.is_cli flag and remove deprecated metastore properties (is_studio, is_local_dataset, namespace_allowed_to_create, project_allowed_to_create).
  • Update CLI dataset commands (rm_dataset, edit_dataset, delete_dataset) to use catalog.is_cli for permission and routing logic.
  • Refactor loader to determine is_cli based on metastore type and propagate it through the Catalog constructor.
  • Restrict automatic project creation in save logic when running in CLI mode.

Tests:

  • Replace allow_create_project and allow_create_namespace fixtures with a single is_cli fixture and mock Catalog.is_cli.
  • Update parametrized tests to use is_cli instead of legacy allow-create flags.



sourcery-ai bot commented Jul 8, 2025

Reviewer's Guide

Introduces a centralized Catalog.is_cli flag to consolidate environment context, removes deprecated metastore permission properties, refactors CLI commands and loader to route between local and Studio operations based on is_cli, updates save and creation workflows to enforce CLI restrictions, and aligns tests with the new mechanism.

Sequence diagram for dataset removal with new is_cli logic

sequenceDiagram
    actor User
    participant CLI
    participant Catalog
    participant Config
    participant Studio
    participant Metastore
    User->>CLI: rm_dataset(...)
    CLI->>Catalog: get_full_dataset_name(name)
    CLI->>Catalog: is_cli
    alt is_cli and studio
        CLI->>Config: read studio token
        alt token exists
            CLI->>Studio: remove_studio_dataset(...)
        else token missing
            CLI->>CLI: raise DataChainError
        end
    else
        CLI->>Metastore: get_project(...)
        CLI->>Catalog: edit local dataset
    end

Class diagram for Catalog and Metastore permission refactor

classDiagram
    class Catalog {
        - _is_cli: bool
        + is_cli: bool
    }
    class AbstractMetastore {
        <<abstract>>
        - Removed: is_studio: bool
        - Removed: is_local_dataset(dataset_namespace: str): bool
        - Removed: namespace_allowed_to_create: bool
        - Removed: project_allowed_to_create: bool
    }
    Catalog --> AbstractMetastore : metastore
    class SQLiteMetastore {
        // No longer implements is_studio
    }
    AbstractMetastore <|-- SQLiteMetastore

Class diagram for loader and Catalog instantiation changes

classDiagram
    class Loader {
        + get_catalog(...): Catalog
    }
    class Catalog {
        + is_cli: bool
    }
    Loader --> Catalog : returns
    class SQLiteMetastore
    Loader ..> SQLiteMetastore : uses for is_cli detection

Class diagram for namespace and project creation permission checks

classDiagram
    class Session {
        + catalog: Catalog
    }
    class Catalog {
        + is_cli: bool
    }
    class Namespace {
        + validate_name(name)
    }
    class Project {
        + validate_name(name)
    }
    Session --> Catalog
    Namespace ..> Session : uses
    Project ..> Session : uses
    Namespace ..> Catalog : checks is_cli for permission
    Project ..> Catalog : checks is_cli for permission

File-Level Changes

Change Details Files
Centralize environment context using Catalog.is_cli
  • Add is_cli parameter to Catalog constructor and expose via is_cli property
  • Determine and pass is_cli based on metastore type in loader
  • Remove deprecated metastore properties (is_studio, is_local_dataset, namespace_allowed_to_create, project_allowed_to_create)
  • Remove is_studio override in SQLiteMetastore
src/datachain/catalog/catalog.py
src/datachain/catalog/loader.py
src/datachain/data_storage/metastore.py
src/datachain/data_storage/sqlite.py
Route dataset CLI commands through catalog.is_cli instead of metastore flags
  • Replace metastore.is_local_dataset checks with catalog.is_cli in rm_dataset and edit_dataset
  • Conditionally invoke Studio API commands based on catalog.is_cli and token presence
  • Remove redundant token retrieval and Studio call duplication
src/datachain/cli/commands/datasets.py
src/datachain/lib/dc/datasets.py
Update save, namespace and project creation to enforce CLI restrictions
  • Use not catalog.is_cli instead of metastore.project_allowed_to_create for save logic
  • Raise NamespaceCreateNotAllowedError and ProjectCreateNotAllowedError when session.catalog.is_cli is true
src/datachain/lib/dc/datachain.py
src/datachain/lib/namespaces.py
src/datachain/lib/projects.py
Simplify test fixtures to use is_cli and mock Catalog.is_cli
  • Replace allow_create_project/allow_create_namespace fixtures with a single is_cli fixture
  • Patch Catalog.is_cli in tests instead of AbstractMetastore flags
  • Update pytest.mark.parametrize decorators to use is_cli across unit and functional tests
tests/conftest.py
tests/unit/lib/test_datachain.py
tests/unit/lib/test_namespace.py
tests/unit/lib/test_project.py
tests/func/test_read_dataset_remote.py
tests/func/test_datasets.py
tests/func/test_pull.py
tests/test_cli_studio.py
Refine remote fallback logic using is_cli
  • Compute is_local based on catalog.is_cli and default namespace instead of metastore.is_local_dataset
  • Use is_local to drive remote fallback and error raising in get_dataset_with_remote_fallback
src/datachain/catalog/catalog.py
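The unified test fixture described in the last change group can be approximated like this (a sketch; the real fixture in tests/conftest.py would patch datachain.catalog.Catalog and read its value from pytest's request.param, here replaced by a plain argument and a local stand-in class):

```python
from contextlib import contextmanager
from unittest.mock import PropertyMock, patch


class Catalog:  # stand-in: the real fixture patches datachain.catalog.Catalog
    @property
    def is_cli(self) -> bool:
        return True


@contextmanager
def is_cli(value: bool = True):
    """What the unified pytest fixture does: patch Catalog.is_cli for a test.

    In tests/conftest.py this would be an @pytest.fixture taking the value
    from request.param, so parametrized tests can drive both modes.
    """
    with patch(f"{__name__}.Catalog.is_cli",
               new_callable=PropertyMock, return_value=value):
        yield value
```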


@ilongin ilongin marked this pull request as draft July 8, 2025 20:51
@sourcery-ai sourcery-ai bot left a comment


Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/lib/namespaces.py:31` </location>
<code_context>
     """
     session = Session.get(session)

-    if not session.catalog.metastore.namespace_allowed_to_create:
+    if session.catalog.is_cli:
         raise NamespaceCreateNotAllowedError("Creating namespace is not allowed")

</code_context>

<issue_to_address>
Logic for namespace creation restriction appears inverted.

This change may block namespace creation in CLI mode even when permitted. Please verify that this aligns with the intended behavior for both CLI and Studio environments.
</issue_to_address>

### Comment 2
<location> `src/datachain/lib/projects.py:35` </location>
<code_context>
     """
     session = Session.get(session)

-    if not session.catalog.metastore.project_allowed_to_create:
+    if session.catalog.is_cli:
         raise ProjectCreateNotAllowedError("Creating project is not allowed")

</code_context>

<issue_to_address>
Project creation restriction logic may be reversed.

The updated condition blocks project creation in CLI mode, which may not be intended. Confirm that this matches the desired permission logic for CLI and Studio environments.
</issue_to_address>

### Comment 3
<location> `src/datachain/lib/dc/datasets.py:346` </location>
<code_context>
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if not catalog.metastore.is_local_dataset(namespace_name) and studio:
+    if catalog.is_cli and studio:
+        # removing Studio dataset from CLI
         from datachain.studio import remove_studio_dataset
</code_context>

<issue_to_address>
Dataset deletion logic now depends on CLI mode rather than dataset locality.

Please verify that this change aligns with the intended behavior and that all relevant scenarios are covered.
</issue_to_address>

### Comment 4
<location> `src/datachain/catalog/catalog.py:530` </location>
<code_context>
             Callable[["AbstractWarehouse"], None]
         ] = None,
         in_memory: bool = False,
+        is_cli: Optional[bool] = True,
     ):
         datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
</code_context>

<issue_to_address>
Defaulting is_cli to True may not always reflect the actual environment.

This could cause incorrect behavior if Catalog is used outside a CLI context. Recommend setting is_cli explicitly where Catalog is instantiated or inferring it from the metastore type.

Suggested implementation:

```python
        in_memory: bool = False,
        is_cli: Optional[bool] = None,
    ):
        datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
        datachain_dir.init()
        }
        self._warehouse_ready_callback = warehouse_ready_callback
        self.in_memory = in_memory

        # Infer is_cli from metastore type if not explicitly provided
        if is_cli is not None:
            self._is_cli = is_cli
        else:
            # Example inference: if metastore is a CLI metastore, set True, else False
            # You may need to adjust this logic based on your actual metastore types
            self._is_cli = hasattr(self.metastore, "is_cli") and getattr(self.metastore, "is_cli", False)

```

- You will need to ensure that all instantiations of this class in your codebase are updated to explicitly pass `is_cli` if inference is not desired or not possible.
- Adjust the inference logic as needed to match your actual metastore implementation and types.
</issue_to_address>

### Comment 5
<location> `src/datachain/catalog/loader.py:140` </location>
<code_context>
     we can provide DATACHAIN_METASTORE_ARG_TEAM_ID=12345 env variable.
     """
     from datachain.catalog import Catalog
+    from datachain.data_storage.sqlite import SQLiteMetastore

+    metastore = get_metastore(in_memory=in_memory)
     return Catalog(
-        metastore=get_metastore(in_memory=in_memory),
+        metastore=metastore,
         warehouse=get_warehouse(in_memory=in_memory),
         client_config=client_config,
         in_memory=in_memory,
+        is_cli=isinstance(metastore, SQLiteMetastore),
     )
</code_context>

<issue_to_address>
CLI mode is inferred from metastore type, which may not be robust.

Relying on isinstance for CLI mode detection may break if new metastore types are added or if SQLiteMetastore is used outside CLI. Please use a more explicit check for CLI mode.

Suggested implementation:

```python
    metastore = get_metastore(in_memory=in_memory)
    # Explicitly determine CLI mode, e.g., via a function argument or environment variable
    is_cli = bool(os.environ.get("DATACHAIN_CLI_MODE", False))
    return Catalog(
        metastore=metastore,
        warehouse=get_warehouse(in_memory=in_memory),
        client_config=client_config,
        in_memory=in_memory,
        is_cli=is_cli,
    )

```

- You will need to import `os` at the top of the file if it is not already imported.
- If CLI mode should be determined differently (e.g., via a function argument), adjust the assignment to `is_cli` accordingly and update the function signature and all call sites.
</issue_to_address>

### Comment 6
<location> `src/datachain/lib/dc/datachain.py:585` </location>
<code_context>
             project = self.session.catalog.metastore.get_project(
                 project_name,
                 namespace_name,
-                create=self.session.catalog.metastore.project_allowed_to_create,
+                create=not self.session.catalog.is_cli,
             )
         except ProjectNotFoundError as e:
</code_context>

<issue_to_address>
Project creation flag is now inverted based on CLI mode.

This change may prevent project creation in CLI mode, which differs from the previous behavior. Please verify if this aligns with the intended permissions.
</issue_to_address>

### Comment 7
<location> `tests/unit/lib/test_namespace.py:29` </location>
<code_context>
-@pytest.mark.parametrize("allow_create_namespace", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_create_by_user_not_allowed(test_session, allow_create_namespace):
+def test_create_by_user_not_allowed(test_session, is_cli):
     with pytest.raises(NamespaceCreateNotAllowedError) as excinfo:
         create_namespace("dev", session=test_session)
</code_context>

<issue_to_address>
Test for namespace creation denial is preserved and updated.

Consider adding a test for when 'is_cli' is False to verify that namespace creation is permitted in that case.
</issue_to_address>

### Comment 8
<location> `tests/unit/lib/test_project.py:65` </location>
<code_context>
         )


-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>

<issue_to_address>
Test for project creation denial updated to use 'is_cli'.

Please also add a test for when 'is_cli' is False to confirm project creation is allowed in that scenario.
</issue_to_address>

### Comment 9
<location> `tests/unit/lib/test_datachain.py:3591` </location>
<code_context>
         )


-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>

<issue_to_address>
Test for project creation not allowed updated to use 'is_cli'.

Please add a test for when 'is_cli' is False to ensure both allowed and not allowed cases are covered.

Suggested implementation:

```python
@pytest.mark.parametrize("is_cli", [True, False])
@skip_if_not_sqlite
def test_save_create_project_not_allowed(test_session, is_cli):
    if is_cli:
        with pytest.raises(ProjectCreateNotAllowedError):
            dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
                "dev.numbers.fibonacci"
            )
    else:
        # Should succeed when project creation is allowed
        result = dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
            "dev.numbers.fibonacci"
        )
        assert result is not None

```

- Ensure that the `dc` object and the `save` method are correctly set up to respect the `is_cli` parameter in your actual implementation.
- Adjust the assertion for the allowed case (`is_cli=False`) if there is a more specific expected result than just `result is not None`.
</issue_to_address>

### Comment 10
<location> `tests/unit/lib/test_datachain.py:3226` </location>
<code_context>


 @pytest.mark.parametrize("force", (True, False))
+@pytest.mark.parametrize("is_cli", (True,))
 @skip_if_not_sqlite
 def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):
</code_context>

<issue_to_address>
Studio dataset deletion tests parameterized with 'is_cli'.

Please add tests for 'is_cli=False' to cover the non-Studio deletion path as well.

Suggested implementation:

```python
@pytest.mark.parametrize("force", (True, False))
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):

```

```python
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio_not_found(
    test_session, studio_token, requests_mock

```
</issue_to_address>








Comment on lines 184 to 188
    token = Config().read().get("studio", {}).get("token")
    if not token:
        raise DataChainError(
            "Not logged in to Studio. Log in with 'datachain auth login'."
        )
issue (code-quality): We've found these issues:


cloudflare-workers-and-pages bot commented Jul 9, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: dcdf924
Status: ✅  Deploy successful!
Preview URL: https://182cfe3d.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1208-simplify-permis.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Jul 9, 2025

Codecov Report

❌ Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.73%. Comparing base (9e21c31) to head (0b11e63).

Files with missing lines Patch % Lines
src/datachain/cli/commands/datasets.py 77.77% 0 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
- Coverage   88.73%   88.73%   -0.01%     
==========================================
  Files         155      155              
  Lines       14138    14136       -2     
  Branches     1997     1998       +1     
==========================================
- Hits        12546    12544       -2     
  Misses       1124     1124              
  Partials      468      468              
Flag Coverage Δ
datachain 88.67% <93.10%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/__init__.py 100.00% <ø> (ø)
src/datachain/catalog/catalog.py 85.99% <100.00%> (+0.03%) ⬆️
src/datachain/catalog/loader.py 75.00% <100.00%> (+0.31%) ⬆️
src/datachain/data_storage/metastore.py 93.75% <ø> (-0.12%) ⬇️
src/datachain/data_storage/sqlite.py 85.64% <ø> (-0.11%) ⬇️
src/datachain/lib/dc/__init__.py 100.00% <100.00%> (ø)
src/datachain/lib/dc/datachain.py 91.51% <ø> (ø)
src/datachain/lib/dc/datasets.py 95.12% <100.00%> (ø)
src/datachain/lib/dc/utils.py 83.87% <100.00%> (+0.82%) ⬆️
src/datachain/lib/namespaces.py 100.00% <100.00%> (ø)
... and 3 more

@ilongin ilongin marked this pull request as ready for review July 12, 2025 22:58

@sourcery-ai sourcery-ai bot left a comment


Hey @ilongin - I've reviewed your changes - here's some feedback:

  • Consider defaulting Catalog.is_cli to False (and overriding it only for CLI contexts in the loader) so that non-SQLite/metastore use-cases aren’t erroneously treated as CLI by default.
  • The repeated pattern of checking for a studio token and raising a DataChainError in CLI commands could be extracted into a helper to reduce duplication and improve readability.
  • Add a docstring or brief comment on the new is_cli property in Catalog to clearly document its intended semantics (CLI vs Studio) for future maintainers.


@ilongin ilongin mentioned this pull request Jul 14, 2025
@@ -1111,7 +1117,12 @@ def get_dataset_with_remote_fallback(
 if version:
     update = False

-if self.metastore.is_local_dataset(namespace_name) or not update:
+# local dataset is the one that is in Studio or in CLI but has default namespace
+is_local = (
I'm not sure I understand this condition - how can we make it simpler, easier to read? (I'm also not sure I understand the comment above)

@ilongin ilongin Jul 16, 2025

Before namespaces / projects we didn't have any kind of check here, and in Studio we would still try to fetch from remote (Studio) if a dataset was missing locally, which was wrong (it would fail with a strange error like a missing token).
Now we can determine whether the dataset we are fetching is from its own (local) DB or can be fetched from remote / Studio if missing locally.

We never fall back to Studio if:

  1. The script is already running in Studio
  2. The dataset starts with local.local.*

I agree this whole function is way too complex and confusing and I will try to refactor it.
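The two rules above collapse into a single predicate, sketched here with illustrative names (the real check lives inside get_dataset_with_remote_fallback and its variable naming differs):

```python
def should_fallback_to_studio(is_cli: bool, namespace_name: str,
                              default_namespace: str = "local") -> bool:
    """Fall back to Studio only for CLI runs on a non-default namespace.

    Never fall back when:
      1. the script is already running in Studio (is_cli is False), or
      2. the dataset lives in the default local namespace (local.local.*).
    """
    is_local = (not is_cli) or namespace_name == default_namespace
    return not is_local
```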


I've added better comment for that variable and changed name. I think it should be clear now.

@ilongin ilongin requested a review from shcheklein July 17, 2025 13:11
@@ -164,21 +167,22 @@ def edit_dataset(
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if catalog.metastore.is_local_dataset(namespace_name):
+    if namespace_name != catalog.metastore.default_namespace_name:

let's introduce a descriptive var - studio_dataset = ...

to make the condition descriptive:

if studio_dataset:
    ...


(tbh I still don't love this very non-obvious way of detecting it by analyzing namespaces and comparing with the default; one needs to know a lot about namespaces to understand this code and why it is correct. It is not obvious.)


Added is_studio_dataset var as you suggested.

Regarding the second comment: the only way to fix this and make it explicit is to re-introduce a studio flag, but that's exactly what we tried to avoid in the first place with namespaces / projects, i.e. having those flags everywhere. We wanted to be able to decide implicitly whether some dataset is in Studio or local, and that's what this code does. The trade-off is that the code is less clean.

Contributor

I just wonder if we can use another approach for this, like setting an env variable or using one of the env variables we already have (like DATACHAIN_COMPUTE_CLUSTER_ID).

This is a general task: knowing whether a query is running from within Studio or the CLI. @dmpetrov was asking about this specifically, and it would be great to have something like a dc.is_studio flag/variable without instantiating the whole Catalog; Catalog itself could get this knowledge from the same variable seamlessly, without an additional constructor param.

Contributor Author

Yes, I was thinking about this as well. I didn't want to introduce a new env variable for now, as I didn't know we needed this in general as you mentioned, but if we do I can add it.

Contributor Author

Added an env variable called DATACHAIN_IS_STUDIO and a function to check it, e.g.

import datachain as dc
dc.is_studio()  # True or False

It's a function because that makes it easier to handle and to mock in tests.
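A minimal sketch of how such a helper might read the env variable. The variable name DATACHAIN_IS_STUDIO comes from this PR; the set of accepted truthy values here is an assumption:

```python
import os

def is_studio() -> bool:
    """Return True when running inside Studio, based on DATACHAIN_IS_STUDIO."""
    # Treat a few common truthy spellings as "Studio"; anything else is CLI.
    return os.environ.get("DATACHAIN_IS_STUDIO", "").strip().lower() in (
        "1",
        "true",
        "yes",
    )
```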

@ilongin ilongin requested a review from shcheklein August 20, 2025 11:29
@@ -164,21 +167,24 @@ def edit_dataset(
):
namespace_name, project_name, name = catalog.get_full_dataset_name(name)

if catalog.metastore.is_local_dataset(namespace_name):
# studio dataset is if it has non "local" namespace
is_studio_dataset = namespace_name != catalog.metastore.default_namespace_name
Member

what about a Studio dataset with the default global namespace?

Contributor Author

You mean whether this will still work correctly? It should, as it's different from local, which is what catalog.metastore.default_namespace_name holds.

Member

catalog.metastore.default_namespace_name - is it local even for Studio? How do we redefine it?

Contributor Author

@ilongin ilongin Aug 21, 2025

Ok, now I see this better. This is part of the code that will only run locally (CLI); in Studio we don't need the CLI.
Here we need to check whether the dataset the user wants to edit is a local dataset from the .datachain folder or a remote Studio dataset, which is the case only if the dataset's namespace differs from the default one, which is always local in a local / CLI run. I could have hardcoded local here instead of catalog.metastore.default_namespace_name as well.

Note: this code would break if someone ran the CLI in Studio - it would try to go to another Studio if the namespace differed from global (the default in Studio) ... but this code is never going to run from Studio in the first place - at least that was never the plan.

Member

can we put in an assert that it is not Studio (to document this and prevent it from running)?

Contributor Author

Added a check to disallow running the CLI from the Studio environment.

Member

Let’s use the helper I mentioned above

@@ -445,6 +449,7 @@ def test_read_dataset_remote_update_flag_no_version(


@skip_if_not_sqlite
@pytest.mark.parametrize("is_studio", (False,))
Member

how exactly does it work?

pytest.mark.parametrize("is_studio", (False,))

what do we need it for?

Contributor Author

By default, all tests run with the Catalog.is_studio flag set to True. This allows creating namespaces / projects etc.; "studio" datachain allows more things, so to avoid having to mock the flag to True all the time, it's set to True by default in tests.
We can also change that with this fixture - set it to False, or to both True and False, depending on the test's needs. This particular test is testing the CLI, so we need to set the flag to False.
The logic around that fixture is in conftest.py: the is_studio and mock_is_studio fixtures.

Member

Initial reaction - I'm lost :)

Second thought - two fixtures with a negated condition; there's gotta be a better / simpler way to do this that is clear when you read the code ...

Third thought (after reading conftest a bit):

  • why autouse for this fixture? can we just pass mock_is_studio where it is appropriate? That would be a straightforward way to use mocks ... am I missing something?
  • @skip_if_not_sqlite - can this one also define this ... or can it be replaced with a single decorator?
  • why do we need to set False (studio) explicitly - shouldn't it be the default if we already run on sqlite only?

Contributor Author

@ilongin ilongin Aug 21, 2025

Ok, so this is the problem:

  1. Default value for Catalog.is_studio is False
  2. There are a lot of scenarios, mostly related to having custom namespace / project in tests, for which we need Catalog.is_studio to be True
  3. There are rare scenarios / tests where we need Catalog.is_studio to be explicitly False (CLI methods testing)
  4. With 2) and 3) in mind, it's better to have the flag set to True by default and have a special fixture to set it to False on those rare tests where needed, than the other way around (we would need to repeat the fixture in a lot of tests just to set the flag to True)
  5. We don't want to go in the direction of splitting tests for CLI and Studio like we did with the @skip_if_not_sqlite fixture.

The solution was to set Catalog.is_studio to True by default so that we are not limited in creating custom namespaces / projects in tests. This is achieved as follows:
mock_is_studio is autoused and sets Catalog.is_studio to True, unless the other fixture, is_studio, is set to False - in that case we just yield, the flag is never set to True, and it stays False (the default). This way we can set Catalog.is_studio to False just by parametrizing the is_studio fixture to False on those rare tests that need this scenario (CLI method testing, as mentioned above).

I know this is a little bit complex, but at the time I didn't have a better idea. Maybe this will be simplified when I introduce the ENV variable to control is_studio, as Vlad suggested, but on first thought I don't think it will change much.
If someone has a better idea how to handle this, I'm open to suggestions ... this is really a general problem of having slightly different logic on Studio vs local while wanting non-separated tests for both.
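The pattern boils down to "force the flag to True unless the test opts out". A self-contained sketch of that mechanism, using a dummy Catalog stand-in; the real fixtures live in conftest.py and patch datachain's Catalog, so names and defaults here follow the comment while the rest is illustrative:

```python
from contextlib import nullcontext
from unittest.mock import patch

class Catalog:
    # Dummy stand-in for datachain's Catalog; the real default is also False.
    is_studio = False

def mock_is_studio(is_studio: bool = True):
    """Context manager mirroring the autouse fixture's behavior."""
    if is_studio:
        # Default case: most tests get Studio behavior (flag forced to True).
        return patch.object(Catalog, "is_studio", True)
    # Opt-out (the rare CLI tests): do nothing, leaving the default False.
    return nullcontext()
```

A test parametrized with is_studio=False then simply runs with the untouched default flag.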

Member

thanks @ilongin for a good explanation!

There are a lot of scenarios, mostly related to having custom namespace / project in tests, for which we need Catalog.is_studio to be True

do we need it primarily for tests that pull datasets from Studio one way or another? Or what are the other major examples?

if is_studio_dataset:
from datachain.studio import edit_studio_dataset

if Config().read().get("studio", {}).get("token"):
Member

unrelated: don't we have some util in StudioClient to do this? to get the token?

Contributor Author

Not sure, but we definitely need to refactor this file, as it duplicates these lines multiple times.
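A hypothetical helper that would deduplicate those lines. The config dict shape (as returned by Config().read()) comes from the snippet above; the helper name is an assumption:

```python
from typing import Optional

def get_studio_token(config: dict) -> Optional[str]:
    """Extract the Studio token from an already-read config dict, if any."""
    # Missing "studio" section or missing "token" key both yield None.
    return config.get("studio", {}).get("token")
```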

Member

@shcheklein shcheklein left a comment

Looks good overall .. some minor questions

# to fetch dataset with local.local namespace & project as that one cannot
# exist in Studio in the first place
no_fallback = (
is_studio() or namespace_name == self.metastore.default_namespace_name
Member

Let's please add a helper is_namespace_local with an assert that it is "local", with an explanation, and with this namespace check

Member

I mean to replace the second part only … to make it more explanatory
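A sketch of the suggested helper, covering the second half of the condition. The name and the "local" assert are taken from the review comment; the exact signature and docstring are assumptions:

```python
def is_namespace_local(namespace_name: str, default_namespace_name: str) -> bool:
    """True when the dataset lives in the default local namespace.

    Only meaningful in a CLI run, where the default namespace is always
    "local"; any other namespace implies a remote / Studio dataset.
    """
    # Document and enforce that this check only makes sense in a CLI run.
    assert default_namespace_name == "local", (
        "expected a CLI run, where the default namespace is 'local'"
    )
    return namespace_name == default_namespace_name
```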
