Simplify permission checks for creating namespaces #1214


Open
wants to merge 17 commits into base: main

Conversation


@ilongin ilongin commented Jul 8, 2025

Trying to simplify the logic around permissions like "is creation of a namespace / project allowed or not" by lifting awareness of whether the process is run in the CLI or in Studio up to the Catalog class.

This allowed removing a couple of methods from the metastore:

  • is_studio
  • is_local_dataset
  • namespace_allowed_to_create
  • project_allowed_to_create
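Conceptually, the change can be sketched like this (a simplified illustration, not the exact DataChain code; the constructor signature and loader wiring shown here are approximations):

```python
class SQLiteMetastore:
    """Local metastore used by the CLI; no is_studio override needed anymore."""


class Catalog:
    """Now carries the CLI-vs-Studio awareness itself via is_cli."""

    def __init__(self, metastore, is_cli: bool = True):
        # the metastore is no longer asked about environment or permissions
        self.metastore = metastore
        self._is_cli = is_cli

    @property
    def is_cli(self) -> bool:
        """True when the process was started from the CLI, False in Studio."""
        return self._is_cli


# the loader decides the flag once, when wiring the catalog together
metastore = SQLiteMetastore()
catalog = Catalog(metastore, is_cli=isinstance(metastore, SQLiteMetastore))
```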

Summary by Sourcery

Simplify permission checks by replacing metastore-based flags with a centralized Catalog.is_cli indicator, update CLI commands to use it for routing between local and Studio operations, refactor loader and save logic accordingly, and align tests with the new mechanism.

Enhancements:

  • Introduce Catalog.is_cli flag and remove deprecated metastore properties (is_studio, is_local_dataset, namespace_allowed_to_create, project_allowed_to_create).
  • Update CLI dataset commands (rm_dataset, edit_dataset, delete_dataset) to use catalog.is_cli for permission and routing logic.
  • Refactor loader to determine is_cli based on metastore type and propagate it through the Catalog constructor.
  • Restrict automatic project creation in save logic when running in CLI mode.

Tests:

  • Replace allow_create_project and allow_create_namespace fixtures with a single is_cli fixture and mock Catalog.is_cli.
  • Update parametrized tests to use is_cli instead of legacy allow-create flags.



sourcery-ai bot commented Jul 8, 2025

Reviewer's Guide

Introduces a centralized Catalog.is_cli flag to consolidate environment context, removes deprecated metastore permission properties, refactors CLI commands and loader to route between local and Studio operations based on is_cli, updates save and creation workflows to enforce CLI restrictions, and aligns tests with the new mechanism.

Sequence diagram for dataset removal with new is_cli logic

sequenceDiagram
    actor User
    participant CLI
    participant Catalog
    participant Config
    participant Studio
    participant Metastore
    User->>CLI: rm_dataset(...)
    CLI->>Catalog: get_full_dataset_name(name)
    CLI->>Catalog: is_cli
    alt is_cli and studio
        CLI->>Config: read studio token
        alt token exists
            CLI->>Studio: remove_studio_dataset(...)
        else token missing
            CLI->>CLI: raise DataChainError
        end
    else
        CLI->>Metastore: get_project(...)
        CLI->>Catalog: edit local dataset
    end

Class diagram for Catalog and Metastore permission refactor

classDiagram
    class Catalog {
        - _is_cli: bool
        + is_cli: bool
    }
    class AbstractMetastore {
        <<abstract>>
        - Removed: is_studio: bool
        - Removed: is_local_dataset(dataset_namespace: str): bool
        - Removed: namespace_allowed_to_create: bool
        - Removed: project_allowed_to_create: bool
    }
    Catalog --> AbstractMetastore : metastore
    class SQLiteMetastore {
        // No longer implements is_studio
    }
    AbstractMetastore <|-- SQLiteMetastore

Class diagram for loader and Catalog instantiation changes

classDiagram
    class Loader {
        + get_catalog(...): Catalog
    }
    class Catalog {
        + is_cli: bool
    }
    Loader --> Catalog : returns
    class SQLiteMetastore
    Loader ..> SQLiteMetastore : uses for is_cli detection

Class diagram for namespace and project creation permission checks

classDiagram
    class Session {
        + catalog: Catalog
    }
    class Catalog {
        + is_cli: bool
    }
    class Namespace {
        + validate_name(name)
    }
    class Project {
        + validate_name(name)
    }
    Session --> Catalog
    Namespace ..> Session : uses
    Project ..> Session : uses
    Namespace ..> Catalog : checks is_cli for permission
    Project ..> Catalog : checks is_cli for permission

File-Level Changes

Change Details Files
Centralize environment context using Catalog.is_cli
  • Add is_cli parameter to Catalog constructor and expose via is_cli property
  • Determine and pass is_cli based on metastore type in loader
  • Remove deprecated metastore properties (is_studio, is_local_dataset, namespace_allowed_to_create, project_allowed_to_create)
  • Remove is_studio override in SQLiteMetastore
src/datachain/catalog/catalog.py
src/datachain/catalog/loader.py
src/datachain/data_storage/metastore.py
src/datachain/data_storage/sqlite.py
Route dataset CLI commands through catalog.is_cli instead of metastore flags
  • Replace metastore.is_local_dataset checks with catalog.is_cli in rm_dataset and edit_dataset
  • Conditionally invoke Studio API commands based on catalog.is_cli and token presence
  • Remove redundant token retrieval and Studio call duplication
src/datachain/cli/commands/datasets.py
src/datachain/lib/dc/datasets.py
Update save, namespace and project creation to enforce CLI restrictions
  • Use not catalog.is_cli instead of metastore.project_allowed_to_create for save logic
  • Raise NamespaceCreateNotAllowedError and ProjectCreateNotAllowedError when session.catalog.is_cli is true
src/datachain/lib/dc/datachain.py
src/datachain/lib/namespaces.py
src/datachain/lib/projects.py
Simplify test fixtures to use is_cli and mock Catalog.is_cli
  • Replace allow_create_project/allow_create_namespace fixtures with a single is_cli fixture
  • Patch Catalog.is_cli in tests instead of AbstractMetastore flags
  • Update pytest.mark.parametrize decorators to use is_cli across unit and functional tests
tests/conftest.py
tests/unit/lib/test_datachain.py
tests/unit/lib/test_namespace.py
tests/unit/lib/test_project.py
tests/func/test_read_dataset_remote.py
tests/func/test_datasets.py
tests/func/test_pull.py
tests/test_cli_studio.py
Refine remote fallback logic using is_cli
  • Compute is_local based on catalog.is_cli and default namespace instead of metastore.is_local_dataset
  • Use is_local to drive remote fallback and error raising in get_dataset_with_remote_fallback
src/datachain/catalog/catalog.py
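The unified test fixture described in the last change group can be approximated like this (a sketch; the real fixture in tests/conftest.py would patch datachain.catalog.Catalog and read its value from pytest's request.param, here replaced by a plain argument and a local stand-in class):

```python
from contextlib import contextmanager
from unittest.mock import PropertyMock, patch


class Catalog:  # stand-in: the real fixture patches datachain.catalog.Catalog
    @property
    def is_cli(self) -> bool:
        return True


@contextmanager
def is_cli(value: bool = True):
    """What the unified pytest fixture does: patch Catalog.is_cli for a test.

    In tests/conftest.py this would be an @pytest.fixture taking the value
    from request.param, so parametrized tests can drive both modes.
    """
    with patch(f"{__name__}.Catalog.is_cli",
               new_callable=PropertyMock, return_value=value):
        yield value
```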


@ilongin ilongin marked this pull request as draft July 8, 2025 20:51
@sourcery-ai sourcery-ai bot left a comment


Hey @ilongin - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `src/datachain/lib/namespaces.py:31` </location>
<code_context>
     """
     session = Session.get(session)

-    if not session.catalog.metastore.namespace_allowed_to_create:
+    if session.catalog.is_cli:
         raise NamespaceCreateNotAllowedError("Creating namespace is not allowed")

</code_context>

<issue_to_address>
Logic for namespace creation restriction appears inverted.

This change may block namespace creation in CLI mode even when permitted. Please verify that this aligns with the intended behavior for both CLI and Studio environments.
</issue_to_address>

### Comment 2
<location> `src/datachain/lib/projects.py:35` </location>
<code_context>
     """
     session = Session.get(session)

-    if not session.catalog.metastore.project_allowed_to_create:
+    if session.catalog.is_cli:
         raise ProjectCreateNotAllowedError("Creating project is not allowed")

</code_context>

<issue_to_address>
Project creation restriction logic may be reversed.

The updated condition blocks project creation in CLI mode, which may not be intended. Confirm that this matches the desired permission logic for CLI and Studio environments.
</issue_to_address>

### Comment 3
<location> `src/datachain/lib/dc/datasets.py:346` </location>
<code_context>
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if not catalog.metastore.is_local_dataset(namespace_name) and studio:
+    if catalog.is_cli and studio:
+        # removing Studio dataset from CLI
         from datachain.studio import remove_studio_dataset
</code_context>

<issue_to_address>
Dataset deletion logic now depends on CLI mode rather than dataset locality.

Please verify that this change aligns with the intended behavior and that all relevant scenarios are covered.
</issue_to_address>

### Comment 4
<location> `src/datachain/catalog/catalog.py:530` </location>
<code_context>
             Callable[["AbstractWarehouse"], None]
         ] = None,
         in_memory: bool = False,
+        is_cli: Optional[bool] = True,
     ):
         datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
</code_context>

<issue_to_address>
Defaulting is_cli to True may not always reflect the actual environment.

This could cause incorrect behavior if Catalog is used outside a CLI context. Recommend setting is_cli explicitly where Catalog is instantiated or inferring it from the metastore type.

Suggested implementation:

```python
        in_memory: bool = False,
        is_cli: Optional[bool] = None,
    ):
        datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
        datachain_dir.init()
        }
        self._warehouse_ready_callback = warehouse_ready_callback
        self.in_memory = in_memory

        # Infer is_cli from metastore type if not explicitly provided
        if is_cli is not None:
            self._is_cli = is_cli
        else:
            # Example inference: if metastore is a CLI metastore, set True, else False
            # You may need to adjust this logic based on your actual metastore types
            self._is_cli = hasattr(self.metastore, "is_cli") and getattr(self.metastore, "is_cli", False)

```

- You will need to ensure that all instantiations of this class in your codebase are updated to explicitly pass `is_cli` if inference is not desired or not possible.
- Adjust the inference logic as needed to match your actual metastore implementation and types.
</issue_to_address>

### Comment 5
<location> `src/datachain/catalog/loader.py:140` </location>
<code_context>
     we can provide DATACHAIN_METASTORE_ARG_TEAM_ID=12345 env variable.
     """
     from datachain.catalog import Catalog
+    from datachain.data_storage.sqlite import SQLiteMetastore

+    metastore = get_metastore(in_memory=in_memory)
     return Catalog(
-        metastore=get_metastore(in_memory=in_memory),
+        metastore=metastore,
         warehouse=get_warehouse(in_memory=in_memory),
         client_config=client_config,
         in_memory=in_memory,
+        is_cli=isinstance(metastore, SQLiteMetastore),
     )
</code_context>

<issue_to_address>
CLI mode is inferred from metastore type, which may not be robust.

Relying on isinstance for CLI mode detection may break if new metastore types are added or if SQLiteMetastore is used outside CLI. Please use a more explicit check for CLI mode.

Suggested implementation:

```python
    metastore = get_metastore(in_memory=in_memory)
    # Explicitly determine CLI mode, e.g., via a function argument or environment variable
    is_cli = bool(os.environ.get("DATACHAIN_CLI_MODE", False))
    return Catalog(
        metastore=metastore,
        warehouse=get_warehouse(in_memory=in_memory),
        client_config=client_config,
        in_memory=in_memory,
        is_cli=is_cli,
    )

```

- You will need to import `os` at the top of the file if it is not already imported.
- If CLI mode should be determined differently (e.g., via a function argument), adjust the assignment to `is_cli` accordingly and update the function signature and all call sites.
</issue_to_address>

### Comment 6
<location> `src/datachain/lib/dc/datachain.py:585` </location>
<code_context>
             project = self.session.catalog.metastore.get_project(
                 project_name,
                 namespace_name,
-                create=self.session.catalog.metastore.project_allowed_to_create,
+                create=not self.session.catalog.is_cli,
             )
         except ProjectNotFoundError as e:
</code_context>

<issue_to_address>
Project creation flag is now inverted based on CLI mode.

This change may prevent project creation in CLI mode, which differs from the previous behavior. Please verify if this aligns with the intended permissions.
</issue_to_address>

### Comment 7
<location> `tests/unit/lib/test_namespace.py:29` </location>
<code_context>
-@pytest.mark.parametrize("allow_create_namespace", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_create_by_user_not_allowed(test_session, allow_create_namespace):
+def test_create_by_user_not_allowed(test_session, is_cli):
     with pytest.raises(NamespaceCreateNotAllowedError) as excinfo:
         create_namespace("dev", session=test_session)
</code_context>

<issue_to_address>
Test for namespace creation denial is preserved and updated.

Consider adding a test for when 'is_cli' is False to verify that namespace creation is permitted in that case.
</issue_to_address>

### Comment 8
<location> `tests/unit/lib/test_project.py:65` </location>
<code_context>
         )


-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>

<issue_to_address>
Test for project creation denial updated to use 'is_cli'.

Please also add a test for when 'is_cli' is False to confirm project creation is allowed in that scenario.
</issue_to_address>

### Comment 9
<location> `tests/unit/lib/test_datachain.py:3591` </location>
<code_context>
         )


-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
 @skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>

<issue_to_address>
Test for project creation not allowed updated to use 'is_cli'.

Please add a test for when 'is_cli' is False to ensure both allowed and not allowed cases are covered.

Suggested implementation:

```python
@pytest.mark.parametrize("is_cli", [True, False])
@skip_if_not_sqlite
def test_save_create_project_not_allowed(test_session, is_cli):
    if is_cli:
        with pytest.raises(ProjectCreateNotAllowedError):
            dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
                "dev.numbers.fibonacci"
            )
    else:
        # Should succeed when project creation is allowed
        result = dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
            "dev.numbers.fibonacci"
        )
        assert result is not None

```

- Ensure that the `dc` object and the `save` method are correctly set up to respect the `is_cli` parameter in your actual implementation.
- Adjust the assertion for the allowed case (`is_cli=False`) if there is a more specific expected result than just `result is not None`.
</issue_to_address>

### Comment 10
<location> `tests/unit/lib/test_datachain.py:3226` </location>
<code_context>


 @pytest.mark.parametrize("force", (True, False))
+@pytest.mark.parametrize("is_cli", (True,))
 @skip_if_not_sqlite
 def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):
</code_context>

<issue_to_address>
Studio dataset deletion tests parameterized with 'is_cli'.

Please add tests for 'is_cli=False' to cover the non-Studio deletion path as well.

Suggested implementation:

```python
@pytest.mark.parametrize("force", (True, False))
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):

```

```python
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio_not_found(
    test_session, studio_token, requests_mock

```
</issue_to_address>








Comment on lines 184 to 188
    token = Config().read().get("studio", {}).get("token")
    if not token:
        raise DataChainError(
            "Not logged in to Studio. Log in with 'datachain auth login'."
        )
issue (code-quality): We've found these issues:


cloudflare-workers-and-pages bot commented Jul 9, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: dcdf924
Status: ✅  Deploy successful!
Preview URL: https://182cfe3d.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1208-simplify-permis.datachain-documentation.pages.dev

View logs

Copy link

codecov bot commented Jul 9, 2025

Codecov Report

❌ Patch coverage is 93.10345% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.73%. Comparing base (9e21c31) to head (0b11e63).

Files with missing lines Patch % Lines
src/datachain/cli/commands/datasets.py 77.77% 0 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
- Coverage   88.73%   88.73%   -0.01%     
==========================================
  Files         155      155              
  Lines       14138    14136       -2     
  Branches     1997     1998       +1     
==========================================
- Hits        12546    12544       -2     
  Misses       1124     1124              
  Partials      468      468              
Flag Coverage Δ
datachain 88.67% <93.10%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/__init__.py 100.00% <ø> (ø)
src/datachain/catalog/catalog.py 85.99% <100.00%> (+0.03%) ⬆️
src/datachain/catalog/loader.py 75.00% <100.00%> (+0.31%) ⬆️
src/datachain/data_storage/metastore.py 93.75% <ø> (-0.12%) ⬇️
src/datachain/data_storage/sqlite.py 85.64% <ø> (-0.11%) ⬇️
src/datachain/lib/dc/__init__.py 100.00% <100.00%> (ø)
src/datachain/lib/dc/datachain.py 91.51% <ø> (ø)
src/datachain/lib/dc/datasets.py 95.12% <100.00%> (ø)
src/datachain/lib/dc/utils.py 83.87% <100.00%> (+0.82%) ⬆️
src/datachain/lib/namespaces.py 100.00% <100.00%> (ø)
... and 3 more

@ilongin ilongin marked this pull request as ready for review July 12, 2025 22:58

@sourcery-ai sourcery-ai bot left a comment


Hey @ilongin - I've reviewed your changes - here's some feedback:

  • Consider defaulting Catalog.is_cli to False (and overriding it only for CLI contexts in the loader) so that non-SQLite/metastore use-cases aren’t erroneously treated as CLI by default.
  • The repeated pattern of checking for a studio token and raising a DataChainError in CLI commands could be extracted into a helper to reduce duplication and improve readability.
  • Add a docstring or brief comment on the new is_cli property in Catalog to clearly document its intended semantics (CLI vs Studio) for future maintainers.


@ilongin ilongin mentioned this pull request Jul 14, 2025
@@ -1111,7 +1117,12 @@ def get_dataset_with_remote_fallback(
 if version:
     update = False

-if self.metastore.is_local_dataset(namespace_name) or not update:
+# local dataset is the one that is in Studio or in CLI but has default namespace
+is_local = (
I'm not sure I understand this condition - how can we make it simpler, easier to read? (I'm also not sure I understand the comment above)

@ilongin ilongin Jul 16, 2025

Before namespaces / projects we didn't have any kind of check here, and in Studio we would still try to fetch from remote (Studio) if a dataset was missing locally, which was wrong (it would fail with a strange error like a missing token).
Now we can determine whether the dataset we are fetching is from its own (local) DB or can be fetched from remote / Studio if missing locally.

We never fall back to Studio if:

  1. The script is already running in Studio
  2. The dataset starts with local.local.*

I agree this whole function is way too complex and confusing and I will try to refactor it.
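The two rules above collapse into a single predicate, sketched here with illustrative names (the real check lives inside get_dataset_with_remote_fallback and its variable naming differs):

```python
def should_fallback_to_studio(is_cli: bool, namespace_name: str,
                              default_namespace: str = "local") -> bool:
    """Fall back to Studio only for CLI runs on a non-default namespace.

    Never fall back when:
      1. the script is already running in Studio (is_cli is False), or
      2. the dataset lives in the default local namespace (local.local.*).
    """
    is_local = (not is_cli) or namespace_name == default_namespace
    return not is_local
```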


I've added better comment for that variable and changed name. I think it should be clear now.

@ilongin ilongin requested a review from shcheklein July 17, 2025 13:11
@@ -164,21 +167,22 @@ def edit_dataset(
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if catalog.metastore.is_local_dataset(namespace_name):
+    if namespace_name != catalog.metastore.default_namespace_name:

let's introduce a descriptive var - studio_dataset = ...

to make the condition descriptive:

if studio_dataset:
    ...


(tbh I still don't love this very non-obvious way of detecting it by analyzing namespaces and comparing with the default; one needs to know a lot about namespaces to understand this code and why it is correct. It is not obvious.)


Added is_studio_dataset var as you suggested.

Regarding the second comment: the only way to fix this and make it explicit is to re-introduce a studio flag, but that's exactly what we tried to avoid in the first place with namespaces / projects, i.e. having those flags everywhere. We wanted to be able to decide implicitly whether some dataset is in Studio or local, and that's what this code does. The trade-off is that the code is less clean.

Contributor

I just wonder if we can use another approach for this, like setting an env variable or using one of the env variables we already have (like DATACHAIN_COMPUTE_CLUSTER_ID).

This is a general task: knowing whether a query is running from within Studio or the CLI. @dmpetrov was asking about this specifically, and it would be great to have something like a dc.is_studio flag/variable without instantiating the whole Catalog; Catalog itself could get this knowledge from the same variable seamlessly, without an additional constructor param.

Contributor Author

Yes, I was thinking about this as well. I didn't want to introduce a new env variable for now, as I didn't know we needed this in general as you mentioned, but if we do I can add it.

Contributor Author

Added an env variable called DATACHAIN_IS_STUDIO and a function to check it, e.g.

import datachain as dc
dc.is_studio()  # True or False

It's a function because that makes it easier to handle and to mock in tests.
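A minimal sketch of how such a helper might read the env variable. The variable name DATACHAIN_IS_STUDIO comes from this PR; the set of accepted truthy values here is an assumption:

```python
import os

def is_studio() -> bool:
    """Return True when running inside Studio, based on DATACHAIN_IS_STUDIO."""
    # Treat a few common truthy spellings as "Studio"; anything else is CLI.
    return os.environ.get("DATACHAIN_IS_STUDIO", "").strip().lower() in (
        "1",
        "true",
        "yes",
    )
```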

@ilongin ilongin requested a review from shcheklein August 20, 2025 11:29
@@ -164,21 +167,24 @@ def edit_dataset(
):
namespace_name, project_name, name = catalog.get_full_dataset_name(name)

if catalog.metastore.is_local_dataset(namespace_name):
# studio dataset is if it has non "local" namespace
is_studio_dataset = namespace_name != catalog.metastore.default_namespace_name
Member

what about a Studio dataset with the default global namespace?

Contributor Author

You mean whether this will still work correctly? It should, as it's different from local, which is what catalog.metastore.default_namespace_name holds.

Member

catalog.metastore.default_namespace_name - is it local even for Studio? How do we redefine it?

Contributor Author

@ilongin ilongin Aug 21, 2025

Ok, now I see this better. This is part of the code that will only run locally (CLI); in Studio we don't need the CLI.
Here we need to check whether the dataset the user wants to edit is a local dataset from the .datachain folder or a remote Studio dataset, which is the case only if the dataset's namespace differs from the default one, which is always local in a local / CLI run. I could have hardcoded local here instead of catalog.metastore.default_namespace_name as well.

Note: this code would break if someone ran the CLI in Studio - it would try to go to another Studio if the namespace differed from global (the default in Studio) ... but this code is never going to run from Studio in the first place - at least that was never the plan.

Member

can we put in an assert that it is not Studio (to document this and prevent it from running)?

Contributor Author

Added a check to disallow running the CLI from the Studio environment.

Member

Let’s use the helper I mentioned above

@@ -445,6 +449,7 @@ def test_read_dataset_remote_update_flag_no_version(


@skip_if_not_sqlite
@pytest.mark.parametrize("is_studio", (False,))
Member

how exactly does it work?

pytest.mark.parametrize("is_studio", (False,))

what do we need it for?

Contributor Author

By default, all tests run with the Catalog.is_studio flag set to True. This allows creating namespaces / projects etc.; "studio" datachain allows more things, so to avoid having to mock the flag to True all the time, it's set to True by default in tests.
We can also change that with this fixture - set it to False, or to both True and False, depending on the test's needs. This particular test is testing the CLI, so we need to set the flag to False.
The logic around that fixture is in conftest.py: the is_studio and mock_is_studio fixtures.

Member

Initial reaction - I'm lost :)

Second thought - two fixtures with a negated condition; there's gotta be a better / simpler way to do this that is clear when you read the code ...

Third thought (after reading conftest a bit):

  • why autouse for this fixture? can we just pass mock_is_studio where it is appropriate? That would be a straightforward way to use mocks ... am I missing something?
  • @skip_if_not_sqlite - can this one also define this ... or can it be replaced with a single decorator?
  • why do we need to set False (studio) explicitly - shouldn't it be the default if we already run on sqlite only?

Contributor Author

@ilongin ilongin Aug 21, 2025

Ok, so this is the problem:

  1. Default value for Catalog.is_studio is False
  2. There are a lot of scenarios, mostly related to having custom namespace / project in tests, for which we need Catalog.is_studio to be True
  3. There are rare scenarios / tests where we need Catalog.is_studio to be explicitly False (CLI methods testing)
  4. With 2) and 3) in mind, it's better to have the flag set to True by default and have a special fixture to set it to False on those rare tests where needed, than the other way around (we would need to repeat the fixture in a lot of tests just to set the flag to True)
  5. We don't want to go in the direction of splitting tests for CLI and Studio like we did with the @skip_if_not_sqlite fixture.

The solution was to set Catalog.is_studio to True by default so that we are not limited in creating custom namespaces / projects in tests. This is achieved as follows:
mock_is_studio is autoused and sets Catalog.is_studio to True, unless the other fixture, is_studio, is set to False - in that case we just yield, the flag is never set to True, and it stays False (the default). This way we can set Catalog.is_studio to False just by parametrizing the is_studio fixture to False on those rare tests that need this scenario (CLI method testing, as mentioned above).

I know this is a little bit complex, but at the time I didn't have a better idea. Maybe this will be simplified when I introduce the ENV variable to control is_studio, as Vlad suggested, but on first thought I don't think it will change much.
If someone has a better idea how to handle this, I'm open to suggestions ... this is really a general problem of having slightly different logic on Studio vs local while wanting non-separated tests for both.
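The pattern boils down to "force the flag to True unless the test opts out". A self-contained sketch of that mechanism, using a dummy Catalog stand-in; the real fixtures live in conftest.py and patch datachain's Catalog, so names and defaults here follow the comment while the rest is illustrative:

```python
from contextlib import nullcontext
from unittest.mock import patch

class Catalog:
    # Dummy stand-in for datachain's Catalog; the real default is also False.
    is_studio = False

def mock_is_studio(is_studio: bool = True):
    """Context manager mirroring the autouse fixture's behavior."""
    if is_studio:
        # Default case: most tests get Studio behavior (flag forced to True).
        return patch.object(Catalog, "is_studio", True)
    # Opt-out (the rare CLI tests): do nothing, leaving the default False.
    return nullcontext()
```

A test parametrized with is_studio=False then simply runs with the untouched default flag.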

Member

thanks @ilongin for a good explanation!

There are a lot of scenarios, mostly related to having custom namespace / project in tests, for which we need Catalog.is_studio to be True

do we need it primarily for tests that pull datasets from Studio one way or another? Or what are the other major examples?

if is_studio_dataset:
from datachain.studio import edit_studio_dataset

if Config().read().get("studio", {}).get("token"):
Member

unrelated: don't we have some util in StudioClient to do this? to get the token?

Contributor Author

Not sure, but we definitely need to refactor this file, as it duplicates these lines multiple times.
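A hypothetical helper that would deduplicate those lines. The config dict shape (as returned by Config().read()) comes from the snippet above; the helper name is an assumption:

```python
from typing import Optional

def get_studio_token(config: dict) -> Optional[str]:
    """Extract the Studio token from an already-read config dict, if any."""
    # Missing "studio" section or missing "token" key both yield None.
    return config.get("studio", {}).get("token")
```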

Member

@shcheklein shcheklein left a comment

Looks good overall .. some minor questions

# to fetch dataset with local.local namespace & project as that one cannot
# exist in Studio in the first place
no_fallback = (
is_studio() or namespace_name == self.metastore.default_namespace_name
Member

Let's please add a helper is_namespace_local with an assert that it is "local", with an explanation, and with this namespace check

Member

I mean to replace the second part only … to make it more explanatory
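A sketch of the suggested helper, covering the second half of the condition. The name and the "local" assert are taken from the review comment; the exact signature and docstring are assumptions:

```python
def is_namespace_local(namespace_name: str, default_namespace_name: str) -> bool:
    """True when the dataset lives in the default local namespace.

    Only meaningful in a CLI run, where the default namespace is always
    "local"; any other namespace implies a remote / Studio dataset.
    """
    # Document and enforce that this check only makes sense in a CLI run.
    assert default_namespace_name == "local", (
        "expected a CLI run, where the default namespace is 'local'"
    )
    return namespace_name == default_namespace_name
```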
