Simplify permission checks for creating namespaces #1214
base: main
Conversation
Reviewer's Guide

Introduces a centralized Catalog.is_cli flag to consolidate environment context, removes deprecated metastore permission properties, refactors CLI commands and loader to route between local and Studio operations based on is_cli, updates save and creation workflows to enforce CLI restrictions, and aligns tests with the new mechanism.

Sequence diagram for dataset removal with new is_cli logic

```mermaid
sequenceDiagram
actor User
participant CLI
participant Catalog
participant Config
participant Studio
participant Metastore
User->>CLI: rm_dataset(...)
CLI->>Catalog: get_full_dataset_name(name)
CLI->>Catalog: is_cli
alt is_cli and studio
CLI->>Config: read studio token
alt token exists
CLI->>Studio: remove_studio_dataset(...)
else token missing
CLI->>CLI: raise DataChainError
end
else
CLI->>Metastore: get_project(...)
CLI->>Catalog: edit local dataset
end
```
Class diagram for Catalog and Metastore permission refactor

```mermaid
classDiagram
class Catalog {
- _is_cli: bool
+ is_cli: bool
}
class AbstractMetastore {
<<abstract>>
- Removed: is_studio: bool
- Removed: is_local_dataset(dataset_namespace: str): bool
- Removed: namespace_allowed_to_create: bool
- Removed: project_allowed_to_create: bool
}
Catalog --> AbstractMetastore : metastore
class SQLiteMetastore {
// No longer implements is_studio
}
AbstractMetastore <|-- SQLiteMetastore
```

Class diagram for loader and Catalog instantiation changes

```mermaid
classDiagram
class Loader {
+ get_catalog(...): Catalog
}
class Catalog {
+ is_cli: bool
}
Loader --> Catalog : returns
class SQLiteMetastore
Loader ..> SQLiteMetastore : uses for is_cli detection
```

Class diagram for namespace and project creation permission checks

```mermaid
classDiagram
class Session {
+ catalog: Catalog
}
class Catalog {
+ is_cli: bool
}
class Namespace {
+ validate_name(name)
}
class Project {
+ validate_name(name)
}
Session --> Catalog
Namespace ..> Session : uses
Project ..> Session : uses
Namespace ..> Catalog : checks is_cli for permission
Project ..> Catalog : checks is_cli for permission
```
Hey @ilongin - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/lib/namespaces.py:31` </location>
<code_context>
"""
session = Session.get(session)
- if not session.catalog.metastore.namespace_allowed_to_create:
+ if session.catalog.is_cli:
raise NamespaceCreateNotAllowedError("Creating namespace is not allowed")
</code_context>
<issue_to_address>
Logic for namespace creation restriction appears inverted.
This change may block namespace creation in CLI mode even when permitted. Please verify that this aligns with the intended behavior for both CLI and Studio environments.
</issue_to_address>
### Comment 2
<location> `src/datachain/lib/projects.py:35` </location>
<code_context>
"""
session = Session.get(session)
- if not session.catalog.metastore.project_allowed_to_create:
+ if session.catalog.is_cli:
raise ProjectCreateNotAllowedError("Creating project is not allowed")
</code_context>
<issue_to_address>
Project creation restriction logic may be reversed.
The updated condition blocks project creation in CLI mode, which may not be intended. Confirm that this matches the desired permission logic for CLI and Studio environments.
</issue_to_address>
### Comment 3
<location> `src/datachain/lib/dc/datasets.py:346` </location>
<code_context>
):
namespace_name, project_name, name = catalog.get_full_dataset_name(name)
- if not catalog.metastore.is_local_dataset(namespace_name) and studio:
+ if catalog.is_cli and studio:
+ # removing Studio dataset from CLI
from datachain.studio import remove_studio_dataset
</code_context>
<issue_to_address>
Dataset deletion logic now depends on CLI mode rather than dataset locality.
Please verify that this change aligns with the intended behavior and that all relevant scenarios are covered.
</issue_to_address>
### Comment 4
<location> `src/datachain/catalog/catalog.py:530` </location>
<code_context>
Callable[["AbstractWarehouse"], None]
] = None,
in_memory: bool = False,
+ is_cli: Optional[bool] = True,
):
datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
</code_context>
<issue_to_address>
Defaulting is_cli to True may not always reflect the actual environment.
This could cause incorrect behavior if Catalog is used outside a CLI context. Recommend setting is_cli explicitly where Catalog is instantiated or inferring it from the metastore type.
Suggested implementation:
```python
    in_memory: bool = False,
    is_cli: Optional[bool] = None,
):
    datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
    datachain_dir.init()
    ...
    self._warehouse_ready_callback = warehouse_ready_callback
    self.in_memory = in_memory
    # Infer is_cli from metastore type if not explicitly provided
    if is_cli is not None:
        self._is_cli = is_cli
    else:
        # Example inference: adjust this to match the actual metastore types
        self._is_cli = getattr(self.metastore, "is_cli", False)
```
- You will need to ensure that all instantiations of this class in your codebase are updated to explicitly pass `is_cli` if inference is not desired or not possible.
- Adjust the inference logic as needed to match your actual metastore implementation and types.
</issue_to_address>
### Comment 5
<location> `src/datachain/catalog/loader.py:140` </location>
<code_context>
we can provide DATACHAIN_METASTORE_ARG_TEAM_ID=12345 env variable.
"""
from datachain.catalog import Catalog
+ from datachain.data_storage.sqlite import SQLiteMetastore
+ metastore = get_metastore(in_memory=in_memory)
return Catalog(
- metastore=get_metastore(in_memory=in_memory),
+ metastore=metastore,
warehouse=get_warehouse(in_memory=in_memory),
client_config=client_config,
in_memory=in_memory,
+ is_cli=isinstance(metastore, SQLiteMetastore),
)
</code_context>
<issue_to_address>
CLI mode is inferred from metastore type, which may not be robust.
Relying on isinstance for CLI mode detection may break if new metastore types are added or if SQLiteMetastore is used outside CLI. Please use a more explicit check for CLI mode.
Suggested implementation:
```python
metastore = get_metastore(in_memory=in_memory)
# Explicitly determine CLI mode, e.g., via a function argument or environment variable
is_cli = bool(os.environ.get("DATACHAIN_CLI_MODE", False))
return Catalog(
    metastore=metastore,
    warehouse=get_warehouse(in_memory=in_memory),
    client_config=client_config,
    in_memory=in_memory,
    is_cli=is_cli,
)
```
- You will need to import `os` at the top of the file if it is not already imported.
- If CLI mode should be determined differently (e.g., via a function argument), adjust the assignment to `is_cli` accordingly and update the function signature and all call sites.
</issue_to_address>
### Comment 6
<location> `src/datachain/lib/dc/datachain.py:585` </location>
<code_context>
project = self.session.catalog.metastore.get_project(
project_name,
namespace_name,
- create=self.session.catalog.metastore.project_allowed_to_create,
+ create=not self.session.catalog.is_cli,
)
except ProjectNotFoundError as e:
</code_context>
<issue_to_address>
Project creation flag is now inverted based on CLI mode.
This change may prevent project creation in CLI mode, which differs from the previous behavior. Please verify if this aligns with the intended permissions.
</issue_to_address>
### Comment 7
<location> `tests/unit/lib/test_namespace.py:29` </location>
<code_context>
-@pytest.mark.parametrize("allow_create_namespace", [False])
+@pytest.mark.parametrize("is_cli", [True])
@skip_if_not_sqlite
-def test_create_by_user_not_allowed(test_session, allow_create_namespace):
+def test_create_by_user_not_allowed(test_session, is_cli):
with pytest.raises(NamespaceCreateNotAllowedError) as excinfo:
create_namespace("dev", session=test_session)
</code_context>
<issue_to_address>
Test for namespace creation denial is preserved and updated.
Consider adding a test for when 'is_cli' is False to verify that namespace creation is permitted in that case.
</issue_to_address>
### Comment 8
<location> `tests/unit/lib/test_project.py:65` </location>
<code_context>
)
-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
@skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>
<issue_to_address>
Test for project creation denial updated to use 'is_cli'.
Please also add a test for when 'is_cli' is False to confirm project creation is allowed in that scenario.
</issue_to_address>
### Comment 9
<location> `tests/unit/lib/test_datachain.py:3591` </location>
<code_context>
)
-@pytest.mark.parametrize("allow_create_project", [False])
+@pytest.mark.parametrize("is_cli", [True])
@skip_if_not_sqlite
-def test_save_create_project_not_allowed(test_session, allow_create_project):
</code_context>
<issue_to_address>
Test for project creation not allowed updated to use 'is_cli'.
Please add a test for when 'is_cli' is False to ensure both allowed and not allowed cases are covered.
Suggested implementation:
```python
@pytest.mark.parametrize("is_cli", [True, False])
@skip_if_not_sqlite
def test_save_create_project_not_allowed(test_session, is_cli):
    if is_cli:
        with pytest.raises(ProjectCreateNotAllowedError):
            dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
                "dev.numbers.fibonacci"
            )
    else:
        # Should succeed when project creation is allowed
        result = dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
            "dev.numbers.fibonacci"
        )
        assert result is not None
```
- Ensure that the `dc` object and the `save` method are correctly set up to respect the `is_cli` parameter in your actual implementation.
- Adjust the assertion for the allowed case (`is_cli=False`) if there is a more specific expected result than just `result is not None`.
</issue_to_address>
### Comment 10
<location> `tests/unit/lib/test_datachain.py:3226` </location>
<code_context>
@pytest.mark.parametrize("force", (True, False))
+@pytest.mark.parametrize("is_cli", (True,))
@skip_if_not_sqlite
def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):
</code_context>
<issue_to_address>
Studio dataset deletion tests parameterized with 'is_cli'.
Please add tests for 'is_cli=False' to cover the non-Studio deletion path as well.
Suggested implementation:
```python
@pytest.mark.parametrize("force", (True, False))
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio(test_session, studio_token, requests_mock, force):
```
```python
@pytest.mark.parametrize("is_cli", (True, False))
@skip_if_not_sqlite
def test_delete_dataset_from_studio_not_found(
    test_session, studio_token, requests_mock
```
</issue_to_address>
```
@@ -32,7 +32,7 @@ def create(
    """
    session = Session.get(session)

    if not session.catalog.metastore.project_allowed_to_create:
```
issue (bug_risk): Project creation restriction logic may be reversed.
The updated condition blocks project creation in CLI mode, which may not be intended. Confirm that this matches the desired permission logic for CLI and Studio environments.
src/datachain/catalog/catalog.py
Outdated
```
@@ -527,6 +527,7 @@ def __init__(
        Callable[["AbstractWarehouse"], None]
    ] = None,
    in_memory: bool = False,
+   is_cli: Optional[bool] = True,
```
suggestion (bug_risk): Defaulting is_cli to True may not always reflect the actual environment.
This could cause incorrect behavior if Catalog is used outside a CLI context. Recommend setting is_cli explicitly where Catalog is instantiated or inferring it from the metastore type.
Suggested implementation:
```python
    in_memory: bool = False,
    is_cli: Optional[bool] = None,
):
    datachain_dir = DataChainDir(cache=cache_dir, tmp=tmp_dir)
    datachain_dir.init()
    ...
    self._warehouse_ready_callback = warehouse_ready_callback
    self.in_memory = in_memory
    # Infer is_cli from metastore type if not explicitly provided
    if is_cli is not None:
        self._is_cli = is_cli
    else:
        # Example inference: adjust this to match the actual metastore types
        self._is_cli = getattr(self.metastore, "is_cli", False)
```
- You will need to ensure that all instantiations of this class in your codebase are updated to explicitly pass `is_cli` if inference is not desired or not possible.
- Adjust the inference logic as needed to match your actual metastore implementation and types.
src/datachain/catalog/loader.py
Outdated
```
+from datachain.data_storage.sqlite import SQLiteMetastore

+metastore = get_metastore(in_memory=in_memory)
 return Catalog(
-    metastore=get_metastore(in_memory=in_memory),
+    metastore=metastore,
     warehouse=get_warehouse(in_memory=in_memory),
     client_config=client_config,
     in_memory=in_memory,
+    is_cli=isinstance(metastore, SQLiteMetastore),
```
suggestion: CLI mode is inferred from metastore type, which may not be robust.
Relying on isinstance for CLI mode detection may break if new metastore types are added or if SQLiteMetastore is used outside CLI. Please use a more explicit check for CLI mode.
Suggested implementation:
```python
metastore = get_metastore(in_memory=in_memory)
# Explicitly determine CLI mode, e.g., via a function argument or environment variable
is_cli = bool(os.environ.get("DATACHAIN_CLI_MODE", False))
return Catalog(
    metastore=metastore,
    warehouse=get_warehouse(in_memory=in_memory),
    client_config=client_config,
    in_memory=in_memory,
    is_cli=is_cli,
)
```
- You will need to import `os` at the top of the file if it is not already imported.
- If CLI mode should be determined differently (e.g., via a function argument), adjust the assignment to `is_cli` accordingly and update the function signature and all call sites.
```
@skip_if_not_sqlite
def test_create_by_user_not_allowed(test_session, allow_create_namespace):
```
suggestion (testing): Test for namespace creation denial is preserved and updated.
Consider adding a test for when 'is_cli' is False to verify that namespace creation is permitted in that case.
```
@@ -62,7 +62,7 @@ def test_invalid_name(test_session, name):


@skip_if_not_sqlite
@pytest.mark.parametrize("allow_create_project", [False])
```
suggestion (testing): Test for project creation denial updated to use 'is_cli'.
Please also add a test for when 'is_cli' is False to confirm project creation is allowed in that scenario.
```
@@ -3588,9 +3585,9 @@ def _full_name(namespace, project, name) -> str:
)
```
suggestion (testing): Test for project creation not allowed updated to use 'is_cli'.
Please add a test for when 'is_cli' is False to ensure both allowed and not allowed cases are covered.
Suggested implementation:
```python
@pytest.mark.parametrize("is_cli", [True, False])
@skip_if_not_sqlite
def test_save_create_project_not_allowed(test_session, is_cli):
    if is_cli:
        with pytest.raises(ProjectCreateNotAllowedError):
            dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
                "dev.numbers.fibonacci"
            )
    else:
        # Should succeed when project creation is allowed
        result = dc.read_values(fib=[1, 1, 2, 3, 5, 8], session=test_session).save(
            "dev.numbers.fibonacci"
        )
        assert result is not None
```
- Ensure that the `dc` object and the `save` method are correctly set up to respect the `is_cli` parameter in your actual implementation.
- Adjust the assertion for the allowed case (`is_cli=False`) if there is a more specific expected result than just `result is not None`.
```python
token = Config().read().get("studio", {}).get("token")
if not token:
    raise DataChainError(
        "Not logged in to Studio. Log in with 'datachain auth login'."
    )
```
issue (code-quality): We've found these issues:
- Use named expression to simplify assignment and conditional (`use-named-expression`)
- Lift code into else after jump in control flow (`reintroduce-else`)
- Swap if/else branches (`swap-if-else-branches`)
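Taken together with the "extract a helper" suggestion in the overall feedback below, one plausible refactor of this repeated pattern is a small helper. The name `require_studio_token` and the import paths are assumptions; `Config` and `DataChainError` are the ones used in the snippet above:

```python
from datachain.config import Config  # assumed import path
from datachain.error import DataChainError  # assumed import path


def require_studio_token() -> str:
    # Named-expression form of the repeated "read token, raise if missing" pattern.
    if not (token := Config().read().get("studio", {}).get("token")):
        raise DataChainError(
            "Not logged in to Studio. Log in with 'datachain auth login'."
        )
    return token
```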
Deploying datachain-documentation with Cloudflare Pages

Latest commit: dcdf924
Status: ✅ Deploy successful!
Preview URL: https://182cfe3d.datachain-documentation.pages.dev
Branch Preview URL: https://ilongin-1208-simplify-permis.datachain-documentation.pages.dev
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1214      +/-   ##
==========================================
- Coverage   88.73%   88.73%   -0.01%
==========================================
  Files         155      155
  Lines       14138    14136       -2
  Branches     1997     1998       +1
==========================================
- Hits        12546    12544       -2
  Misses       1124     1124
  Partials      468      468
```
Flags with carried forward coverage won't be shown.
Hey @ilongin - I've reviewed your changes - here's some feedback:
- Consider defaulting Catalog.is_cli to False (and overriding it only for CLI contexts in the loader) so that non-SQLite/metastore use-cases aren’t erroneously treated as CLI by default.
- The repeated pattern of checking for a studio token and raising a DataChainError in CLI commands could be extracted into a helper to reduce duplication and improve readability.
- Add a docstring or brief comment on the new is_cli property in Catalog to clearly document its intended semantics (CLI vs Studio) for future maintainers.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider defaulting Catalog.is_cli to False (and overriding it only for CLI contexts in the loader) so that non-SQLite/metastore use-cases aren’t erroneously treated as CLI by default.
- The repeated pattern of checking for a studio token and raising a DataChainError in CLI commands could be extracted into a helper to reduce duplication and improve readability.
- Add a docstring or brief comment on the new is_cli property in Catalog to clearly document its intended semantics (CLI vs Studio) for future maintainers.
src/datachain/catalog/catalog.py
Outdated
```
@@ -1111,7 +1117,12 @@ def get_dataset_with_remote_fallback(
     if version:
         update = False

-    if self.metastore.is_local_dataset(namespace_name) or not update:
+    # local dataset is the one that is in Studio or in CLI but has default namespace
+    is_local = (
```
I'm not sure I understand this condition - how can we make it simpler, easier to read? (I'm also not sure I understand the comment above)
Before namespaces / projects we didn't have any kind of check here, and in Studio we would still try to fetch from remote (Studio) if missing locally, which was wrong (it would fail with a strange error like a missing token).
Now we can determine if the dataset we are fetching lives in its own (local) DB or can be fetched from remote / Studio if missing locally.
We never fall back to Studio if:
- the script is already running in Studio
- the dataset starts with `local.local.*`

I agree this whole function is way too complex and confusing and I will try to refactor it.
I've added a better comment for that variable and changed its name. I think it should be clear now.
```
@@ -164,21 +167,22 @@ def edit_dataset(
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if catalog.metastore.is_local_dataset(namespace_name):
+    if namespace_name != catalog.metastore.default_namespace_name:
```
let's introduce a descriptive var - `studio_dataset = ...` - to make the condition descriptive: `if studio_dataset: ...`
(tbh I still don't like a lot this very non-obvious way of detecting it by analyzing namespaces and comparing with the default - one needs to know a lot about namespaces to understand this code and why it is correct - it is not obvious)
Added an `is_studio_dataset` var as you suggested.
Regarding the second comment, the only way to fix this and make it explicit is to re-introduce the `studio` flag, but that's exactly what we tried to avoid in the first place with namespaces / projects - having those flags everywhere ... we wanted to be able to decide implicitly whether some dataset is in Studio or local, and that's what this code does .. the trade-off is that the code is less clean.
I just wonder if we can use another approach for this, like setting an env variable or using one of the env variables we already have (like `DATACHAIN_COMPUTE_CLUSTER_ID`).
This is because it is a general task: to know whether a query is running from within Studio or the CLI. @dmpetrov was asking about this specifically, and it would be great to have something like a `dc.is_studio` flag/variable without instantiating the whole `Catalog`, and `Catalog` itself could get this knowledge from the same variable seamlessly, without an additional param in the constructor.
Yes, I was thinking about this as well. Didn't want to introduce a new env variable for now as I didn't know we needed this in general, as you mentioned, but if we do I can add it as well.
Added an env variable called `DATACHAIN_IS_STUDIO` and a function to check it, e.g.

```python
import datachain as dc

dc.is_studio()  # True or False
```

It's a function as it's easier to handle and write tests (mock).
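A minimal sketch of what such a function could look like, assuming the flag is read directly from the environment (the actual parsing of truthy values in the PR may differ):

```python
import os


def is_studio() -> bool:
    # True when the process was started by Studio, which would set DATACHAIN_IS_STUDIO.
    return os.environ.get("DATACHAIN_IS_STUDIO", "").lower() in ("1", "true", "yes")
```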
```
@@ -164,21 +167,24 @@ def edit_dataset(
 ):
     namespace_name, project_name, name = catalog.get_full_dataset_name(name)

-    if catalog.metastore.is_local_dataset(namespace_name):
+    # studio dataset is if it has non "local" namespace
+    is_studio_dataset = namespace_name != catalog.metastore.default_namespace_name
```
how about a Studio dataset with the default `global` namespace?
You mean whether this will still work correctly? It should, as it's different from `local`, which is in `catalog.metastore.default_namespace_name`.
`catalog.metastore.default_namespace_name` - is it `local` even for Studio? how do we redefine it?
Ok, now I see this better. So this is part of the code that will only run locally (CLI). In Studio we don't need the CLI, right.
Here we need to check whether the dataset the user wants to edit is his local dataset from the `.datachain` folder or a remote Studio dataset, which happens only if the dataset's namespace is different from the default one, which is always `local` in a local / CLI run. I could have hardcoded `local` here instead of `catalog.metastore.default_namespace_name` as well.
Note: this code would break if someone ran the CLI in Studio - it would try to go to another Studio if the namespace is different from `global` (default in Studio) ... but this code is never going to run from Studio in the first place - at least that wasn't ever the plan.
can we put some assert that it is not Studio? (to document and prevent it from running)?
Added a check to not allow the CLI from the Studio environment.
Let’s use the helper I mentioned above
```
@@ -445,6 +449,7 @@ def test_read_dataset_remote_update_flag_no_version(


@skip_if_not_sqlite
@pytest.mark.parametrize("is_studio", (False,))
```
how does `pytest.mark.parametrize("is_studio", (False,))` exactly work? what do we need this for?
By default, all tests run with the `Catalog.is_studio` flag set to `True`. This is to allow creating namespaces / projects etc .. i.e. "studio" datachain allows more things, so to avoid the need to mock it explicitly to `True` all the time it's set to `True` by default in tests.
We can also change that with this fixture - set it to `False`, or to both `True` and `False`, depending on the test's needs. This particular test is testing the CLI so we need to set the flag to `False`.
The logic around that fixture is in conftest.py: the `is_studio` and `mock_is_studio` fixtures.
Initial reaction - I'm lost :)
Second thought - two fixtures with a negative condition; there's gotta be a better / simpler way to do this that is clear when you read the code ...
Third thought (after reading conftest a bit):
- why autouse for this fixture? can we just pass `mock_is_studio` where it is appropriate? That would be a straightforward way to use mocks ... am I missing something? `@skip_if_not_sqlite` - can this one also define this ... or can it be replaced with a single decorator?
- why do we need to set False (studio) explicitly - shouldn't it be the default if we run on sqlite only already?
Ok, so this is the problem:

1. The default value for `Catalog.is_studio` is `False`.
2. There are a lot of scenarios, mostly related to having a custom namespace / project in tests, for which we need `Catalog.is_studio` to be `True`.
3. There are rare scenarios / tests where we need `Catalog.is_studio` to be explicitly `False` (CLI methods testing).
4. With 2) and 3) in mind it's better to have it set to `True` by default and have a special fixture that sets it to `False` on those rare tests when needed, rather than the other way around (we would otherwise need to repeat a fixture in a lot of tests just to set that flag to `True`).
5. We don't want to go in the direction of splitting tests for CLI and Studio like we did with the `@skip_if_not_sqlite` fixture.

The solution was to set `Catalog.is_studio` to `True` by default so that we are not limited in terms of creating custom namespaces / projects in tests, and this is achieved as follows: `mock_is_studio` is autoused and sets `Catalog.is_studio` to `True` unless the value of the other fixture, `is_studio`, is set to `False` - then we just yield, which results in not setting the flag to `True`, so it stays `False` (the default). This way we can set `Catalog.is_studio` to `False` just by parametrizing the `is_studio` fixture to `False` on those rare tests where we need this scenario (CLI methods testing as mentioned above). See the sketch after this comment for the rough shape of these fixtures.

I know this is a little bit complex but at the time I didn't have a better idea. Maybe this will be simplified when I introduce an ENV variable to control this `is_studio` thing, as Vlad suggested .. but on first thought I don't think it will change much.

If someone has a better idea how to handle this I'm open to suggestions ... this is actually a general problem of having slightly different logic on Studio vs local and wanting to have non-separated tests for them both.
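For readers trying to follow the fixture mechanics described above, here is a rough sketch of how such a pair of fixtures could be wired in conftest.py (the patch target and defaults are assumptions, not a copy of the real conftest):

```python
from unittest.mock import patch

import pytest


@pytest.fixture
def is_studio():
    # Tests override this (e.g. via parametrize) to opt out of Studio mode.
    return True


@pytest.fixture(autouse=True)
def mock_is_studio(is_studio):
    if is_studio:
        # Default for tests: behave like Studio so namespaces / projects can be created.
        with patch("datachain.catalog.Catalog.is_studio", True):
            yield
    else:
        # Leave the real default (False) in place for CLI-oriented tests.
        yield
```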
thanks @ilongin for a good explanation!

> There are a lot of scenarios, mostly related to having custom namespace / project in tests, for which we need Catalog.is_studio to be True

do we need it primarily for tests that pull datasets from Studio one way or another? or what are the other major examples?
```python
if is_studio_dataset:
    from datachain.studio import edit_studio_dataset

    if Config().read().get("studio", {}).get("token"):
```
unrelated: don't we have some util in StudioClient to do this? to get the token?
Not sure but we def need to refactor this file as it duplicates these lines multiple times.
Looks good overall .. some minor questions
…creating-namespace
…creating-namespace
```python
# to fetch dataset with local.local namespace & project as that one cannot
# exist in Studio in the first place
no_fallback = (
    is_studio() or namespace_name == self.metastore.default_namespace_name
```
Let's please add a helper `is_namespace_local` with an assert that it is "local", with an explanation, and with this namespace check.
I mean instead of the second part only … to make it more explanatory.
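Something along these lines, perhaps; the method name follows the reviewer's wording, and where it should live (on `Catalog` or elsewhere) is left open:

```python
def is_namespace_local(self, namespace_name: str) -> bool:
    """True if the namespace can only exist in the local DB, i.e. a dataset
    under it cannot live in Studio, so no remote fallback makes sense."""
    # In a local / CLI run the default namespace is always "local".
    assert self.metastore.default_namespace_name == "local"
    return namespace_name == self.metastore.default_namespace_name
```

The fallback condition would then read something like `no_fallback = is_studio() or self.is_namespace_local(namespace_name)`.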
Trying to simplify the logic around permissions like "is creation of a namespace / project allowed or not" by lifting awareness of "is the process running in CLI or Studio" into the `Catalog` class. This allowed removing a couple of methods from the metastore:

- `is_studio`
- `is_local_dataset`
- `namespace_allowed_to_create`
- `project_allowed_to_create`
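A condensed, self-contained sketch of the overall shape of the change, using stand-in classes (the real `Catalog` and loader in the diff take many more arguments; the inference-from-metastore-type detail matches the revision discussed in the review above):

```python
class FakeSQLiteMetastore:
    """Stand-in for SQLiteMetastore (illustrative only)."""


class FakeCatalog:
    """Condensed stand-in for Catalog showing the new constructor flag."""

    def __init__(self, metastore, is_cli: bool = True):
        self.metastore = metastore
        self._is_cli = is_cli

    @property
    def is_cli(self) -> bool:
        # Single source of truth for "is this process a local CLI run or Studio?"
        return self._is_cli


def get_catalog(metastore) -> FakeCatalog:
    # Loader-style inference: a SQLite metastore implies a local (CLI) run.
    return FakeCatalog(metastore, is_cli=isinstance(metastore, FakeSQLiteMetastore))


if __name__ == "__main__":
    print(get_catalog(FakeSQLiteMetastore()).is_cli)  # True
    print(get_catalog(object()).is_cli)  # False
```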
Summary by Sourcery

Simplify permission checks by replacing metastore-based flags with a centralized `Catalog.is_cli` indicator, update CLI commands to use it for routing between local and Studio operations, refactor loader and save logic accordingly, and align tests with the new mechanism.

Enhancements:

- Introduce a `Catalog.is_cli` flag and remove deprecated metastore properties (`is_studio`, `is_local_dataset`, `namespace_allowed_to_create`, `project_allowed_to_create`).
- Refactor CLI commands (`rm_dataset`, `edit_dataset`, `delete_dataset`) to use `catalog.is_cli` for permission and routing logic.
- Infer `is_cli` based on metastore type and propagate it through the `Catalog` constructor.

Tests:

- Replace the `allow_create_project` and `allow_create_namespace` fixtures with a single `is_cli` fixture and mock `Catalog.is_cli`.
- Update tests to use `is_cli` instead of legacy allow-create flags.

Summary by Sourcery

Centralize environment awareness by adding `Catalog.is_cli` and removing legacy metastore flags, refactor related dataset, namespace, and project commands to use the new flag, and align tests with the updated permission mechanism.