
Conversation


@amritghimire amritghimire commented Jul 9, 2025

This adds support for the following commands:

cp

Local to local

datachain cp /tmp/test-2.py /tmp/test_folder/test-1.py
datachain cp -r /tmp/test_folder /tmp/testing_folder

Local to remote

datachain cp /tmp/test-2.py gs://amrit-datachain-test/playground/test-5.py
datachain cp /tmp/test-2.py gs://amrit-datachain-test/playground/test-6.py --studio-cloud-auth
datachain cp --recursive /tmp/test_folder gs://amrit-datachain-test/playground
datachain cp --recursive /tmp/test_folder gs://amrit-datachain-test/playground --studio-cloud-auth

Remote to local

datachain cp gs://amrit-datachain-test/playground/test-5.py /tmp/output_folder/
datachain cp gs://amrit-datachain-test/playground/test-5.py /tmp/output_folder/ --studio-cloud-auth
datachain cp -r gs://amrit-datachain-test/playground/test_folder /tmp/output_folder/ 

Remote to remote

datachain cp gs://amrit-datachain-test/playground/test_folder/test-3.py gs://amrit-datachain-test/playground/test_folder/test-4.py
datachain cp gs://amrit-datachain-test/playground/test_folder/test-3.py gs://amrit-datachain-test/playground/test_folder/test-4.py --studio-cloud-auth

mv

datachain mv gs://amrit-datachain-test/playground/test.py gs://amrit-datachain-test/playground-2/test.py
datachain mv gs://amrit-datachain-test/playground/test.py gs://amrit-datachain-test/playground-2/test.py --studio-cloud-auth
datachain mv gs://amrit-datachain-test/playground gs://amrit-datachain-test/playground-2  --recursive

rm

datachain rm gs://amrit-datachain-test/playground/test.py --recursive 
datachain rm gs://amrit-datachain-test/playground/test.py --studio-cloud-auth
datachain rm gs://amrit-datachain-test/playground --recursive 
datachain rm gs://amrit-datachain-test/playground --recursive --studio-cloud-auth

Please check the documentation for more details.


Summary by Sourcery

Add CLI support for managing remote storage through Studio by introducing storage subcommands and backend methods for file operations

New Features:

  • Add datachain storage rm, mv, and cp CLI commands for deleting, moving, and copying files via Studio
  • Introduce StudioClient methods for delete_storage_file, move_storage_file, copy_storage_file, batch_presigned_urls, download_url, and save_upload_log

Enhancements:

  • Refactor CLI command routing to dispatch storage operations and handle local vs Studio workflows
  • Add remote/storage utilities for uploading and downloading files through fsspec with Studio integration
  • Update mkdocs configuration to include new storage commands

Documentation:

  • Add user documentation for cp, mv, and rm storage commands in mkdocs

Tests:

  • Add functional tests for storage rm, mv, and cp covering local-to-local, local-to-remote, remote-to-local, and remote-to-remote scenarios

@amritghimire amritghimire self-assigned this Jul 9, 2025

sourcery-ai bot commented Jul 9, 2025

Reviewer's Guide

This PR adds Studio-backed storage management to the CLI by introducing new storage subcommands (cp, mv, rm), wiring them through the command handler with local fallback logic, extending the StudioClient with REST endpoints for file operations, implementing client-side upload/download logic, and delivering full documentation and tests.

Sequence diagram for Studio-backed storage cp command

sequenceDiagram
    actor User
    participant CLI
    participant StudioClient
    participant StorageBackend
    User->>CLI: datachain storage cp source_path destination_path
    CLI->>CLI: Determine Studio/local mode
    alt Studio mode
        CLI->>StudioClient: copy_storage_file(source_path, destination_path, recursive)
        StudioClient->>StorageBackend: POST /storages/files/cp
        StorageBackend-->>StudioClient: Copy result
        StudioClient-->>CLI: Response
        CLI-->>User: Success/failure message
    else Local mode
        CLI->>CLI: Perform local copy
        CLI-->>User: Success/failure message
    end

Sequence diagram for Studio-backed storage mv and rm commands

sequenceDiagram
    actor User
    participant CLI
    participant StudioClient
    participant StorageBackend
    User->>CLI: datachain storage mv path new_path
    CLI->>StudioClient: move_storage_file(path, new_path, recursive)
    StudioClient->>StorageBackend: POST /storages/files/mv
    StorageBackend-->>StudioClient: Move result
    StudioClient-->>CLI: Response
    CLI-->>User: Success/failure message
    User->>CLI: datachain storage rm path
    CLI->>StudioClient: delete_storage_file(path, recursive)
    StudioClient->>StorageBackend: DELETE /storages/files
    StorageBackend-->>StudioClient: Delete result
    StudioClient-->>CLI: Response
    CLI-->>User: Success/failure message

Class diagram for new and updated storage management classes

classDiagram
    class StudioClient {
        +delete_storage_file(path, recursive)
        +move_storage_file(path, new_path, recursive)
        +copy_storage_file(path, new_path, recursive)
        +batch_presigned_urls(destination_path, paths)
        +download_url(path)
        +save_upload_log(path, logs)
    }
    class storages {
        +get_studio_client(args)
        +upload_to_storage(args, local_fs)
        +download_from_storage(args, local_fs)
        +copy_inside_storage(args)
    }
    class CLI_Commands_Storages {
        +rm_storage(args)
        +mv_storage(args)
        +cp_storage(args)
    }
    StudioClient <.. CLI_Commands_Storages : uses
    storages <.. CLI_Commands_Storages : uses
    StudioClient <.. storages : uses

File-Level Changes

Add new storage subcommands and integrate into CLI
  • Define add_storage_parser to register the `storage cp|mv|rm` subcommands
  Files: src/datachain/cli/parser/studio.py, src/datachain/cli/parser/__init__.py

Implement CLI command handlers for storage operations
  • Create rm_storage, mv_storage, and cp_storage in a new commands module
  • Use get_studio_client and dispatch to remote or local logic based on source/dest protocols
  Files: src/datachain/cli/commands/storages.py

Extend StudioClient with REST methods for storage
  • Add delete_storage_file, move_storage_file, and copy_storage_file
  • Add batch presigned URL, download URL, and upload log endpoints
  Files: src/datachain/remote/studio.py

Add storage operations layer for upload/download and in-cloud copy
  • Implement upload_to_storage, download_from_storage, and copy_inside_storage
  • Handle presigned URL retrieval, multipart/form uploads, and streaming downloads
  Files: src/datachain/remote/storages.py

Update documentation and add functional tests
  • Update mkdocs.yml and add command docs for cp, mv, rm
  • Add comprehensive functional tests covering all storage scenarios
  Files: mkdocs.yml, docs/commands/cp.md, docs/commands/rm.md, docs/commands/mv.md, tests/func/test_storage_commands.py


@amritghimire amritghimire marked this pull request as draft July 9, 2025 12:26
@amritghimire amritghimire requested a review from Copilot July 9, 2025 12:26

cloudflare-workers-and-pages bot commented Jul 9, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: c0f84ea
Status: ✅  Deploy successful!
Preview URL: https://84b0e6cf.datachain-documentation.pages.dev
Branch Preview URL: https://amrit-storage-cli.datachain-documentation.pages.dev



@sourcery-ai sourcery-ai bot left a comment


Hey @amritghimire - I've reviewed your changes and they look great!

## Individual Comments

### Comment 1
<location> `src/datachain/remote/studio.py:541` </location>
<code_context>
+    def batch_presigned_urls(
+        self, destination_path: str, paths: dict[str, str]
+    ) -> Response[PresignedUrlData]:
+        remote = urlparse(os.fspath(destination_path)).scheme
+        client = Client.get_implementation(destination_path)
+        remote = client.protocol
+        bucket, _ = client.split_url(destination_path)
+
</code_context>

<issue_to_address>
Redundant assignment to 'remote' variable.

The initial assignment using urlparse is unnecessary since 'remote' is immediately set to client.protocol. Please remove the redundant line.
</issue_to_address>
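For illustration, the fix is simply to drop the dead first assignment; a minimal sketch based on the code context above:

```python
def batch_presigned_urls(
    self, destination_path: str, paths: dict[str, str]
) -> Response[PresignedUrlData]:
    # 'remote' comes straight from the client implementation; the
    # urlparse-based assignment was dead code and is removed.
    client = Client.get_implementation(destination_path)
    remote = client.protocol
    bucket, _ = client.split_url(destination_path)
    ...
```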

### Comment 2
<location> `src/datachain/cli/parser/studio.py:225` </location>
<code_context>
+        formatter_class=CustomHelpFormatter,
+    )
+
+    storage_cp_parser.add_argument(
+        "source_path",
+        action="store",
+        help="Path to the source file or directory to upload",
+    )
+
</code_context>

<issue_to_address>
Argument help text for 'source_path' and 'destination_path' may be misleading for copy operations.

Consider updating the help text to use 'copy' instead of 'upload' to better reflect all possible operations.

Suggested implementation:

```python
    storage_cp_parser.add_argument(
        "source_path",
        action="store",
        help="Path to the source file or directory to copy",
    )

```

If there is a `destination_path` argument defined in the same context, update its help text similarly, e.g.:
```python
help="Path to the destination file or directory to copy to"
```
</issue_to_address>

### Comment 3
<location> `src/datachain/cli/commands/storages.py:137` </location>
<code_context>
+            raise DataChainError(f"No presigned URL found for {dest_path}")
+
+        upload_url = urls[dest_path]["url"]
+        if "fields" in urls[dest_path]:
+            # S3 storage - use multipart form data upload
+
+            # Create form data
+            form_data = dict(urls[dest_path]["fields"])
+
+            # Add Content-Type if it's required by the policy
+            content_type = mimetypes.guess_type(source_path)[0]
+            if content_type:
+                form_data["Content-Type"] = content_type
+
+            # Add file content
+            file_content = local_fs.open(source_path, "rb").read()
+            form_data["file"] = (
+                os.path.basename(source_path),
+                file_content,
+                content_type,
+            )
+
+            # Upload using POST with form data
+            upload_response = requests.post(upload_url, files=form_data, timeout=3600)
+        else:
+            # Read the file content
</code_context>

<issue_to_address>
Multipart form data for S3 uploads may not be constructed correctly.

Separate form fields using the 'data' parameter and provide the file using the 'files' parameter in requests.post to ensure correct multipart upload to S3.
</issue_to_address>
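A minimal sketch of what the reviewer suggests, assuming a presigned-POST response shape like the one in the context (`upload_via_presigned_post` is a hypothetical helper name, not the PR's actual function):

```python
import mimetypes
import os

import requests


def upload_via_presigned_post(upload_url, fields, source_path, local_fs):
    """Upload a file to S3 via a presigned POST."""
    form_data = dict(fields)  # policy fields from the presigned response
    content_type = mimetypes.guess_type(source_path)[0]
    if content_type:
        form_data["Content-Type"] = content_type
    # Form fields go in `data`, the file itself goes in `files`; requests
    # encodes the `data` fields first, keeping the file part last as S3
    # presigned-POST policies expect.
    with local_fs.open(source_path, "rb") as f:
        return requests.post(
            upload_url,
            data=form_data,
            files={"file": (os.path.basename(source_path), f, content_type)},
            timeout=3600,
        )
```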

### Comment 4
<location> `src/datachain/cli/commands/storages.py:168` </location>
<code_context>
+                response.data.get("method", "PUT"),
+                upload_url,
+                data=file_content,
+                headers={
+                    **headers,
+                    "Content-Type": mimetypes.guess_type(source_path)[0],
+                },
+                timeout=3600,
</code_context>

<issue_to_address>
Setting 'Content-Type' header to None if mimetype is not detected.

Omitting the 'Content-Type' header when the mimetype is not detected would prevent potential issues with storage providers.
</issue_to_address>
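A hedged sketch of the suggested change, building the headers conditionally so a `Content-Type: None` is never sent (`build_upload_headers` is a hypothetical helper):

```python
import mimetypes


def build_upload_headers(source_path: str, base_headers: dict) -> dict:
    # Only add Content-Type when a mimetype was actually detected;
    # otherwise leave the header out entirely.
    headers = dict(base_headers)
    content_type = mimetypes.guess_type(source_path)[0]
    if content_type:
        headers["Content-Type"] = content_type
    return headers
```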

### Comment 5
<location> `src/datachain/cli/commands/storages.py:219` </location>
<code_context>
+    else:
+        destination_path = args.destination_path
+
+    with local_fs.open(destination_path, "wb") as f:
+        f.write(requests.get(url, timeout=3600).content)
+
+    print(f"Downloaded to {destination_path}")
</code_context>

<issue_to_address>
Downloading large files into memory before writing to disk may cause high memory usage.

Instead of reading the entire response into memory, use response.iter_content() to stream and write the file in chunks.
</issue_to_address>
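A minimal sketch of the streaming approach, reusing the `local_fs` and `url` names from the code context (the helper name is hypothetical):

```python
import requests


def download_to_file(url, destination_path, local_fs, chunk_size=1024 * 1024):
    # Stream the response to disk in chunks instead of buffering the
    # whole body in memory via `.content`.
    with requests.get(url, stream=True, timeout=3600) as response:
        response.raise_for_status()
        with local_fs.open(destination_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```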

### Comment 6
<location> `docs/commands/storage/rm.md:3` </location>
<code_context>
+# storage rm
+
+Delete files and directories in Storages using Studio.
+
+## Synopsis
</code_context>

<issue_to_address>
Change 'Storages' to 'storage' for grammatical correctness.

Use 'storage' instead of 'Storages' for correct grammar.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
Delete files and directories in Storages using Studio.
=======
Delete files and directories in storage using Studio.
>>>>>>> REPLACE

</suggested_fix>

### Comment 7
<location> `docs/commands/storage/mv.md:3` </location>
<code_context>
+# storage mv
+
+Move files and directories in Storages using Studio.
+
+## Synopsis
</code_context>

<issue_to_address>
Change 'Storages' to 'storage' for grammatical correctness.

Use 'storage' instead of 'Storages' for correct grammar in the description.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
Move files and directories in Storages using Studio.
=======
Move files and directories in storage using Studio.
>>>>>>> REPLACE

</suggested_fix>



@Copilot Copilot AI left a comment


Pull Request Overview

Adds support for managing files in remote storage via Studio through new CLI commands and backend methods.

  • Introduce delete_storage_file, move_storage_file, copy_storage_file, and related methods in StudioClient
  • Extend the CLI parser and process_storage_command to handle datachain storage rm|mv|cp
  • Add mkdocs entries and detailed documentation for the storage commands

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

File Description
src/datachain/remote/studio.py Add storage methods, import urlencode and Client
src/datachain/cli/parser/studio.py Define storage subcommands (rm, mv, cp)
src/datachain/cli/parser/__init__.py Register add_storage_parser
src/datachain/cli/commands/storages.py Implement handlers for rm, mv, cp operations
src/datachain/cli/__init__.py Wire storage command to process_storage_command
mkdocs.yml Add navigation entries for storage commands
docs/commands/storage/rm.md Add documentation for storage rm
docs/commands/storage/mv.md Add documentation for storage mv
docs/commands/storage/cp.md Add documentation for storage cp
Comments suppressed due to low confidence (6)

src/datachain/cli/parser/studio.py:240

  • [nitpick] The help text refers to 'Upload recursively' for the cp command; consider updating to 'Copy recursively' to accurately describe the operation.
        help="Upload recursively",

docs/commands/storage/rm.md:93

  • This note mentions 'Moving large directories' in the rm docs; it should say 'Deleting large directories' to match the command's behavior.
* Moving large directories may take time depending on the number of files and network conditions

src/datachain/remote/studio.py:541

  • The os module is used here but not imported in this file. Add import os at the top to avoid NameError.
        remote = urlparse(os.fspath(destination_path)).scheme

src/datachain/cli/commands/storages.py:55

  • This function doesn't return an exit code after successful deletion; consider returning 0 to indicate success for the CLI.
    print(f"Deleted {args.path}")

src/datachain/cli/commands/storages.py:149

  • [nitpick] Reading an entire file into memory can be inefficient for large files; consider streaming in chunks to reduce peak memory usage.
            file_content = local_fs.open(source_path, "rb").read()

docs/commands/storage/mv.md:13

  • There's an extra period after 'Studio'. Remove the duplicate '.' to fix the grammar.
This command moves files and directories within storage using the credentials configured in Studio.. The move operation is performed within the same bucket - you cannot move files between different buckets. The command supports both individual files and directories, with the `--recursive` flag required for moving directories.


codecov bot commented Jul 9, 2025

Codecov Report

❌ Patch coverage is 70.94017% with 102 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.32%. Comparing base (25e4408) to head (c0f84ea).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/datachain/cli/commands/storage/local.py 13.43% 58 Missing ⚠️
src/datachain/remote/storages.py 78.40% 10 Missing and 9 partials ⚠️
src/datachain/cli/commands/storage/studio.py 82.14% 5 Missing and 5 partials ⚠️
src/datachain/cli/commands/storage/utils.py 70.58% 3 Missing and 2 partials ⚠️
src/datachain/lib/file.py 42.85% 3 Missing and 1 partial ⚠️
src/datachain/remote/studio.py 91.11% 2 Missing and 2 partials ⚠️
src/datachain/cli/commands/storage/base.py 92.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1221      +/-   ##
==========================================
- Coverage   88.75%   88.32%   -0.43%     
==========================================
  Files         155      161       +6     
  Lines       14143    14479     +336     
  Branches     1999     2033      +34     
==========================================
+ Hits        12552    12788     +236     
- Misses       1124     1204      +80     
- Partials      467      487      +20     
Flag Coverage Δ
datachain 88.25% <70.94%> (-0.43%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
src/datachain/cli/__init__.py 64.46% <100.00%> (+5.58%) ⬆️
src/datachain/cli/commands/storage/__init__.py 100.00% <100.00%> (ø)
src/datachain/cli/parser/__init__.py 98.27% <100.00%> (-0.10%) ⬇️
src/datachain/cli/parser/studio.py 100.00% <100.00%> (ø)
src/datachain/lib/dc/datachain.py 91.50% <ø> (ø)
src/datachain/cli/commands/storage/base.py 92.00% <92.00%> (ø)
src/datachain/lib/file.py 88.01% <42.85%> (-0.65%) ⬇️
src/datachain/remote/studio.py 83.05% <91.11%> (+2.24%) ⬆️
src/datachain/cli/commands/storage/utils.py 70.58% <70.58%> (ø)
src/datachain/cli/commands/storage/studio.py 82.14% <82.14%> (ø)
... and 2 more

@shcheklein
Member

@amritghimire let's check first the existing APIs - datachain cp and datachain ls - they should be taking this role

@amritghimire
Contributor Author

@amritghimire let's check first the existing APIs - datachain cp and datachain ls - they should be taking this role

Yes, I am looking into that too. I wanted to implement the Studio-specific part first and then merge those.

@amritghimire
Contributor Author

@amritghimire let's check first the existing APIs - datachain cp and datachain ls - they should be taking this role

@shcheklein What syntax do you propose for handling both Studio and local with this approach?
I mean, how do we identify whether we need to use Studio or local?

@amritghimire
Contributor Author

@sourcery-ai review


@sourcery-ai sourcery-ai bot left a comment


Hey @amritghimire - I've reviewed your changes and they look great!

## Individual Comments

### Comment 1
<location> `src/datachain/remote/studio.py:540` </location>
<code_context>
+    def batch_presigned_urls(
+        self, destination_path: str, paths: dict[str, str]
+    ) -> Response[PresignedUrlData]:
+        remote = urlparse(os.fspath(destination_path)).scheme
+        client = Client.get_implementation(destination_path)
+        remote = client.protocol
+        bucket, _ = client.split_url(destination_path)
+
</code_context>

<issue_to_address>
Redundant assignment to 'remote' variable.

Remove the initial assignment to 'remote' using urlparse, as it is immediately overwritten by client.protocol.
</issue_to_address>

### Comment 2
<location> `src/datachain/remote/studio.py:556` </location>
<code_context>
+        )
+
+    def download_url(self, path: str) -> Response[FileUploadData]:
+        remote = urlparse(os.fspath(path)).scheme
+        client = Client.get_implementation(path)
+        remote = client.protocol
+        bucket, subpath = client.split_url(path)
+
</code_context>

<issue_to_address>
Redundant assignment to 'remote' variable.

The initial assignment using urlparse is unnecessary since 'remote' is immediately set to client.protocol. Please remove the redundant line.
</issue_to_address>

### Comment 3
<location> `src/datachain/remote/storages.py:145` </location>
<code_context>
+    content_type = mimetypes.guess_type(source_path)[0]
+    form_data["Content-Type"] = str(content_type)
+
+    file_content = local_fs.open(source_path, "rb").read()
+    form_data["file"] = (
+        os.path.basename(source_path),
</code_context>

<issue_to_address>
Reading entire file into memory may cause issues with large files.

Consider using a streaming upload approach to handle large files more efficiently, if supported by your backend and the requests library.
</issue_to_address>

### Comment 4
<location> `src/datachain/remote/storages.py:163` </location>
<code_context>
+    local_fs: "AbstractFileSystem",
+):
+    """Upload file using direct HTTP request."""
+    with local_fs.open(source_path, "rb") as f:
+        file_content = f.read()
+
+    return requests.request(
</code_context>

<issue_to_address>
Entire file is read into memory for direct uploads.

Consider using a file-like object or streaming upload to handle large files more efficiently.
</issue_to_address>
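For both of these comments, a sketch of a streaming upload: passing the open file object as `data` lets requests stream the body instead of materializing the whole file in memory (the helper name is hypothetical):

```python
import requests


def upload_file_streaming(method, upload_url, source_path, headers, local_fs):
    # requests streams file-like bodies; fsspec's open() returns a
    # seekable file object, so Content-Length can still be computed.
    with local_fs.open(source_path, "rb") as f:
        return requests.request(
            method, upload_url, data=f, headers=headers, timeout=3600
        )
```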

### Comment 5
<location> `docs/commands/storage/rm.md:93` </location>
<code_context>
+
+## Notes
+
+* Moving large directories may take time depending on the number of files and network conditions
+* Use the `--verbose` flag to get detailed information about the move operation
+* The `--quiet` flag suppresses output except for errors
</code_context>

<issue_to_address>
Note refers to 'Moving large directories' in the rm (remove) command.

Update the note to refer to deleting large directories instead of moving them, as this is more relevant to the 'rm' command.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
* Moving large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
=======
* Deleting large directories may take time depending on the number of files and network conditions
* Use the `--verbose` flag to get detailed information about the delete operation
>>>>>>> REPLACE

</suggested_fix>


@amritghimire
Contributor Author

@sourcery-ai guide

@amritghimire amritghimire marked this pull request as ready for review July 16, 2025 08:11
@amritghimire amritghimire requested a review from a team July 16, 2025 08:11

@sourcery-ai sourcery-ai bot left a comment


Hey @amritghimire - I've reviewed your changes - here's some feedback:

  • The new storage subcommands repeat very similar payload-building and request logic—consider extracting a shared helper in StudioClient to reduce duplication between delete, move, and copy methods.
  • I don’t see a handler mapping for the “cp” storage subcommand in handle_command; please verify that invoking “storage cp” correctly dispatches to cp_storage.
  • The mix of URL query parameters (in delete) and JSON POST bodies (in move/copy) could lead to inconsistent encoding—consider standardizing how requests are composed for all storage operations.

## Individual Comments

### Comment 1
<location> `src/datachain/cli/parser/studio.py:149` </location>
<code_context>
+        formatter_class=CustomHelpFormatter,
+    )
+
+    storage_cp_parser.add_argument(
+        "source_path",
+        action="store",
+        help="Path to the source file or directory to copy",
+    ).complete = shtab.DIR  # type: ignore[attr-defined]
+
+    storage_cp_parser.add_argument(
</code_context>

<issue_to_address>
Tab completion is only set for source_path, not destination_path.

Enable directory completion for destination_path to match source_path and enhance CLI consistency.
</issue_to_address>
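A sketch of the fix, mirroring the `source_path` registration from the code context above (it assumes the surrounding `storage_cp_parser` setup and `import shtab` shown there):

```python
storage_cp_parser.add_argument(
    "destination_path",
    action="store",
    help="Path to the destination file or directory to copy to",
).complete = shtab.DIR  # type: ignore[attr-defined]
```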

### Comment 2
<location> `src/datachain/remote/studio.py:489` </location>
<code_context>
+            "paths": subpath,
+        }
+
+        url = f"datachain/storages/files?{urlencode(data)}"
+
+        return self._send_request(url, data, method="DELETE")
+
+    def move_storage_file(
</code_context>

<issue_to_address>
DELETE request sends data in both query string and body.

Some servers may not support a body in DELETE requests. Please confirm backend compatibility or use only query parameters if possible.
</issue_to_address>
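One way to standardize, sketched against the code context above and assuming `_send_request` tolerates an empty payload: keep the parameters in the query string only and send no DELETE body.

```python
from urllib.parse import urlencode

# Parameters live only in the query string; no request body is sent, so
# servers that ignore DELETE bodies see exactly the same information.
url = f"datachain/storages/files?{urlencode(data)}"
return self._send_request(url, None, method="DELETE")
```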

### Comment 3
<location> `tests/func/test_storage_commands.py:106` </location>
<code_context>
+def test_cp_storage_local_to_s3(requests_mock, capsys, studio_token, tmp_dir):
</code_context>

<issue_to_address>
No test for upload failure or error handling.

Please add a test case that simulates a failed upload (e.g., 400 or 500 response) to verify correct error handling and user feedback.
</issue_to_address>

### Comment 4
<location> `tests/func/test_storage_commands.py:154` </location>
<code_context>
+    }
+
+
+def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
+    requests_mock.get(
+        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
+        json={
+            "url": "https://example.com/download",
+        },
+    )
+    requests_mock.get(
+        "https://example.com/download",
+        content=b"file1",
+    )
+
+    result = main(
+        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
+    )
+    assert result == 0
+    assert (tmp_dir / "file1.txt").read_text() == "file1"
+
+    history = requests_mock.request_history
</code_context>

<issue_to_address>
No test for download failure or missing URL.

Add tests for cases where the download_url endpoint returns an error or omits the 'url' field to verify error handling in download_from_storage.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={
            "url": "https://example.com/download",
        },
    )
    requests_mock.get(
        "https://example.com/download",
        content=b"file1",
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result == 0
    assert (tmp_dir / "file1.txt").read_text() == "file1"

    history = requests_mock.request_history
=======
def test_cp_remote_to_local(requests_mock, capsys, studio_token, tmp_dir):
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={
            "url": "https://example.com/download",
        },
    )
    requests_mock.get(
        "https://example.com/download",
        content=b"file1",
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result == 0
    assert (tmp_dir / "file1.txt").read_text() == "file1"

    history = requests_mock.request_history

def test_cp_remote_to_local_download_error(requests_mock, capsys, studio_token, tmp_dir):
    # Simulate error from download_url endpoint
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        status_code=500,
        json={"error": "Internal Server Error"},
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result != 0
    captured = capsys.readouterr()
    assert "Internal Server Error" in captured.err or "500" in captured.err

def test_cp_remote_to_local_missing_url(requests_mock, capsys, studio_token, tmp_dir):
    # Simulate missing 'url' in response
    requests_mock.get(
        f"{STUDIO_URL}/api/datachain/storages/files/download?bucket=my-bucket&remote=s3&filepath=data%2Fcontent&team=team_name&team_name=team_name",
        json={},
    )

    result = main(
        ["storage", "cp", "s3://my-bucket/data/content", str(tmp_dir / "file1.txt")]
    )
    assert result != 0
    captured = capsys.readouterr()
    assert "url" in captured.err or "No download URL" in captured.err
>>>>>>> REPLACE

</suggested_fix>

### Comment 5
<location> `tests/func/test_storage_commands.py:64` </location>
<code_context>
+        ),
+    ],
+)
+def test_mv_storage(requests_mock, capsys, studio_token, command, recursive, team):
+    requests_mock.post(
+        f"{STUDIO_URL}/api/datachain/storages/files/mv",
+        json={"ok": True, "data": {"moved": True}, "message": "", "status": 200},
+        status_code=200,
+    )
+
+    result = main(["storage", "mv", "s3://my-bucket/data/content", *command.split()])
+    assert result == 0
+    out, _ = capsys.readouterr()
+    assert "Moved s3://my-bucket/data/content to s3://my-bucket/data/content2" in out
+
+    assert requests_mock.called
+    assert requests_mock.last_request.json() == {
+        "bucket": "my-bucket",
+        "newPath": "data/content2",
</code_context>

<issue_to_address>
Test for move failure is missing.

Add a test case where the move endpoint returns a failure (e.g., ok: False or a 4xx/5xx status) to verify proper CLI error handling and user feedback.
</issue_to_address>
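A possible shape for such a test, following the success case shown in the context; the error payload and the exact CLI output are assumptions, not the PR's actual behavior:

```python
def test_mv_storage_failure(requests_mock, capsys, studio_token):
    # Simulate the backend rejecting the move with a 403.
    requests_mock.post(
        f"{STUDIO_URL}/api/datachain/storages/files/mv",
        json={"ok": False, "message": "Permission denied", "status": 403},
        status_code=403,
    )

    result = main(
        ["storage", "mv", "s3://my-bucket/data/content", "s3://my-bucket/data/content2"]
    )
    assert result != 0
    _, err = capsys.readouterr()
    assert "Permission denied" in err
```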



@shcheklein shcheklein left a comment


  • Review docs carefully and with attention

@amritghimire amritghimire requested a review from shcheklein July 21, 2025 08:36
@amritghimire amritghimire requested a review from a team August 1, 2025 18:35
@amritghimire
Contributor Author

@shcheklein PTAL

assert history[1].url == "https://example.com/download"


def test_cp_s3_to_s3(requests_mock, capsys, studio_token, tmp_dir):
Member

it is probably not a functional test

Member

(and review other tests here)

Contributor Author

Why not? From the perspective of datachain, it starts with the CLI invocation (main) and reaches Studio (a third party).

Member

Functional == actually doing some action (e.g. actually creating files). In this case everything is mocked; even the CLI is called via direct main, not a subprocess or anything.

(Usually, in most cases, if you have mocks you are not doing a functional test.)

Contributor Author

I am not sure I agree. If that's the case, every test we have that checks we make the correct API calls to a third party should be moved elsewhere. A quick search shows other tests like this: https://github.com/search?q=repo%3Aiterative%2Fdatachain+mock+path%3Atests%2Ffunc&type=code.

It is still functional because it is performing some function, i.e. calling the correct API, IMO.

Contributor Author

Moved the logic to the unit tests folder. We have a single test for each function.

I still don't believe this is a unit test. A unit test usually isolates and tests a single function or code segment. Here we are testing the complete E2E flow, from the entry point into our codebase to the end of the segment under test, while mocking the part where we go to a third party like Studio. To make this functional according to you, we would have to actually call the Studio API and test its response, i.e. we would need a Studio instance.

But since this PR has already been stuck for more than a month, I moved the tests to unit tests. We still have a few tests that use mocks in functional tests, like https://github.com/iterative/datachain/blob/main/tests/test_cli_e2e.py, and as seen in the above search. LMK if we want to move those to unit tests too. We can create a follow-up PR.

Member

I'm not sure we are testing much in this case, tbh ... we just test some implementation details (that a particular API is called with some specific arguments, right?). Those are not functional tests. We have actual cloud implementations, and we can use those to test things.

Contributor Author

Yes, this is to test that everything works with Studio-based credentials, which is why it checks that a particular API is called. If you have any idea on how to test that otherwise, please let me know.

Member

If we do functional tests, use actual files and storages. In Studio mode, generate actual signed URLs.

(Not saying we have to do this, but the current tests are not very meaningful to my mind.)

Contributor Author

use actual files and storages

We already have one with actual files and storages under test_cli_e2e, using actual S3 storage. This one is for Studio.

Member

We already have one for actual files and storages under test_cli_e2e with actual storage s3. This is one for studio.

Does it test what we need or some legacy stuff? Again, I don't think it is very valuable to see that the Studio API is called. Even if we keep it, let's at least keep one per command (simplify), and let's move them to unit tests.

Contributor Author

does it test what we need or some legacy stuff?

It tests what we want. It copies all subdirectories and the file tree from an S3 bucket to local and checks that the resulting tree is the correct one.

@amritghimire
Contributor Author

@shcheklein Would love your review

@shcheklein
Member

@amritghimire it is on my list, I'll get to it as soon as I can. Thanks!


## Description

This command copies files and directories between local and/or remote storage. It uses the credentials on your system by default, or it can use the cloud authentication from Studio.
Contributor

Probably I'm missing something, but what was the motivation to develop this on our own in the datachain CLI when there are already tools that do the same thing and are more specialized (e.g. https://github.com/rclone/rclone)?

To me this almost pollutes our CLI/API, as it adds a completely different domain that doesn't look related to datasets (and other related things) .. not to mention it's added at the root level of the CLI.

Contributor Author

I believe the main requirements are these two:

  1. To be able to track changes made from the CLI in the Studio file browser, as requested by the consumer
  2. To be able to use the credentials stored in Studio to perform file operations

For that, this is used.

@@ -2517,6 +2517,7 @@ def to_storage(
num_threads: Optional[int] = EXPORT_FILES_MAX_THREADS,
anon: Optional[bool] = None,
client_config: Optional[dict] = None,
relative_to: Optional[str] = None,
Contributor

I don't understand what this actually does. Also, it's missing from the docstring of this method.

Contributor Author

When you copy files from one storage to another using to_storage, there were two relevant values for placement: one is filename, another is fullpath. With filename, all files are dumped into the parent directory, i.e. the file tree is not maintained. With fullpath, the full path is also dumped. To implement cp, we needed to be able to maintain the tree relative to a certain path.

I.e. if we copy a file at directory/deep/nested/test.txt using `datachain cp -r directory/ destination/`, it should create deep/nested/ under destination and create the file there.

Member

Is it only about the single-file case? Not sure why the fullpath mode was not enough, tbh.

Contributor Author

Is it only about the single-file case? Not sure why the fullpath mode was not enough, tbh.

Not a single file, but a nested file tree.

Contributor Author

Is it only about the single-file case? Not sure why the fullpath mode was not enough, tbh.

Not a single file, but a nested file tree.

Will add an example demonstration tomorrow.

Contributor Author

Imagine a tree for the gs://amrit-datachain-test/playground/test_folder path as:

├── deep
│   └── nested
│       └── test.py
├── test_1.txt
├── test_2.txt
├── test-1.py
└── test-4.py

3 directories, 5 files

Now, if I copy the files to an output with the following code:

import datachain as dc
dc.read_storage("gs://amrit-datachain-test/playground/test_folder").to_storage("/tmp/copy_destination", placement="fullpath")

It gave the following tree:

/tmp/copy_destination
└── amrit-datachain-test
    └── playground
        └── test_folder
            ├── deep
            │   └── nested
            │       └── test.py
            ├── test_1.txt
            ├── test_2.txt
            ├── test-1.py
            └── test-4.py

6 directories, 5 files

What we wanted to achieve was:

import datachain as dc
dc.read_storage("gs://amrit-datachain-test/playground/test_folder").to_storage(
    "/tmp/copy_destination_normpath",
    placement="normpath",
    relative_to="playground/test_folder"
)

This gave the following output:

/tmp/copy_destination_normpath
├── deep
│   └── nested
│       └── test.py
├── test_1.txt
├── test_2.txt
├── test-1.py
└── test-4.py

3 directories, 5 files

cc. @ilongin @shcheklein

- Requires `--recursive` flag for directories

```bash
# Copy within same bucket
Contributor

Is it possible to copy from one remote (bucket or even remote type) to another one?

Contributor Author

Yes.


@Copilot Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

"team": self.team,
}

return self._send_request("datachain/storages/files/mv", data, method="POST")
Member

We should probably unify this with the ls command that we have for storages. It has a different API atm, AFAIU, from a different PR.

Contributor Author

We should probably unify this with ls command that we have for storages. It has a different API atm AFAIU from a different PR

This is for moving files; ls is for listing files. We don't have a separate ls endpoint or feature implemented in this PR.

"team": self.team,
}
return self._send_request(
"datachain/storages/batch-presigned-urls", data, method="POST"
Member

what scope btw is required for all these commands?

Contributor Author

STORAGES scope.



def upload_to_storage(
source_path: str,
Member

Can we operate with a File object here? Can we do the same for the destination?

Member

instead of creating new wrappers like UploadFileInfo, etc

Contributor Author

Not exactly at this moment, because it can be either a directory or a file. We form the chain using this information below.

destination_path: str,
team: Optional[str] = None,
recursive: bool = False,
local_fs: "AbstractFileSystem" = None,
Member

Can DataChain handle this internally via the File abstraction? We should make a maximum effort to use it across both scenarios, instead of manipulating low-level things like filesystems directly.

Contributor Author

The local filesystem here is used to identify whether it is a directory or not and to build the necessary file paths.

return file_paths


def download_from_storage(
Member

Can we use to_storage? Let's try to scope what it would take for it to work through Studio. Could it be, for example, some abstract Studio filesystem or something?

Contributor Author

No, again because we would need to pass the credentials stored in Studio, and that is not a good idea IMO. Adding a Studio filesystem is a whole other thing IMO.

update = True
relative_to = None

is_file = source_fs.isfile(source_path)
Member

Let's try to decide on this inside the function (e.g. if this is a directory, the path in the chain ends with / or we have many files for a single source; if this is a file, we can potentially also detect that).

Bottom line: DataChain should be providing proper abstractions/tools to work with this ...

Contributor Author

I agree with the bottom line, but this PR is already big enough; that can be a follow-up. For now, datachain doesn't have a way to copy a single file properly. For a file, we use file.save as we have in datachain, and for a folder, we use the to_storage path.

print(f"Deleted {self.path}")

def mv(self):
from datachain.client.fsspec import Client
Member

Can DataChain also be used here (e.g. to handle scale since we are saving files, properly parallelize, etc.) ... also to unify with the Studio code path?

Contributor Author

No, because we don't do moves with datachain, only copying files. For a move operation it is always simpler to use the filesystem, since it usually just changes the references underneath. The file itself will not be physically moved, so the operation will be almost instantaneous in the majority of local move cases.
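For illustration, a minimal sketch of such a filesystem-level move via fsspec (the helper name is hypothetical); on backends that support rename, no data is copied:

```python
import fsspec


def move_path(path: str, new_path: str, recursive: bool = False) -> None:
    # fsspec resolves the filesystem from the URL scheme; fs.mv is a
    # rename where the backend supports it, so the bytes are not copied.
    fs, _, _ = fsspec.get_fs_token_paths(path)
    fs.mv(path, new_path, recursive=recursive)
```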


@shcheklein shcheklein left a comment


This is amazing progress, @amritghimire ... in the right direction. I'm still hesitant to merge a few hundred lines of new code that add a few custom abstractions, two different paths with different implementations, plus quite a lot of low-level manipulation.

It might be inevitable, but I shared a few ideas in some comments to try. Let's please, please try to do this on top of the existing abstractions (improving them if needed) and also unify the code paths.

Thanks a lot for your patience here.

This adds support for the following commands:
```
usage: datachain storage cp [-h] [-v] [-q] [--recursive] [--team TEAM] source_path destination_path
```

```
usage: datachain storage mv [-h] [-v] [-q] [--recursive] [--team TEAM] path new_path
```

Please check the documentation for more details.

I am not sure about the `storage` CLI command, as it seems too long. At the
same time, we already have cp, which does something different.

Also, should we fall back to local creds and do something locally if
Studio auth is not available?

Make things much simpler

Fix mypy

Address comments

Fix lint

Add tests

Merge with top level cp

Update src/datachain/cli/__init__.py

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

Update test

Storage cp test fix

Reword something

Change all approach

Refactor everything and address comments

Check lint

Fix tests

Address comments

Use datachain locally

Cleanup

Address cosmetic changes

Remove get_studio_client

Move to unit test

Use async mapper

Use datachain

Update tests

Update cp to use datachain for local as well

Resurrect anon flag

Move args out

Update src/datachain/remote/storages.py

Typing

Update local.py

Fix test
@amritghimire
Contributor Author

@shcheklein Can you please take a final look and let me know what we want to change if anything?

@shcheklein
Member

@amritghimire yep, I remember about this one. It is probably the only PR where I'll be very conservative, so please be patient with me a bit :). As I mentioned, I'm not comfortable merging quite a lot of stuff where it seems DataChain core should be handling things. I'll check it once more ASAP. It just takes a bit more effort to give meaningful feedback, since it's not an easy one.
