[Feature] Add producer-consumer pipeline for uploading folder #1671

Merged
wangxingjun778 merged 12 commits into modelscope:master from wangxingjun778:feat/upload_speedup on Apr 13, 2026

Conversation

@wangxingjun778 wangxingjun778 commented Apr 10, 2026

New Features & Enhancements

  • Resumable Upload Support: Refactored file upload process to allow interruptions and resumption.
  • Pipelined Batch Commits: Implemented pipelining for more efficient batch data commits.
  • Concurrent Upload Management: Introduced BatchTracker to handle concurrent uploads effectively.
  • Optimized Large File Hashing: Added compute_file_hash utility using async I/O for better performance with large files.
  • Persistent Hashing: Introduced UploadHashCache to maintain hash states persistently.
  • Progress Tracking: Implemented UploadCheckpoint to track upload progress accurately.
  • Robust Retry Logic: Updated upload_folder method with exponential backoff retries for improved reliability.
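The pipelining idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `run_pipeline`, `hash_bytes`, and the `commit_batch` callback are hypothetical names, the "upload" is whatever callable the caller passes in, and SHA-256 is an assumed hash choice. A producer thread hashes items and groups them into batches while the calling thread consumes and commits those batches, with a bounded queue providing backpressure.

```python
import hashlib
import queue
import threading
from typing import Callable, Iterable, List, Tuple

_SENTINEL = None  # pushed by the producer to signal that no more batches follow


def hash_bytes(data: bytes) -> str:
    """Hypothetical stand-in for a file-hash step (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def run_pipeline(
    items: Iterable[Tuple[str, bytes]],
    commit_batch: Callable[[List[Tuple[str, str]]], None],
    batch_size: int = 2,
    max_pending: int = 4,
) -> int:
    """Hash items in a producer thread while this thread commits batches.

    Returns the number of items whose batch was committed.
    """
    # Bounded queue: the producer blocks when the consumer falls behind.
    pending: "queue.Queue" = queue.Queue(maxsize=max_pending)

    def producer() -> None:
        batch: List[Tuple[str, str]] = []
        try:
            for path, data in items:
                batch.append((path, hash_bytes(data)))
                if len(batch) == batch_size:
                    pending.put(batch)
                    batch = []
            if batch:  # flush the final, possibly short, batch
                pending.put(batch)
        finally:
            pending.put(_SENTINEL)  # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    committed = 0
    while True:
        batch = pending.get()
        if batch is _SENTINEL:
            break
        commit_batch(batch)  # consumer side: e.g. one commit request per batch
        committed += len(batch)
    worker.join()
    return committed
```

With three items and `batch_size=2`, the callback receives one batch of two entries followed by one batch of one entry. A real implementation would also need a policy for what happens to queued batches when a commit fails, which this sketch leaves out.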

Code Quality & Maintenance

  • Improved Error Handling: Narrowed exception handling in retry logic to prevent masking unexpected errors (based on review feedback).
  • Configuration Cleanup: Removed redundant CSV field size limit configurations across multiple files.
  • Deprecation Management: Marked the legacy get_file_hash function as deprecated to encourage migration to the new utility.
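Two of the maintenance points above, narrowed exception handling in the retry logic and a deprecated legacy helper, can be illustrated with a short sketch. The names here (`retry_with_backoff`, `new_digest`, `old_digest`) and the choice of retryable exception types are assumptions made for illustration; they are not the signatures used in this PR.

```python
import hashlib
import time
import warnings
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying only on the listed exception types.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    Any exception not listed in `retryable` propagates immediately, so
    programming errors are not hidden behind retries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def new_digest(data: bytes) -> str:
    """Hypothetical replacement utility."""
    return hashlib.sha256(data).hexdigest()


def old_digest(data: bytes) -> str:
    """Hypothetical legacy helper kept for compatibility, flagged as deprecated."""
    warnings.warn(
        "old_digest is deprecated; use new_digest instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_digest(data)
```

Passing `sleep` as a parameter is a design choice that keeps the retry helper testable without real waiting. Production retry code often adds random jitter to the delay as well, which is omitted here.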

@wangxingjun778 wangxingjun778 changed the title from "Feat/upload speedup" to "[Feature] Add producer-consumer pipeline for uploading folder" on Apr 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the file upload process to support resumable uploads and pipelined batch commits. It introduces UploadHashCache for persistent hashing, UploadCheckpoint for progress tracking, and BatchTracker for concurrent upload management. The upload_folder method is updated with exponential backoff retries and a new compute_file_hash utility that optimizes large file hashing using async I/O. Review feedback recommends narrowing exception handling in the retry logic to avoid masking errors, removing redundant CSV field size limit configurations across files, and marking the legacy get_file_hash function as deprecated.
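To make the resumability idea in this summary concrete, here is a small sketch of a checkpoint that persists which files are already done, plus a chunked file hash that keeps memory use flat for large files. It is an assumption-laden illustration: `JsonCheckpoint` and `sha256_of_file` are hypothetical names, the JSON file layout is invented, and the hashing uses plain synchronous chunked reads rather than the async I/O the summary mentions. The actual `UploadCheckpoint`, `UploadHashCache`, and `compute_file_hash` in the PR may work quite differently.

```python
import hashlib
import json
import os
import tempfile
from typing import Iterable, List, Set


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class JsonCheckpoint:
    """Hypothetical checkpoint: a JSON list of names that finished uploading."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.done: Set[str] = set()
        if os.path.exists(path):
            # Resume: reload progress recorded by a previous, interrupted run.
            with open(path, "r", encoding="utf-8") as f:
                self.done = set(json.load(f))

    def mark_done(self, name: str) -> None:
        self.done.add(name)
        # Write to a temp file, then atomically replace the old checkpoint,
        # so a crash mid-write cannot leave a truncated file behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(sorted(self.done), f)
        os.replace(tmp, self.path)

    def remaining(self, names: Iterable[str]) -> List[str]:
        """Return the names that still need uploading, in input order."""
        return [n for n in names if n not in self.done]
```

A restarted run would construct `JsonCheckpoint` with the same path and upload only what `remaining(...)` returns. Keying the checkpoint on a content hash instead of a file name would additionally detect files that changed between runs.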

Comment thread modelscope/hub/api.py
Comment thread modelscope/msdatasets/download/dataset_builder.py
Comment thread modelscope/utils/file_utils.py
Comment thread modelscope/hub/constants.py
Comment thread modelscope/hub/constants.py
Comment thread modelscope/hub/upload_checkpoint.py Outdated
Comment thread modelscope/hub/api.py Outdated
@wangxingjun778 wangxingjun778 merged commit 25de84c into modelscope:master Apr 13, 2026
1 of 2 checks passed

3 participants