[Fix] Fix split detection #1714

Merged: wangxingjun778 merged 3 commits into modelscope:master from wangxingjun778:fix/split_valid on May 13, 2026
Conversation

Member

@wangxingjun778 wangxingjun778 commented May 13, 2026

  • I/O-Free Split Discovery: Introduces a dry-run mechanism to extract dataset split names without performing actual I/O operations.
  • Dry-Run Download Manager: Implements a _DryRunDownloadManager stub to facilitate lightweight split detection.
  • Discovery Fallback Logic: Updates load_dataset and _validate_split_exists to utilize the dry-run method as a fallback for script-based datasets.
  • Cross-Platform Compatibility: Enhances environment support by utilizing os.devnull.
  • Extended Type Support: Adds support for set types within the download manager operations.
  • Improved Observability: Introduces debug logging to capture and track discovery errors during the dry-run process.
  • Robust Config Handling: Ensures the discovery process correctly handles list-based builder_configs.
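
The dry-run idea in the list above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `DryRunDownloadManager`, `ToySplit`, `ToyBuilder`, and `discover_split_names` are hypothetical names, and the method set on the stub simply mirrors the download-manager methods that dataset scripts commonly call. Converting sets to lists is a choice made for this sketch, since identical placeholders would collapse inside a set.

```python
import os


class DryRunDownloadManager:
    """Hypothetical stub: mirrors the shape of its input and touches no files."""

    def _placeholder(self, obj):
        # Recurse through containers so callers can still index or iterate the result.
        if isinstance(obj, dict):
            return {key: self._placeholder(value) for key, value in obj.items()}
        if isinstance(obj, (list, tuple, set)):
            converted = [self._placeholder(item) for item in obj]
            # Tuples stay tuples; lists and sets come back as lists (sketch-only choice).
            return tuple(converted) if isinstance(obj, tuple) else converted
        # os.devnull is a valid path on both POSIX and Windows.
        return os.devnull

    def download(self, url_or_urls):
        return self._placeholder(url_or_urls)

    def extract(self, path_or_paths):
        return self._placeholder(path_or_paths)

    def download_and_extract(self, url_or_urls):
        return self._placeholder(url_or_urls)

    def iter_archive(self, path):
        return iter(())  # nothing to yield in a dry run

    def iter_files(self, paths):
        return iter(())


class ToySplit:
    """Stand-in for a split generator object; only the name matters here."""

    def __init__(self, name):
        self.name = name


class ToyBuilder:
    """Stand-in for a script-based dataset builder (assumed interface)."""

    def _split_generators(self, dl_manager):
        # A real script would download here; the stub returns placeholders instead.
        dl_manager.download_and_extract({
            "train": "https://example.com/train.csv",
            "test": {"https://example.com/test.csv"},
        })
        return [ToySplit("train"), ToySplit("test")]


def discover_split_names(builder):
    """Collect split names by running the script against the stub."""
    generators = builder._split_generators(DryRunDownloadManager())
    return [str(generator.name) for generator in generators]
```

Calling `discover_split_names(ToyBuilder())` returns the split names declared by the script without any network or disk access, which is the property the PR description calls I/O-free discovery.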

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances split discovery for Hugging Face datasets by introducing a dry-run mechanism that extracts split names without performing actual I/O. It adds a _DryRunDownloadManager stub and updates _validate_split_exists and load_dataset to utilize this discovery method as a fallback for script-based datasets. The review feedback suggests enhancing cross-platform compatibility by using os.devnull, adding support for set types in the download manager, improving observability through debug logging of discovery errors, and ensuring robustness by handling list-based builder_configs.
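
As a rough illustration of the review points on fallback, debug logging, and list-based `builder_configs`, the sketch below shows one way such logic could be arranged. It does not reproduce `_validate_split_exists` or `load_dataset` from this PR: `config_names`, `discover_splits`, and `validate_split` are hypothetical helpers, and the two probes are assumed to be plain callables that return a list of split names or raise.

```python
import logging

logger = logging.getLogger(__name__)


def config_names(builder_configs):
    """Return config names from either a dict keyed by name or a list of config objects."""
    if isinstance(builder_configs, dict):
        return list(builder_configs)
    return [getattr(config, "name", str(config)) for config in (builder_configs or [])]


def discover_splits(primary, fallback):
    """Try the primary probe, then the dry-run fallback; None means 'could not tell'."""
    for label, probe in (("primary", primary), ("dry-run", fallback)):
        try:
            splits = probe()
        except Exception as exc:  # discovery is best effort, so never propagate
            logger.debug("%s split discovery failed: %s", label, exc)
            continue
        if splits:
            return list(splits)
    return None


def validate_split(split, available):
    """Raise only when the split list is known and the requested split is absent."""
    if available is not None and split not in available:
        raise ValueError(
            f"Unknown split {split!r}; available splits: {sorted(available)}"
        )
```

Returning `None` rather than an empty list when both probes fail lets the caller skip validation instead of rejecting a split that may well exist, and the failure reason stays visible at DEBUG level.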

Comment thread modelscope/msdatasets/utils/hf_datasets_util.py Outdated
Comment thread modelscope/msdatasets/utils/hf_datasets_util.py Outdated
Comment thread modelscope/msdatasets/utils/hf_datasets_util.py Outdated
Comment thread modelscope/msdatasets/utils/hf_datasets_util.py Outdated
@wangxingjun778 wangxingjun778 merged commit d968a2c into modelscope:master May 13, 2026
1 of 2 checks passed
2 participants