
add metax dockerfile and its requirements for ms-swift 4.2.x #1734

Merged
Jintao-Huang merged 4 commits into modelscope:master from WendaDeng:add-metax-4.2-dockerfiles on Jun 15, 2026
@WendaDeng WendaDeng commented Jun 10, 2026

Contributor

Add Metax Dockerfiles, the related build scripts, and requirements for ms-swift 4.2.x.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces Dockerfiles, build scripts, and documentation to support building MS-Swift v4.2.3 images for Metax accelerators, offering both a full build from a UBI9 base image and a faster build using a prebuilt Metax release image. The review feedback focuses on optimizing the Dockerfiles by consolidating multiple RUN instructions to reduce image layers, removing redundant package installations and debug statements, and fixing a potential build failure in Dockerfile.metax by properly initializing the cu-bridge build process with its pre_make script.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +95 to +104
RUN cd /tmp/ && \
export MACA_PATH=/opt/maca && \
curl -o ${CU_BRIDGE_VERSION}.zip -LsSf https://gitee.com/metax-maca/cu-bridge/repository/archive/${CU_BRIDGE_VERSION}.zip && \
unzip ${CU_BRIDGE_VERSION}.zip && \
mv cu-bridge-${CU_BRIDGE_VERSION} cu-bridge && \
chmod 755 cu-bridge -Rf && \
cd cu-bridge && \
mkdir build && cd build && \
cmake -DCMAKE_INSTALL_PREFIX=/opt/maca/tools/cu-bridge ../ && \
make && make install

high

The cu-bridge build process needs to be initialized by running its pre_make script. Without this initialization, the /root/cu-bridge/CUDA_DIR directory (referenced by CUDA_PATH on line 19) is never created. As a result, the symbolic link creation on line 160 (RUN ln -sf ${CUDA_PATH}/bin/nvcc ${CUDA_PATH}/bin/cucc) will fail because the destination directory does not exist.

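Additionally, the recursive flag should come before the mode, as in chmod -R 755 cu-bridge. Options placed after the operands only work where the tool permutes its arguments (as GNU chmod does), and the extra -f in -Rf only suppresses most error messages.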

RUN cd /tmp/ && \
    export MACA_PATH=/opt/maca && \
    curl -o ${CU_BRIDGE_VERSION}.zip -LsSf https://gitee.com/metax-maca/cu-bridge/repository/archive/${CU_BRIDGE_VERSION}.zip && \
    unzip ${CU_BRIDGE_VERSION}.zip && \
    mv cu-bridge-${CU_BRIDGE_VERSION} cu-bridge && \
    chmod -R 755 cu-bridge && \
    cd cu-bridge && \
    mkdir build && cd build && \
    cmake -DCMAKE_INSTALL_PREFIX=/opt/maca/tools/cu-bridge ../ && \
    make && make install && \
    /opt/maca/tools/cu-bridge/tools/pre_make
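
The failure this comment predicts can be reproduced outside Docker. The sketch below assumes nothing about cu-bridge itself: it uses a throwaway temporary directory as a stand-in for `${CUDA_PATH}` (the `cuda_path` variable, the `try_link` helper, and the directory layout are illustrative only). It shows that `ln -sf` fails while the link's parent directory is missing and succeeds once some earlier step has created it.

```shell
#!/bin/sh
# Stand-in for ${CUDA_PATH}; the real value comes from the Dockerfile.
work=$(mktemp -d)
cuda_path="$work/cu-bridge/CUDA_DIR"

# Hypothetical helper: runs the same command shape as the Dockerfile's
# symlink step and reports whether the link could be created.
try_link() {
    if ln -sf "$cuda_path/bin/nvcc" "$cuda_path/bin/cucc" 2>/dev/null; then
        echo "created"
    else
        echo "failed"
    fi
}

# The parent directory does not exist yet, so ln cannot create the link.
before=$(try_link)

# Simulate whichever step is responsible for creating the directory.
mkdir -p "$cuda_path/bin"
after=$(try_link)

echo "before: $before, after: $after"
rm -rf "$work"
```

The second attempt succeeds even though `nvcc` does not exist in the stand-in directory, because a symbolic link may point at a missing target; only the directory that holds the link has to exist.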

Comment thread docker/Metax/4.2/Dockerfile.metax Outdated
Comment on lines +54 to +93
RUN printf "[metax-centos]\n\
name=Maca Driver Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/metax-driver-centos-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/metax-driver-centos.repo

RUN dnf -y install python3-pip hostname && \
dnf clean all

RUN python3 -m pip install uv -i $UV_INDEX_URL --trusted-host ${UV_TRUSTED_INDEX_HOST} && \
uv venv /opt/venv --python=${PYTHON_VERSION}

RUN python3 --version && \
uv self version

RUN yum install -y \
unzip vim git openblas-devel make cmake \
ninja-build gcc g++ procps-ng \
libibverbs librdmacm libibumad \
&& yum clean all

RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
RUN git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git
RUN git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git
RUN git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git

# Step 1: install MACA SDK, Metax-Driver and cu-bridge
# Metax-Driver mainly contains vbios and kmd files, which are not needed in a container.
# Here we keep the mx-smi management tool. Kernel version mismatch errors are ignored.
RUN yum install -y metax-driver-${MACA_VERSION}* mxgvm && \
yum clean all && rm -rf /var/cache/yum /tmp/*

RUN printf "[maca-sdk]\n\
name=Maca Sdk Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/maca-sdk-rpm-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/maca-sdk-rpm.repo

RUN yum install -y maca_sdk-${MACA_VERSION}* && \
yum clean all && rm -rf /var/cache/yum /tmp/*

medium

To optimize the Docker image build time and reduce the number of layers, we can combine the repository configurations, package installations, and git clones into fewer RUN instructions.

By adding both repositories first, we can install all required system packages (including metax-driver and maca_sdk) in a single yum install command. This also allows us to install binutils and numactl-libs early, completely eliminating the redundant yum install step later in the file.

RUN printf "[metax-centos]\n\
name=Maca Driver Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/metax-driver-centos-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/metax-driver-centos.repo && \
    printf "[maca-sdk]\n\
name=Maca Sdk Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/maca-sdk-rpm-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/maca-sdk-rpm.repo

RUN yum install -y \
    python3-pip hostname \
    unzip vim git openblas-devel make cmake \
    ninja-build gcc g++ procps-ng \
    libibverbs librdmacm libibumad \
    binutils numactl-libs \
    metax-driver-${MACA_VERSION}* mxgvm \
    maca_sdk-${MACA_VERSION}* \
    && yum clean all && rm -rf /var/cache/yum /tmp/*

RUN python3 -m pip install uv -i $UV_INDEX_URL --trusted-host ${UV_TRUSTED_INDEX_HOST} && \
    uv venv /opt/venv --python=${PYTHON_VERSION}

RUN python3 --version && \
    uv self version

RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git && \
    git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git && \
    git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git && \
    git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
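
Each `RUN` instruction adds one image layer, so counting them before and after is a quick way to check what a consolidation buys. The following is a minimal sketch, assuming the two variants are saved as local files; the file names `before.Dockerfile` and `after.Dockerfile` and the helper `count_run` are hypothetical, and the fragments are trimmed to the `git clone` steps only.

```shell
#!/bin/sh
dir=$(mktemp -d)

# Fragment as submitted: one RUN per clone.
cat > "$dir/before.Dockerfile" <<'EOF'
RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
RUN git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git
RUN git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git
RUN git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
EOF

# Fragment as suggested: a single RUN chaining all four clones.
cat > "$dir/after.Dockerfile" <<'EOF'
RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git && \
    git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git && \
    git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git && \
    git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
EOF

# Count lines that start a RUN instruction; continuation lines do not match.
count_run() {
    grep -c '^RUN ' "$1"
}

before=$(count_run "$dir/before.Dockerfile")
after=$(count_run "$dir/after.Dockerfile")
echo "RUN instructions: $before -> $after"
rm -rf "$dir"
```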

Comment thread docker/Metax/4.2/Dockerfile.metax Outdated
Comment on lines +112 to +121
RUN yum install -y \
gcc \
binutils \
procps-ng \
libibverbs \
librdmacm \
libibumad \
openblas \
numactl-libs \
&& yum clean all && rm -rf /var/cache/yum /tmp/*

medium

This entire yum install block is redundant and can be safely removed because all of these packages (or their development headers/dependencies) are already installed in the combined initial setup step.

# System dependencies are already installed in the base setup step.

Comment on lines +128 to +133
RUN cd vllm && \
python3 use_existing_torch.py && \
uv pip install -r requirements/build/cuda.txt

RUN cd vllm && \
VLLM_TARGET_DEVICE=empty uv pip install -v . --no-build-isolation

medium

These two RUN blocks can be combined into a single RUN instruction to reduce the number of image layers and avoid redundant directory changes (cd vllm).

RUN cd vllm && \
    python3 use_existing_torch.py && \
    uv pip install -r requirements/build/cuda.txt && \
    VLLM_TARGET_DEVICE=empty uv pip install -v . --no-build-isolation
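
When several `RUN` blocks are merged by hand, a dropped trailing backslash is an easy mistake: a line that ends in a bare `&&` ends the instruction at that point, which breaks the build. The sketch below is one possible pre-build check, not part of this PR. The helper name `find_bare_and` is hypothetical, and the sample fragment deliberately omits the backslash on its second line.

```shell
#!/bin/sh
# Print the numbers of lines that end in "&&" with no trailing backslash.
find_bare_and() {
    awk '/&&[ \t]*$/ { print NR }' "$1"
}

tmp=$(mktemp)

# Sample fragment with a deliberately missing backslash on line 2.
cat > "$tmp" <<'EOF'
RUN cd vllm && \
    python3 use_existing_torch.py &&
    uv pip install -r requirements/build/cuda.txt && \
    VLLM_TARGET_DEVICE=empty uv pip install -v . --no-build-isolation
EOF

bad_lines=$(find_bare_and "$tmp")
echo "lines ending in a bare &&: $bad_lines"
rm -f "$tmp"
```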

Comment on lines +27 to +28
RUN echo $PATH
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

medium

The RUN echo $PATH debug statement is unnecessary for the production image build and adds an extra layer. It should be removed.

RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

Comment on lines +35 to +41
# Clone all GitHub sources while the external proxy is enabled.
RUN rm -rf /workspace/ms-swift /workspace/vLLM-metax /workspace/vllm /workspace/Megatron-LM

RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
RUN git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git
RUN git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git
RUN git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git

medium

We can combine the cleanup and the multiple git clone commands into a single RUN instruction to minimize the number of intermediate image layers.

RUN rm -rf /workspace/ms-swift /workspace/vLLM-metax /workspace/vllm /workspace/Megatron-LM && \
    git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git && \
    git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git && \
    git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git && \
    git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
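
Chaining with `&&` (rather than `;`) keeps the fail-fast behaviour of the separate `RUN` instructions: the chain stops at the first command that fails and returns its non-zero status. A small sketch of that short-circuit, using a hypothetical `step` function in place of `rm` and `git clone`:

```shell
#!/bin/sh
log=""

# Hypothetical stand-in for a build command: records its name and
# returns the exit status passed as the second argument.
step() {
    log="$log $1"
    return "$2"
}

# The second step fails, so the third never runs and the chain's exit
# status is non-zero, which is what makes a RUN instruction fail.
step cleanup 0 && step clone-a 1 && step clone-b 0
status=$?

echo "ran:$log (exit status $status)"
```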

Comment on lines +46 to +53
# Step 1: build original vLLM for torch setup
RUN cd vllm && \
python3 use_existing_torch.py && \
pip install -r requirements/build/cuda.txt

# Step 2: build vLLM with empty device to avoid CUDA dependency
RUN cd vllm && \
VLLM_TARGET_DEVICE=empty pip install -v . --no-build-isolation

medium

These two RUN blocks can be combined into a single RUN instruction to reduce the number of image layers and avoid redundant directory changes (cd vllm).

RUN cd vllm && \
    python3 use_existing_torch.py && \
    pip install -r requirements/build/cuda.txt && \
    VLLM_TARGET_DEVICE=empty pip install -v . --no-build-isolation
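
The `VLLM_TARGET_DEVICE=empty pip install ...` form sets the variable for that one command only, so merging it into a longer `&&` chain does not change what the other commands in the chain see. A minimal sketch of that scoping, with `sh -c` standing in for `pip` (an assumption made so the example runs anywhere):

```shell
#!/bin/sh
unset VLLM_TARGET_DEVICE

# The assignment prefix applies to this one command's environment only.
inside=$(VLLM_TARGET_DEVICE=empty sh -c 'echo "${VLLM_TARGET_DEVICE:-unset}"')

# A later command started from the same shell does not see it.
outside=$(sh -c 'echo "${VLLM_TARGET_DEVICE:-unset}"')

echo "inside: $inside, outside: $outside"
```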

@Jintao-Huang

Collaborator

Please run the lint test first with the following commands:

pip install pre-commit
pre-commit run --all-files

@Jintao-Huang Jintao-Huang merged commit 37eb369 into modelscope:master Jun 15, 2026
2 checks passed
Yunnglin added a commit that referenced this pull request Jun 26, 2026
* Upgrade numpy to 2.x for 1.38 Docker images

- Replace deprecated numpy aliases (np.math.ceil → math.ceil, np.Inf → np.inf)
- Upgrade Docker constraints: numpy>=2.0, cython>=3.0, remove scipy upper bound

* Add ipywidgets dependency to Docker images

* update docker

* fix

* fix

* fix

* fix cpu image

* fix(docker): force numpy>=2.0 after evalscope install

ms-opencompass pulls numpy<2.0, downgrading numpy from 2.x to 1.26.4.
Force reinstall numpy>=2.0 after pip install .[eval] to restore it.

* add metax dockerfile and its requirements for ms-swift 4.2.x (#1734)

* Update npu dockerfile (#1736)

* fix(docker): remove unrelated added docker files

---------

Co-authored-by: Jintao Huang <huangjintao.hjt@alibaba-inc.com>
Co-authored-by: dwd <dwd1044898101@gmail.com>
Co-authored-by: addsubmuldiv <zyh13227@163.com>

3 participants