add metax dockerfile and its requirements for ms-swift 4.2.x #1734
Code Review
This pull request introduces Dockerfiles, build scripts, and documentation to support building MS-Swift v4.2.3 images for Metax accelerators, offering both a full build from a UBI9 base image and a faster build using a prebuilt Metax release image. The review feedback focuses on optimizing the Dockerfiles by consolidating multiple RUN instructions to reduce image layers, removing redundant package installations and debug statements, and fixing a potential build failure in Dockerfile.metax by properly initializing the cu-bridge build process with its pre_make script.
RUN cd /tmp/ && \
export MACA_PATH=/opt/maca && \
curl -o ${CU_BRIDGE_VERSION}.zip -LsSf https://gitee.com/metax-maca/cu-bridge/repository/archive/${CU_BRIDGE_VERSION}.zip && \
unzip ${CU_BRIDGE_VERSION}.zip && \
mv cu-bridge-${CU_BRIDGE_VERSION} cu-bridge && \
chmod 755 cu-bridge -Rf && \
cd cu-bridge && \
mkdir build && cd build && \
cmake -DCMAKE_INSTALL_PREFIX=/opt/maca/tools/cu-bridge ../ && \
make && make install
The cu-bridge build process needs to be initialized by running its pre_make script. Without this initialization, the /root/cu-bridge/CUDA_DIR directory (referenced by CUDA_PATH on line 19) is never created. As a result, the symbolic link creation on line 160 (RUN ln -sf ${CUDA_PATH}/bin/nvcc ${CUDA_PATH}/bin/cucc) will fail because the destination directory does not exist.
Additionally, we should use the standard -R flag for chmod instead of the non-standard -Rf syntax at the end of the command.
RUN cd /tmp/ && \
export MACA_PATH=/opt/maca && \
curl -o ${CU_BRIDGE_VERSION}.zip -LsSf https://gitee.com/metax-maca/cu-bridge/repository/archive/${CU_BRIDGE_VERSION}.zip && \
unzip ${CU_BRIDGE_VERSION}.zip && \
mv cu-bridge-${CU_BRIDGE_VERSION} cu-bridge && \
chmod -R 755 cu-bridge && \
cd cu-bridge && \
mkdir build && cd build && \
cmake -DCMAKE_INSTALL_PREFIX=/opt/maca/tools/cu-bridge ../ && \
make && make install && \
/opt/maca/tools/cu-bridge/tools/pre_make
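The failure mode described in this comment could also be made explicit with a guard around the symlink step. The following is only a minimal sketch: the helper name `link_cucc` is hypothetical and not part of the PR, and it assumes the caller passes the directory that `CUDA_PATH` points to.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): create the cucc -> nvcc symlink
# only when the bin directory exists, and fail with a clear message
# otherwise instead of a bare "No such file or directory" from ln.
link_cucc() {
    cuda_path="$1"
    if [ ! -d "${cuda_path}/bin" ]; then
        echo "error: ${cuda_path}/bin does not exist; was pre_make run?" >&2
        return 1
    fi
    # Same link as in the Dockerfile: cucc points at nvcc.
    ln -sf "${cuda_path}/bin/nvcc" "${cuda_path}/bin/cucc"
}
```

In a Dockerfile this would be called as `link_cucc "${CUDA_PATH}"` inside a RUN step, so a missing directory stops the build with a message that names the likely cause.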
RUN printf "[metax-centos]\n\
name=Maca Driver Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/metax-driver-centos-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/metax-driver-centos.repo

RUN dnf -y install python3-pip hostname && \
dnf clean all

RUN python3 -m pip install uv -i $UV_INDEX_URL --trusted-host ${UV_TRUSTED_INDEX_HOST} && \
uv venv /opt/venv --python=${PYTHON_VERSION}

RUN python3 --version && \
uv self version

RUN yum install -y \
unzip vim git openblas-devel make cmake \
ninja-build gcc g++ procps-ng \
libibverbs librdmacm libibumad \
&& yum clean all

RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
RUN git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git
RUN git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git
RUN git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git

# Step 1: install MACA SDK, Metax-Driver and cu-bridge
# Metax-Driver mainly contains vbios and kmd files, which are not needed in a container.
# Here we keep the mx-smi management tool. Kernel version mismatch errors are ignored.
RUN yum install -y metax-driver-${MACA_VERSION}* mxgvm && \
yum clean all && rm -rf /var/cache/yum /tmp/*

RUN printf "[maca-sdk]\n\
name=Maca Sdk Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/maca-sdk-rpm-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/maca-sdk-rpm.repo

RUN yum install -y maca_sdk-${MACA_VERSION}* && \
yum clean all && rm -rf /var/cache/yum /tmp/*
To optimize the Docker image build time and reduce the number of layers, we can combine the repository configurations, package installations, and git clones into fewer RUN instructions.
By adding both repositories first, we can install all required system packages (including metax-driver and maca_sdk) in a single yum install command. This also allows us to install binutils and numactl-libs early, completely eliminating the redundant yum install step later in the file.
RUN printf "[metax-centos]\n\
name=Maca Driver Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/metax-driver-centos-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/metax-driver-centos.repo && \
printf "[maca-sdk]\n\
name=Maca Sdk Yum Repository\n\
baseurl=https://repos.metax-tech.com/r/maca-sdk-rpm-$(uname -m)/\n\
enabled=1\n\
gpgcheck=0" > /etc/yum.repos.d/maca-sdk-rpm.repo
RUN yum install -y \
python3-pip hostname \
unzip vim git openblas-devel make cmake \
ninja-build gcc g++ procps-ng \
libibverbs librdmacm libibumad \
binutils numactl-libs \
metax-driver-${MACA_VERSION}* mxgvm \
maca_sdk-${MACA_VERSION}* \
&& yum clean all && rm -rf /var/cache/yum /tmp/*
RUN python3 -m pip install uv -i $UV_INDEX_URL --trusted-host ${UV_TRUSTED_INDEX_HOST} && \
uv venv /opt/venv --python=${PYTHON_VERSION}
RUN python3 --version && \
uv self version
RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git && \
git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git && \
git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git && \
git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
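The two repository definitions in this suggestion differ only in their id, display name, and base URL, so they could be generated by one small function. This is a sketch under stated assumptions: `write_repo_file`, `REPO_DIR`, and `ARCH` are hypothetical names introduced here, and `REPO_DIR` defaults to a scratch directory so the snippet can be tried outside a container (in the image it would be `/etc/yum.repos.d`).

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): write one yum .repo file.
# Arguments: repo id, display name, base URL, destination file.
write_repo_file() {
    repo_id="$1"; repo_name="$2"; base_url="$3"; dest="$4"
    cat > "$dest" <<EOF
[${repo_id}]
name=${repo_name}
baseurl=${base_url}
enabled=1
gpgcheck=0
EOF
}

# Scratch directory by default; set REPO_DIR=/etc/yum.repos.d in an image.
REPO_DIR="${REPO_DIR:-$(mktemp -d)}"
ARCH="$(uname -m)"

# The same two repositories as in the suggestion above.
write_repo_file metax-centos "Maca Driver Yum Repository" \
    "https://repos.metax-tech.com/r/metax-driver-centos-${ARCH}/" \
    "${REPO_DIR}/metax-driver-centos.repo"
write_repo_file maca-sdk "Maca Sdk Yum Repository" \
    "https://repos.metax-tech.com/r/maca-sdk-rpm-${ARCH}/" \
    "${REPO_DIR}/maca-sdk-rpm.repo"
```

The `gpgcheck=0` setting is carried over unchanged from the Dockerfile; it disables package signature verification for these repositories.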
RUN yum install -y \
gcc \
binutils \
procps-ng \
libibverbs \
librdmacm \
libibumad \
openblas \
numactl-libs \
&& yum clean all && rm -rf /var/cache/yum /tmp/*
RUN cd vllm && \
python3 use_existing_torch.py && \
uv pip install -r requirements/build/cuda.txt

RUN cd vllm && \
VLLM_TARGET_DEVICE=empty uv pip install -v . --no-build-isolation
These two RUN blocks can be combined into a single RUN instruction to reduce the number of image layers and avoid redundant directory changes (cd vllm).
RUN cd vllm && \
python3 use_existing_torch.py && \
uv pip install -r requirements/build/cuda.txt && \
VLLM_TARGET_DEVICE=empty uv pip install -v . --no-build-isolation
RUN echo $PATH
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Clone all GitHub sources while the external proxy is enabled.
RUN rm -rf /workspace/ms-swift /workspace/vLLM-metax /workspace/vllm /workspace/Megatron-LM

RUN git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git
RUN git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git
RUN git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git
RUN git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
We can combine the cleanup and the multiple git clone commands into a single RUN instruction to minimize the number of intermediate image layers.
RUN rm -rf /workspace/ms-swift /workspace/vLLM-metax /workspace/vllm /workspace/Megatron-LM && \
git clone --depth 1 --branch ${SWIFT_VERSION} https://github.com/modelscope/ms-swift.git && \
git clone --depth 1 --branch ${VLLM_METAX_VERSION} https://github.com/MetaX-MACA/vLLM-metax.git && \
git clone --depth 1 --branch ${VLLM_VERSION} https://github.com/vllm-project/vllm.git && \
git clone --depth 1 --branch ${MEGATRON_VERSION} https://github.com/NVIDIA/Megatron-LM.git
# Step 1: build original vLLM for torch setup
RUN cd vllm && \
python3 use_existing_torch.py && \
pip install -r requirements/build/cuda.txt

# Step 2: build vLLM with empty device to avoid CUDA dependency
RUN cd vllm && \
VLLM_TARGET_DEVICE=empty pip install -v . --no-build-isolation
These two RUN blocks can be combined into a single RUN instruction to reduce the number of image layers and avoid redundant directory changes (cd vllm).
RUN cd vllm && \
python3 use_existing_torch.py && \
pip install -r requirements/build/cuda.txt && \
VLLM_TARGET_DEVICE=empty pip install -v . --no-build-isolation
Please run the lint test first with the following commands:
pip install pre-commit
pre-commit run --all-files
* Upgrade numpy to 2.x for 1.38 Docker images
  - Replace deprecated numpy aliases (np.math.ceil → math.ceil, np.Inf → np.inf)
  - Upgrade Docker constraints: numpy>=2.0, cython>=3.0, remove scipy upper bound
* Add ipywidgets dependency to Docker images
* update docker
* fix
* fix
* fix
* fix cpu image
* fix(docker): force numpy>=2.0 after evalscope install
  ms-opencompass pulls numpy<2.0, downgrading numpy from 2.x to 1.26.4. Force reinstall numpy>=2.0 after pip install .[eval] to restore it.
* add metax dockerfile and its requirements for ms-swift 4.2.x (#1734)
* Update npu dockerfile (#1736)
* fix(docker): remove unrelated added docker files
---------
Co-authored-by: Jintao Huang <huangjintao.hjt@alibaba-inc.com>
Co-authored-by: dwd <dwd1044898101@gmail.com>
Co-authored-by: addsubmuldiv <zyh13227@163.com>
add metax dockerfiles and related build scripts and requirements for ms-swift 4.2.x