Question
Why is `sudo apt upgrade` on the host OS required to make CUDA work in a Docker container? The problem does not occur without Docker; it occurs only when the Docker image is recreated.
Environment
- Ubuntu 22.04 LTS
- Docker version 26.0.1, build d260a54
Dockerfile
```dockerfile
#--------------------------------------------------------------------------------
# Dockerfile to build the base image with requirements and models downloaded.
#
# CUDA 11.7 and PyTorch 1.13.1 due to the Deepdoctection requirements.
# https://github.com/deepdoctection/deepdoctection#requirements
# The PyTorch version that satisfies 1.12 <= PyTorch < 2.0 is 1.13.1.
# https://pytorch.org/get-started/previous-versions/#v1130
#--------------------------------------------------------------------------------
FROM nvidia/cuda:11.7.1-devel-ubuntu22.04

# Create working directory
WORKDIR /home/eml

# Copy under code/python
COPY . .

# Note: every RUN command will create an image layer, increasing the image size.
#--------------------------------------------------------------------------------
# Ubuntu libs and Timezone (https://serverfault.com/q/949991).
# [deepdoctection dependency]
# - poppler
#   https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler
# - tesseract-ocr
# - qpdf for encrypted PDF. See AIML-130.
#--------------------------------------------------------------------------------
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Australia/Sydney
RUN apt -y update && \
    apt install -y tzdata \
        software-properties-common git cmake wget pkg-config tree ffmpeg libsm6 libxext6 \
        tesseract-ocr libtesseract-dev tesseract-ocr-eng poppler-utils qpdf jq gpustat \
    || exit 1

#--------------------------------------------------------------------------------
# Py3.10 libs
# https://launchpad.net/~deadsnakes/+archive/ubuntu/ppa
# https://askubuntu.com/a/1398569
# https://www.youtube.com/watch?v=Xe40amojaXE
#--------------------------------------------------------------------------------
RUN add-apt-repository --yes ppa:deadsnakes/ppa && \
    apt install -y python3.10 python3-pip build-essential libssl-dev libffi-dev python3-venv \
    || exit 1

#--------------------------------------------------------------------------------
# PyTorch/CUDA
# https://pytorch.org/get-started/previous-versions/#linux-and-windows-9
#--------------------------------------------------------------------------------
RUN pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
    --extra-index-url https://download.pytorch.org/whl/cu117

#--------------------------------------------------------------------------------
# Group/User
#--------------------------------------------------------------------------------
#RUN groupadd -g 2000 eml && \
#    useradd -rm -d /home/eml -s /bin/bash -g eml -u 2001 eml && \
#    chown -R eml:eml /home/eml

#--------------------------------------------------------------------------------
# Non-root user
# Causes issues, e.g.
# - mounted volume access check with os/pathlib does not work.
# - torch.cuda.is_available() becomes False.
# Need to research how to use a non-root user with file permissions, and GPU
# access with a non-root docker user.
#--------------------------------------------------------------------------------
USER eml
ENV PATH="${PATH}:${HOME}/.local/bin"

#--------------------------------------------------------------------------------
# Packages
#--------------------------------------------------------------------------------
RUN pip install -r ./requirements.txt && \
    python3 -m spacy download en_core_web_trf && \
    python3 -m nltk.downloader words && \
    python3 -m nltk.downloader wordnet && \
    huggingface-cli download sentence-transformers/gtr-t5-large \
    || exit 1

#--------------------------------------------------------------------------------
# Run the application
# https://stackoverflow.com/a/46245972/4281353
# > if you have a docker image where your script is the ENTRYPOINT, any arguments
# > you pass to the docker run command will be added to the entrypoint.
# > ```
# > docker run --rm <yourImageName> -a API_KEY -f FILENAME -o ORG_ID
# > ```
#--------------------------------------------------------------------------------
# Executable to run by this container is always Python3
ENTRYPOINT ["python3"]
```
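Since the `ENTRYPOINT` is `python3`, anything after the image name on the `docker run` command line becomes the Python argument list, which makes a one-line CUDA smoke test easy. A minimal sketch of the argument-append behaviour (the `eml-base` image name and the helper function are illustrative only, not part of the Dockerfile above):

```python
# Sketch: with the exec-form ENTRYPOINT, Docker appends the `docker run`
# arguments after the entrypoint, so this image always executes
# `python3 <whatever follows the image name>`.
def container_command(entrypoint, run_args):
    """Return the argv the container executes: ENTRYPOINT + run arguments."""
    return entrypoint + run_args

# `docker run --rm --gpus all eml-base -c "import torch; print(torch.cuda.is_available())"`
# (eml-base is a placeholder image name) would execute:
print(container_command(
    ["python3"],
    ["-c", "import torch; print(torch.cuda.is_available())"],
))
```

If that one-liner prints `False` inside the container while it prints `True` on the host, the breakage is specific to the container runtime, not to PyTorch itself.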
Problem
When the Docker image is re-created, PyTorch fails to detect CUDA until `sudo apt upgrade -y` and a reboot are done.
```
File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 240, in _load_from_bytes
    return torch.load(io.BytesIO(b))
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 795, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1012, in _legacy_load
    result = unpickler.load()
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 958, in persistent_load
    wrap_storage=restore_location(obj, location),
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 215, in default_restore_location
    result = fn(storage, location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 166, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```
It seems that CUDA or NVIDIA driver package updates in the apt repository cause the problem, by creating an incompatibility between the NVIDIA driver on the host OS and the CUDA toolkit inside the Docker container. But why?
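My current understanding of the suspected incompatibility, expressed as a check: the CUDA user-mode libraries baked into the container image require a minimum host kernel-driver version, so a container rebuilt against a newer CUDA stack can fail on a host whose driver has not been upgraded yet. A rough sketch; the minimum-driver figures are my reading of NVIDIA's CUDA release notes for Linux, and the helper names are mine:

```python
# Sketch: a container built for a given CUDA release only works if the
# host's NVIDIA kernel driver meets that release's minimum version.
# Figures below are taken (by me) from NVIDIA's CUDA release notes.
MIN_DRIVER_FOR_CUDA = {
    "11.7": (515, 43, 4),   # assumed minimum Linux driver for CUDA 11.7
    "11.8": (520, 61, 5),   # assumed minimum Linux driver for CUDA 11.8
}

def parse_version(text: str) -> tuple:
    """Turn a driver string like '515.43.04' into (515, 43, 4)."""
    return tuple(int(part) for part in text.split("."))

def driver_supports(cuda_version: str, host_driver: str) -> bool:
    """True if the host driver can run a container built for cuda_version."""
    return parse_version(host_driver) >= MIN_DRIVER_FOR_CUDA[cuda_version]

# A host driver predating the CUDA 11.7 minimum fails the check; after
# `apt upgrade` installs a newer driver (and a reboot loads it), it passes.
print(driver_supports("11.7", "510.108.03"))  # older host driver -> False
print(driver_supports("11.7", "515.105.01"))  # upgraded host driver -> True
```

If this model is right, `apt upgrade` fixes things because it pulls in a driver new enough for the freshly rebuilt container, and the reboot is needed to load the new kernel module. That would also explain why the problem only appears after the image is recreated.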
