
Question

Why is `sudo apt upgrade` on the host OS required to make CUDA work in a Docker container? The problem does not occur without Docker; it occurs only when the Docker image is recreated.

Environment

Ubuntu 22.04 LTS
Docker version 26.0.1, build d260a54

Dockerfile

    #--------------------------------------------------------------------------------
    # Dockerfile to build the base image with requirements and models downloaded.
    #
    # CUDA 11.7 and PyTorch 1.13.1 because of the deepdoctection requirements.
    # https://github.com/deepdoctection/deepdoctection#requirements
    # The PyTorch version that satisfies 1.12 <= PyTorch < 2.0 is 1.13.1.
    # https://pytorch.org/get-started/previous-versions/#v1130
    #--------------------------------------------------------------------------------
    FROM nvidia/cuda:11.7.1-devel-ubuntu22.04

    # Create working directory
    WORKDIR /home/eml

    # Copy under code/python
    COPY . .

    # Note: every RUN command creates an image layer, increasing the image size.

    #--------------------------------------------------------------------------------
    # Ubuntu libs and Timezone (https://serverfault.com/q/949991).
    # [deepdoctection dependency]
    # - poppler
    #   https://pdf2image.readthedocs.io/en/latest/installation.html#installing-poppler
    # - tesseract-ocr
    # - qpdf for encrypted pdf. See AIML-130.
    #--------------------------------------------------------------------------------
    ARG DEBIAN_FRONTEND=noninteractive
    ENV TZ=Australia/Sydney
    RUN apt -y update && \
        apt install -y tzdata \
        software-properties-common git cmake wget pkg-config tree ffmpeg libsm6 libxext6 \
        tesseract-ocr libtesseract-dev tesseract-ocr-eng poppler-utils qpdf jq gpustat \
        || exit 1

    #--------------------------------------------------------------------------------
    # Py3.10 libs
    # https://launchpad.net/~deadsnakes/+archive/ubuntu/ppa
    # https://askubuntu.com/a/1398569
    # https://www.youtube.com/watch?v=Xe40amojaXE
    #--------------------------------------------------------------------------------
    RUN add-apt-repository --yes ppa:deadsnakes/ppa && \
        apt install -y python3.10 python3-pip build-essential libssl-dev libffi-dev python3-venv \
        || exit 1

    #--------------------------------------------------------------------------------
    # PyTorch/CUDA
    # https://pytorch.org/get-started/previous-versions/#linux-and-windows-9
    #--------------------------------------------------------------------------------
    RUN pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 \
        --extra-index-url https://download.pytorch.org/whl/cu117

    #--------------------------------------------------------------------------------
    # Group/User
    #--------------------------------------------------------------------------------
    #RUN groupadd -g 2000 eml && \
    #    useradd -rm -d /home/eml -s /bin/bash -g eml -u 2001 eml && \
    #    chown -R eml:eml /home/eml

    #--------------------------------------------------------------------------------
    # Non-root user
    # Causes issues, e.g.
    # - mounted volume access check with os/pathlib does not work.
    # - torch.cuda.is_available() becomes False.
    # Need to research how to handle file permissions and GPU access with a
    # non-root docker user.
    #--------------------------------------------------------------------------------
    # USER eml
    # ENV PATH="${PATH}:${HOME}/.local/bin"

    #--------------------------------------------------------------------------------
    # Packages
    #--------------------------------------------------------------------------------
    RUN pip install -r ./requirements.txt && \
        python3 -m spacy download en_core_web_trf && \
        python3 -m nltk.downloader words && \
        python3 -m nltk.downloader wordnet && \
        huggingface-cli download sentence-transformers/gtr-t5-large \
        || exit 1

    #--------------------------------------------------------------------------------
    # Run the application
    # https://stackoverflow.com/a/46245972/4281353
    # > if you have a docker image where your script is the ENTRYPOINT, any arguments
    # > you pass to the docker run command will be added to the entrypoint.
    # > ```
    # > docker run --rm <yourImageName> -a API_KEY - f FILENAME -o ORG_ID
    # > ```
    #--------------------------------------------------------------------------------
    # The executable run by this container is always python3
    ENTRYPOINT ["python3"]
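Because the ENTRYPOINT is `python3`, anything passed after the image name in `docker run` is handed to Python, so the same image can run a quick CUDA check. The sketch below only builds that command; the image tag `eml-base` and the helper names are hypothetical, and `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host.

```python
import shlex
import subprocess

# Python code that the container's ENTRYPOINT ["python3"] runs via `-c`.
CUDA_CHECK = (
    "import torch; "
    "print('torch', torch.__version__, 'built for CUDA', torch.version.cuda); "
    "print('cuda available:', torch.cuda.is_available())"
)


def gpu_smoke_test_argv(image: str) -> list[str]:
    """Build a `docker run` argument list. Arguments after the image name are
    appended to the ENTRYPOINT, so the container runs `python3 -c CUDA_CHECK`."""
    return ["docker", "run", "--rm", "--gpus", "all", image, "-c", CUDA_CHECK]


def run_smoke_test(image: str, execute: bool = False) -> list[str]:
    """Print the shell-quoted command and optionally execute it."""
    argv = gpu_smoke_test_argv(image)
    print(shlex.join(argv))  # copy/paste-able form of the command
    if execute:
        # Requires Docker plus the NVIDIA Container Toolkit on the host.
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    run_smoke_test("eml-base")  # hypothetical tag; pass execute=True to run it
```

Running this right after rebuilding the image, and again after the host upgrade and reboot, would show whether `torch.cuda.is_available()` flips independently of the application code.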

Problem

When the Docker image is recreated, PyTorch fails to detect CUDA until `sudo apt upgrade -y` is run on the host and the machine is rebooted.

      File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 240, in _load_from_bytes
        return torch.load(io.BytesIO(b))
      File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 795, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1012, in _legacy_load
        result = unpickler.load()
      File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 958, in persistent_load
        wrap_storage=restore_location(obj, location),
      File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 215, in default_restore_location
        result = fn(storage, location)
      File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 182, in _cuda_deserialize
        device = validate_cuda_device(location)
      File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 166, in validate_cuda_device
        raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

It seems that CUDA or NVIDIA driver updates in the apt repository cause the problem, presumably by creating an incompatibility between the NVIDIA driver on the host OS and the CUDA toolkit inside the Docker container. But why?
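One way to probe this on the host is to compare the version of the NVIDIA kernel module that is currently loaded with the version of the user-space driver library (`libcuda.so`) installed on disk. The idea that an upgrade replaced the library while the old module stayed loaded until reboot is an assumption, not something the question confirms. The `/proc/driver/nvidia/version` line format and the `/usr/lib/x86_64-linux-gnu/libcuda.so.<version>` location are also assumptions about a typical Ubuntu x86_64 install, and the helper names are hypothetical.

```python
import glob
import os
import re
from typing import Optional

PROC_VERSION = "/proc/driver/nvidia/version"
# Assumed library location for Ubuntu x86_64; adjust for other layouts.
LIB_GLOB = "/usr/lib/x86_64-linux-gnu/libcuda.so.*"


def kernel_module_version(text: str) -> Optional[str]:
    """Extract X.Y[.Z] from a line like 'NVRM version: ... Kernel Module  X.Y.Z  ...'."""
    m = re.search(r"Kernel Module.*?\s(\d+\.\d+(?:\.\d+)?)(?=\s|$)", text)
    return m.group(1) if m else None


def library_version(paths: list[str]) -> Optional[str]:
    """Return the full version from a name like libcuda.so.535.171.04 (skips libcuda.so.1)."""
    for path in paths:
        m = re.search(r"libcuda\.so\.(\d+\.\d+(?:\.\d+)?)$", os.path.basename(path))
        if m:
            return m.group(1)
    return None


def check() -> str:
    """Report loaded-module vs on-disk library versions and flag a mismatch."""
    if not os.path.exists(PROC_VERSION):
        return "NVIDIA kernel module not loaded (or not visible here)"
    with open(PROC_VERSION) as f:
        kmod = kernel_module_version(f.read())
    lib = library_version(sorted(glob.glob(LIB_GLOB)))
    if kmod and lib and kmod != lib:
        return f"mismatch: kernel module {kmod} vs user-space library {lib}"
    return f"kernel module {kmod}, user-space library {lib}"


if __name__ == "__main__":
    print(check())
```

If the two versions differ right after an unattended upgrade and match again after a reboot, that would point at the host driver rather than at the rebuilt image.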

muru
mon
