# Build and run a Docker image
CSCS provides and maintains the Container Engine (CE), a set of tools that help users run containerized applications while exploiting all the features of the underlying hardware.

This how-to walks through all the steps needed to configure the CE, create a Dockerfile, build and export the image, and finally run the image on a compute node. We will use PyTorch as a working example, starting from an image in the NVIDIA NGC catalog.
## Preparing the Dockerfile
- Move to a directory of your choice (preferably somewhere in your `$SCRATCH` folder) and create the following file, named `Dockerfile`:

  ```dockerfile
  ARG BASE_TAG=25.01-py3

  FROM nvcr.io/nvidia/pytorch:$BASE_TAG

  ENV DEBIAN_FRONTEND=noninteractive

  RUN apt-get update \
      && apt-get install -y python3-venv \
      && apt-get clean \
      && rm -rf /var/lib/apt/lists/*
  ```

  If you need extra Python packages in the image as well, see the sketch after this list.
- Before we can build the image from that Dockerfile, we need to configure the storage locations of Podman, the tool used to build the Docker image. Create the file `~/.config/containers/storage.conf` with this content:

  ```toml
  [storage]
  driver = "overlay"
  runroot = "/dev/shm/$USER/runroot"
  graphroot = "/dev/shm/$USER/root"

  [storage.options.overlay]
  mount_program = "/usr/bin/fuse-overlayfs-1.13"
  ```
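The base NGC image already ships with PyTorch and CUDA, so the Dockerfile above only adds system packages. If you need extra Python packages as well, a minimal sketch would be to append a layer like the following (the package names are placeholders, not part of this guide's setup):

```dockerfile
# Hypothetical extra layer: add Python packages on top of the NGC image.
# Replace the placeholder package names with whatever your project needs.
RUN pip install --no-cache-dir transformers datasets
```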
## Building the image
Since it might take some time, we are going to build the image on a compute node.
- Ask for an interactive allocation:

  ```bash
  srun --account <your-account> --pty bash
  ```

  Replace `<your-account>` with your actual project name (or account).
- Once the allocation is granted, move to the folder where we created the Dockerfile and run:

  ```bash
  podman build -t pytorch:25.01-py3 .
  ```

  The build time depends on the complexity of the Dockerfile; in this case it should be fairly quick, around 10 minutes or so.
- Before the allocation ends, we must export the built image somewhere on disk: the build happened in the compute node's memory, so once the allocation expires we would lose everything. Export the image with:

  ```bash
  enroot import -x mount -o pytorch-25.01-py3.sqsh podman://pytorch:25.01-py3
  ```

  The command above creates the file `pytorch-25.01-py3.sqsh` in the current working directory. To sanity-check the result, see the commands after this list.
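Before releasing the allocation, it is worth a quick check that the image was built and exported correctly. A simple sketch using standard Podman and shell commands:

```bash
# The pytorch:25.01-py3 tag should appear among the local images
podman images

# The squashfs file should exist and have a plausible size (several GB)
ls -lh pytorch-25.01-py3.sqsh
```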
## Running the container
The last step before running a container from the image we built is creating an Environment Definition File (EDF). This file tells the CE where to look for the image to run, which folders to mount inside the container, and many other options we might need or want to customize. For example, we might want to set some environment variables that we need inside the container when it starts.
- Create the following file: `~/.edf/pytorch-25.01-py3.toml` (and create the `~/.edf` directory if it doesn't exist):

  ```toml
  image = "/path/to/your/pytorch-25.01-py3.sqsh" # Adjust this path
  mounts = ["/capstor", "/users"]
  workdir = "/capstor/scratch/cscs/<username>" # Change this to your username
  writable = true # Important if we want to modify the container while running it

  [annotations]
  com.hooks.aws_ofi_nccl.enabled = "true"
  com.hooks.aws_ofi_nccl.variant = "cuda12"

  [env]
  FI_CXI_DISABLE_HOST_REGISTER = "1"
  FI_MR_CACHE_MONITOR = "userfaultfd"
  NCCL_DEBUG = "INFO"
  ```
- Now we're ready to run the container with the command:

  ```bash
  srun -A <account> --environment=pytorch-25.01-py3 --pty bash
  ```

  The `--environment` option looks for a valid EDF file (the one we just created) and loads the container specification (such as the image we want to use) from that file, plus the additional options.

  You can also specify a different working directory with `--container-workdir=<your-dir>`. You can set it to `$PWD` if you want to be in the current working directory when the container starts. For non-interactive jobs, see the batch-script sketch after this list.
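The same EDF can also be referenced from a batch script. The following is a minimal sketch (the account, time limit, and Python one-liner are placeholders), assuming `--environment` is accepted by `srun` inside a batch job just as it is interactively:

```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=00:10:00
#SBATCH --job-name=ce-pytorch-test

# Run a command inside the container described by the EDF
srun --environment=pytorch-25.01-py3 python -c "import torch; print(torch.__version__)"
```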
## Test the container
Once the allocation is granted, we'll be inside our running container. Since we created a PyTorch Docker image, we can test that all went fine with a Python script that imports the `torch` package and runs some hardware checks:
```python
import torch


def check_pytorch_cuda():
    print("Checking PyTorch Installation...")

    # Check if PyTorch is installed
    try:
        print(f"PyTorch Version: {torch.__version__}")
    except AttributeError:
        print("PyTorch is not installed correctly.")
        return

    # Check if CUDA is available
    if torch.cuda.is_available():
        print("CUDA is available!")
        print(f"CUDA Version: {torch.version.cuda}")
        print(f"cuDNN Version: {torch.backends.cudnn.version()}")
        print(f"Number of CUDA Devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    else:
        print("CUDA is NOT available. Check your installation.")


if __name__ == "__main__":
    check_pytorch_cuda()
```
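Save the script inside the container, for example as `check_pytorch.py` (the file name is arbitrary), and run it:

```bash
python check_pytorch.py
```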
If all went fine, we should see output similar to the following (the exact details depend on the machine where the container runs):
```
Checking PyTorch Installation...
PyTorch Version: 2.6.0a0+ecf3bae40a.nv25.01
CUDA is available!
CUDA Version: 12.8
cuDNN Version: 90700
Number of CUDA Devices: 4
Device 0: NVIDIA GH200 120GB
Device 1: NVIDIA GH200 120GB
Device 2: NVIDIA GH200 120GB
Device 3: NVIDIA GH200 120GB
```
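Since the EDF enables the aws-ofi-nccl hook and sets `NCCL_DEBUG=INFO`, you may also want to check multi-GPU communication. The following is a minimal sketch (the file name is hypothetical), assuming four GPUs on the node and that `torchrun` is available in the image, as it is in recent NGC PyTorch releases:

```python
# all_reduce_check.py -- launch with: torchrun --nproc_per_node=4 all_reduce_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor holding its own rank id
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # With 4 ranks every process should print 0+1+2+3 = 6.0
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With `NCCL_DEBUG=INFO` set in the EDF, the NCCL initialization messages printed alongside this output should also show which network plugin NCCL selected.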