# Build and run a Docker image
CSCS provides and maintains the Container Engine (CE), a set of tools that help users run containerized applications while exploiting all the features of the underlying hardware.

This how-to walks through all the steps needed to configure the CE, create a Dockerfile, build and export the image, and finally run the image on a compute node. We will use PyTorch as a working example, starting from an image in the NVIDIA NGC catalog.
## Preparing the Dockerfile
- Move to a directory of your choice (preferably somewhere in your `$SCRATCH` folder) and create the following file, named `Dockerfile`:

  ```dockerfile
  ARG BASE_TAG=25.01-py3

  FROM nvcr.io/nvidia/pytorch:$BASE_TAG

  ENV DEBIAN_FRONTEND=noninteractive

  RUN apt-get update \
      && apt-get install -y python3-venv \
      && apt-get clean \
      && rm -rf /var/lib/apt/lists/*
  ```

  If you need extra Python packages in the image as well, see the sketch after this list.
- Before we can build the image from that Dockerfile, we need to configure the storage locations of Podman, the tool used to build the Docker image. Create the file `~/.config/containers/storage.conf` with this content:

  ```toml
  [storage]
  driver = "overlay"
  runroot = "/dev/shm/$USER/runroot"
  graphroot = "/dev/shm/$USER/root"

  [storage.options.overlay]
  mount_program = "/usr/bin/fuse-overlayfs-1.13"
  ```
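The base NGC image already ships with PyTorch and CUDA, so the Dockerfile above only adds system packages. If you need extra Python packages as well, a minimal sketch would be to append a layer like the following (the package names are placeholders, not part of this guide's setup):

```dockerfile
# Hypothetical extra layer: add Python packages on top of the NGC image.
# Replace the placeholder package names with whatever your project needs.
RUN pip install --no-cache-dir transformers datasets
```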
## Building the image
Since it might take some time, we are going to build the image on a compute node.
- Ask for an interactive allocation:

  ```bash
  srun --account <your-account> --pty bash
  ```

  Replace `<your-account>` with your actual project name (or account).
- Once the allocation is granted, move to the folder where we created the Dockerfile and run:

  ```bash
  podman build -t pytorch:25.01-py3 .
  ```

  The build time depends on the complexity of the Dockerfile; in this case it should be fairly quick, around 10 minutes or so.
- Before the allocation ends, we must export the built image somewhere on disk: the build happened in the compute node's memory, so once the allocation expires we would lose everything. Export the image with:

  ```bash
  enroot import -x mount -o pytorch-25.01-py3.sqsh podman://pytorch:25.01-py3
  ```

  The command above creates the file `pytorch-25.01-py3.sqsh` in the current working directory. To sanity-check the result, see the commands after this list.
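Before releasing the allocation, it is worth a quick check that the image was built and exported correctly. A simple sketch using standard Podman and shell commands:

```bash
# The pytorch:25.01-py3 tag should appear among the local images
podman images

# The squashfs file should exist and have a plausible size (several GB)
ls -lh pytorch-25.01-py3.sqsh
```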
## Running the container
The last step before running a container from the image we built is creating an Environment Definition File (EDF). This file tells the CE where to look for the image to run, which folders to mount inside the container, and many other options we might need or want to customize. For example, we might want to set some environment variables that we need inside the container when it starts.
- Create the following file: `~/.edf/pytorch-25.01-py3.toml` (and create the `~/.edf` directory if it doesn't exist):

  ```toml
  image = "/path/to/your/pytorch-25.01-py3.sqsh" # Adjust this path
  mounts = ["/capstor", "/users"]
  workdir = "/capstor/scratch/cscs/<username>" # Change this to your username
  writable = true # Important if we want to modify the container while running it

  [annotations]
  com.hooks.aws_ofi_nccl.enabled = "true"
  com.hooks.aws_ofi_nccl.variant = "cuda12"

  [env]
  FI_CXI_DISABLE_HOST_REGISTER = "1"
  FI_MR_CACHE_MONITOR = "userfaultfd"
  NCCL_DEBUG = "INFO"
  ```
- Now we're ready to run the container with the command:

  ```bash
  srun -A <account> --environment=pytorch-25.01-py3 --pty bash
  ```

  The `--environment` option looks for a valid EDF file (the one we just created) and loads the container specification (such as the image we want to use) from that file, plus the additional options.

  You can also specify a different working directory with `--container-workdir=<your-dir>`. You can set it to `$PWD` if you want to be in the current working directory when the container starts. For non-interactive jobs, see the batch-script sketch after this list.
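The same EDF can also be referenced from a batch script. The following is a minimal sketch (the account, time limit, and Python one-liner are placeholders), assuming `--environment` is accepted by `srun` inside a batch job just as it is interactively:

```bash
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --time=00:10:00
#SBATCH --job-name=ce-pytorch-test

# Run a command inside the container described by the EDF
srun --environment=pytorch-25.01-py3 python -c "import torch; print(torch.__version__)"
```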
## Test the container
Once the allocation is granted, we'll be inside our running container. Since we created a PyTorch Docker image, we can test that all went fine with a Python script that imports the `torch` package and runs some hardware checks:
```python
import torch


def check_pytorch_cuda():
    print("Checking PyTorch Installation...")

    # Check if PyTorch is installed
    try:
        print(f"PyTorch Version: {torch.__version__}")
    except AttributeError:
        print("PyTorch is not installed correctly.")
        return

    # Check if CUDA is available
    if torch.cuda.is_available():
        print("CUDA is available!")
        print(f"CUDA Version: {torch.version.cuda}")
        print(f"cuDNN Version: {torch.backends.cudnn.version()}")
        print(f"Number of CUDA Devices: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")
    else:
        print("CUDA is NOT available. Check your installation.")


if __name__ == "__main__":
    check_pytorch_cuda()
```
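Save the script inside the container, for example as `check_pytorch.py` (the file name is arbitrary), and run it:

```bash
python check_pytorch.py
```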
If all went fine, we should see output similar to the following (the exact details depend on the machine where the container runs):
```
Checking PyTorch Installation...
PyTorch Version: 2.6.0a0+ecf3bae40a.nv25.01
CUDA is available!
CUDA Version: 12.8
cuDNN Version: 90700
Number of CUDA Devices: 4
Device 0: NVIDIA GH200 120GB
Device 1: NVIDIA GH200 120GB
Device 2: NVIDIA GH200 120GB
Device 3: NVIDIA GH200 120GB
```
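Since the EDF enables the aws-ofi-nccl hook and sets `NCCL_DEBUG=INFO`, you may also want to check multi-GPU communication. The following is a minimal sketch (the file name is hypothetical), assuming four GPUs on the node and that `torchrun` is available in the image, as it is in recent NGC PyTorch releases:

```python
# all_reduce_check.py -- launch with: torchrun --nproc_per_node=4 all_reduce_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor holding its own rank id
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # With 4 ranks every process should print 0+1+2+3 = 6.0
    print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With `NCCL_DEBUG=INFO` set in the EDF, the NCCL initialization messages printed alongside this output should also show which network plugin NCCL selected.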