Running a massively scalable CUDA-accelerated AI/ML lab on WSL 2 with Determined
Getting a scalable AI/ML model training environment set up and running on WSL 2, with Docker Desktop and CUDA GPU compute.
data:image/s3,"s3://crabby-images/88a8c/88a8ccdcbbe9cbf5da500f49f559160d841dd4d9" alt="Running a massively scalable CUDA-accelerated AI/ML lab on WSL 2 with Determined"
Determined is an open source platform for AI/ML model development and training at scale.
Determined handles machine provisioning, networking, and data loading, and provides fault tolerance.
data:image/s3,"s3://crabby-images/95eab/95eabe73f84c4c92dbe092fae7e6e6290e7fd870" alt=""
It lets AI/ML engineers pool and share computing resources and track experiments, and it supports deep learning frameworks like PyTorch, TensorFlow, and Keras.
I am still learning about AI/ML. My interest was piqued after GPU compute arrived on Windows Subsystem for Linux (WSL), starting with CUDA.
Determined seems like a very cool and easy-to-use platform to learn on. It offers a web-based dashboard and includes a built-in AI/ML IDE.
There are several ways to deploy Determined, including pip, and several ways to use it, such as the terminal CLI tool, det.
My preference is a more cloud-native approach: deploying with containers and interacting through the web-based dashboard.
This guide will cover setting up a local Determined deployment on WSL 2 with Docker Desktop.
We will:
- Verify a working GPU setup on WSL
- Deploy a database backend container
- Deploy a Determined master node container connected to the database
- Deploy and connect a Determined agent node container to the Determined master node
- Launch the JupyterLab IDE in the Determined web interface
Requirements for this tutorial:
- Windows 11 (recommended) or Windows 10 21H2
- Windows Subsystem for Linux Preview from the Microsoft Store (recommended), or the standard Windows Subsystem for Linux feature; with the latter, run wsl.exe --update to make sure you have the latest WSL kernel
- The latest NVIDIA GPU drivers directly from NVIDIA, not just Windows Update drivers
- Any WSL distro
- Docker Desktop 4.9+ installed with WSL integration enabled for the WSL distro you are going to be working in
- A CUDA-enabled NVIDIA GPU, e.g. GeForce GTX 1080 or higher*
*This workflow also works without a CUDA-enabled NVIDIA GPU; it defaults to CPU-only when no GPU is available.
Links
- Determined.AI
- Determined Docs
- Enable NVIDIA CUDA on WSL (Microsoft Docs)
Basics
Verify that Docker Desktop is accessible from WSL 2:
docker --version
data:image/s3,"s3://crabby-images/f3120/f3120ccc08eec2288cbf126d832995d14ed2cea5" alt=""
This should not be docker-ce or an equivalent installed in WSL, but the aliases Docker Desktop places using WSL integration:
data:image/s3,"s3://crabby-images/41157/411572f00153fe898f0066f35c75448753306170" alt=""
Verify that GPU support is working in Docker and WSL 2:
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
data:image/s3,"s3://crabby-images/de09a/de09affadf36cf4b08e790d65c8a8b6a31af14d0" alt=""
Note that my NVIDIA GeForce RTX 2070 Super is visible in the nvidia-smi output.
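To make this check scriptable, here is a minimal sketch. It assumes nvidia-smi is asked for one GPU name per line via its --query-gpu=name --format=csv,noheader options; count_gpus is a hypothetical helper name, not part of any tool used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the non-empty lines on stdin.
# Assumes one GPU name per line, as printed by
#   nvidia-smi --query-gpu=name --format=csv,noheader
count_gpus() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the exit status clean
  grep -c . || true
}

# Usage (requires Docker Desktop with GPU support):
# docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 \
#   nvidia-smi --query-gpu=name --format=csv,noheader | count_gpus
```

A count of 0 suggests the GPU is not being passed through, in which case the CPU-only path later in this guide still applies.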
Set up PostgreSQL
Start an instance of PostgreSQL:
docker run -d --name determined-db -p 5432:5432 -v determined_db:/var/lib/postgresql/data -e POSTGRES_DB=determined -e POSTGRES_PASSWORD=password postgres:10
I recommend changing your password to anything besides password.
data:image/s3,"s3://crabby-images/23b32/23b320906192e467246eef2bb3f5295e7fad3773" alt=""
data:image/s3,"s3://crabby-images/71c20/71c20e4e6dbfe29fc5eed718ad2110ca82f3158f" alt=""
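The database can take a few seconds to accept connections after the container starts. The sketch below waits for it before moving on. It assumes the pg_isready utility that ships in the postgres image; retry is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1            # give up after the last allowed attempt
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 0
}

# Usage: wait up to about 30 seconds for PostgreSQL in the determined-db container.
# retry 30 1 docker exec determined-db pg_isready -U postgres
```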
Get your WSL IP address
Grab your WSL instance's eth0 IP address from ip, parse it using sed, and stash it in a shell variable, $WSLIP:
WSLIP=$(ip -f inet addr show eth0 | sed -En -e 's/.*inet ([0-9.]+).*/\1/p')
data:image/s3,"s3://crabby-images/56464/56464b1c0b34aedeb21af3083e3444743500f039" alt=""
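If eth0 has no IPv4 address, $WSLIP ends up empty and the later docker run commands fail in confusing ways. As a minimal sketch, the same sed expression can be wrapped in a function and the result checked; extract_ipv4 is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the first IPv4 address from `ip addr` output on stdin,
# using the same sed expression as the command above.
extract_ipv4() {
  sed -En -e 's/.*inet ([0-9.]+).*/\1/p' | head -n 1
}

# Usage: fail early if eth0 has no IPv4 address.
# WSLIP=$(ip -f inet addr show eth0 | extract_ipv4)
# [ -n "$WSLIP" ] || echo "no IPv4 address found on eth0" >&2
```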
Start the Determined Master Node
Start up an instance of the determined-master image, connected to the PostgreSQL determined database we spun up on port 5432:
docker run -d --name determined-master -p 8080:8080 -e DET_DB_HOST=$WSLIP -e DET_DB_NAME=determined -e DET_DB_PORT=5432 -e DET_DB_USER=postgres -e DET_DB_PASSWORD=password determinedai/determined-master:latest
data:image/s3,"s3://crabby-images/b2a5e/b2a5ea51d202bfa47b61c3fbaae80275c431be7b" alt=""
data:image/s3,"s3://crabby-images/6333c/6333cc2320995abd262a6587f130f3d70a886fe8" alt=""
Launch the Determined Master Node web dashboard:
powershell.exe /c start http://$WSLIP:8080
data:image/s3,"s3://crabby-images/1a42e/1a42e3a8d3c1176cb40f00d2065d70dd39b67e9c" alt=""
Use the default admin account, with no password, to log in.
Now you have access to the Determined dashboard.
data:image/s3,"s3://crabby-images/7d90e/7d90e20f883d40787b29f0aa662e37fda616a920" alt=""
But we do not have any agents connected to run experiments on.
data:image/s3,"s3://crabby-images/4032c/4032cca7702ee7cbe0dabad343ee85ef10faff45" alt=""
Attach a Determined Agent Node
Start up an instance of the determined-agent image, pointed at our Determined Master host IP ($WSLIP) and port (8080):
docker run -d --gpus all -v /var/run/docker.sock:/var/run/docker.sock --name determined-agent -e DET_MASTER_HOST=$WSLIP -e DET_MASTER_PORT=8080 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility determinedai/determined-agent:latest
data:image/s3,"s3://crabby-images/e7ec1/e7ec15851b1229e8935234133380691054784e86" alt=""
Note:
- Include --gpus all to pass our NVIDIA GPU through to the determined-agent container.
- Set NVIDIA_DRIVER_CAPABILITIES to also include compute, overriding the determined-agent default of just utility. This enables the agent to detect the passed-through CUDA GPU. This issue was documented and I submitted a PR.
- If you do not have a CUDA-enabled GPU and wish to use CPU only, use:
docker run -d -v /var/run/docker.sock:/var/run/docker.sock --name determined-agent -e DET_MASTER_HOST=$WSLIP -e DET_MASTER_PORT=8080 determinedai/determined-agent:latest
data:image/s3,"s3://crabby-images/b980e/b980e0cfd0aeca0906f6d737d92a2b66b72388a6" alt=""
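The GPU and CPU-only commands differ only in two flags, so one script can cover both cases. The following is a minimal sketch; agent_gpu_flags is a hypothetical helper name, and the GPU count passed to it is assumed to come from your own check (for example, the nvidia-smi test earlier).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the extra `docker run` flags for the agent.
# $1 is the number of GPUs detected (0 or missing means CPU-only, so nothing is printed).
agent_gpu_flags() {
  if [ "${1:-0}" -gt 0 ]; then
    echo "--gpus all -e NVIDIA_DRIVER_CAPABILITIES=compute,utility"
  fi
}

# Usage (word splitting of the unquoted flags is intentional here):
# docker run -d $(agent_gpu_flags 1) -v /var/run/docker.sock:/var/run/docker.sock \
#   --name determined-agent -e DET_MASTER_HOST=$WSLIP -e DET_MASTER_PORT=8080 \
#   determinedai/determined-agent:latest
```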
Return to the Determined dashboard to see our cluster:
powershell.exe /c start http://$WSLIP:8080/det/clusters
We can now see 1 connected agent and 0/1 CUDA slots allocated, ready for training deep learning models:
data:image/s3,"s3://crabby-images/2c0a5/2c0a556c9b68c922c4985f331cd2f513c017b55d" alt=""
Click Launch JupyterLab to spin up a web-based Python IDE for notebooks, code, and data:
data:image/s3,"s3://crabby-images/5e815/5e815c2b72510c3370a595c842d505fa9c52f88f" alt=""
And our available CUDA GPU will be automatically assigned. You can see how it is provisioned and visible in the Determined dashboard:
data:image/s3,"s3://crabby-images/55f80/55f801f40ffb52af808fefa0c0d9b813318c8e48" alt=""
data:image/s3,"s3://crabby-images/b229c/b229c106493827ccafa39ba45a1416ccbd86e7a4" alt=""
data:image/s3,"s3://crabby-images/43ea5/43ea53a7688ffc68e08a2e35f08b650ca1fed6cd" alt=""
And now we have a CUDA-accelerated JupyterLab Python AI/ML IDE:
data:image/s3,"s3://crabby-images/e0436/e04362ced21be5f4fee7e14be69b7ae5df06b9e4" alt=""
We can even start up additional CPU-only Determined worker agents:
docker run -d -v /var/run/docker.sock:/var/run/docker.sock --name determined-agent-2 -e DET_MASTER_HOST=$WSLIP -e DET_MASTER_PORT=8080 determinedai/determined-agent:latest
Note the tweaked container name, determined-agent-2.
And see those resources available in the Determined web dashboard:
data:image/s3,"s3://crabby-images/d6814/d6814219b6863a517560579ecdc012096333b2ef" alt=""
Notes
- When stopping determined-agent, be sure to stop determined-fluent too.
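To shut the whole lab down in one go, here is a minimal sketch. It assumes the container names used in this guide plus determined-fluent from the note above; stop_lab and DRY_RUN are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Container names used in this guide, plus determined-fluent from the note above.
# Agents are listed first so they stop before the master and the database.
CONTAINERS="determined-agent-2 determined-agent determined-fluent determined-master determined-db"

# Hypothetical helper: stop every container in the list.
# With DRY_RUN=1 it only prints the commands instead of running them.
stop_lab() {
  local name
  for name in $CONTAINERS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker stop $name"
    else
      docker stop "$name" || true   # keep going if a container is not running
    fi
  done
}

# Usage:
# DRY_RUN=1 stop_lab   # preview the commands
# stop_lab             # actually stop the containers
```

The determined_db volume is left untouched, so the database contents survive a restart of the containers.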