Ollama ROCm Troubleshooting (Proxmox LXC)
This guide is for the toolkit setup in this repo:
- Unprivileged Ubuntu 24.04 LXC on Proxmox
- ROCm-enabled GPU passthrough
- Ollama managed by systemd (service user llm-svc)
Quick triage checklist
Run these first in the CT:
systemctl status ollama --no-pager
systemctl show ollama -p User,Group,SupplementaryGroups,Environment --no-pager
journalctl -u ollama -n 200 --no-pager
ss -ltnp | grep 11434 || true
From another machine:
curl http://<ct-ip>:11434/api/tags
Symptom: Open WebUI can connect, but model load fails with 500
Typical log pattern:
- library=ROCm is detected
- then ROCm error: out of memory
- then model failed to load
Why this happens
Ollama can detect ROCm correctly and still run out of GPU memory while allocating the model graph or KV cache, often because of a large context size or aggressive defaults.
If the logs show a multimodal architecture such as qwen3vl and stack frames under multimodal.go, even a small-parameter model can fail because of the memory reserved for the multimodal graph.
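The log pattern above can be checked mechanically. The following is a minimal sketch, not part of this repo's scripts: classify_ollama_log is a hypothetical helper name, and it matches only the literal strings quoted in this guide, which may differ between Ollama versions.

```shell
# Hypothetical helper: classify Ollama journal text read from stdin.
# It looks only for the literal strings quoted in this guide.
classify_ollama_log() {
  log_text=$(cat)
  case "$log_text" in
    *'ROCm error: out of memory'*)
      # An OOM with multimodal.go frames points at multimodal graph reservation.
      case "$log_text" in
        *multimodal.go*) echo "rocm-oom-multimodal" ;;
        *) echo "rocm-oom" ;;
      esac ;;
    *library=cpu*) echo "cpu-only" ;;
    *'model failed to load'*) echo "load-failed-other" ;;
    *) echo "no-known-failure" ;;
  esac
}

# Usage (in the CT):
#   journalctl -u ollama -n 200 --no-pager | classify_ollama_log
```

A result of rocm-oom or rocm-oom-multimodal suggests the memory fixes in this section; cpu-only points to the device and group checks further down.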
Fix
One-command helper from the Proxmox host:
sudo bash ./scripts/set_ollama_memory_profile_in_ct.sh --ctid <ctid>
Preset examples:
sudo bash ./scripts/set_ollama_memory_profile_in_ct.sh --ctid <ctid> --preset safe
sudo bash ./scripts/set_ollama_memory_profile_in_ct.sh --ctid <ctid> --preset balanced
sudo bash ./scripts/set_ollama_memory_profile_in_ct.sh --ctid <ctid> --preset max
Preset selection quick guide:
| Model size (typical) | Suggested preset | Notes |
|---|---|---|
| 7B to 14B | max | Highest throughput, more aggressive memory/concurrency. |
| 20B to 32B | balanced | Best first choice for stability on large models. |
| 30B+ with load failures/OOM | safe | Uses lower context, KEEP_ALIVE=0, and higher GPU overhead reserve. |
For troubleshooting, test a text-only model first (for example qwen3:8b) before VL/multimodal models.
If a model fails to load, step down from max → balanced → safe before manual tuning.
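The step-down order can be expressed as a small lookup. This is only a sketch: next_safer_preset is a hypothetical helper, not something the repo's scripts provide, and the preset names are the three listed above.

```shell
# Hypothetical helper: print the next more conservative preset.
# Returns 1 when already at "safe", 2 for an unrecognized name.
next_safer_preset() {
  case "$1" in
    max) echo "balanced" ;;
    balanced) echo "safe" ;;
    safe) echo "safe"; return 1 ;;   # nothing below safe; tune manually
    *) echo "unknown preset: $1" >&2; return 2 ;;
  esac
}

# Usage (on the Proxmox host), after a load failure with --preset max:
#   sudo bash ./scripts/set_ollama_memory_profile_in_ct.sh --ctid <ctid> \
#     --preset "$(next_safer_preset max)"
```

A non-zero exit at safe is the signal to move on to the manual method.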
Manual method:
Create a memory-tuning drop-in:
install -d /etc/systemd/system/ollama.service.d
cat >/etc/systemd/system/ollama.service.d/20-memory-tuning.conf <<'EOF'
[Service]
Environment=OLLAMA_CONTEXT_LENGTH=8192
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_NUM_PARALLEL=1
Environment=OLLAMA_FLASH_ATTENTION=false
EOF
systemctl daemon-reload
systemctl restart ollama
If loading still fails, reduce the context further:
sed -i 's/OLLAMA_CONTEXT_LENGTH=8192/OLLAMA_CONTEXT_LENGTH=4096/' /etc/systemd/system/ollama.service.d/20-memory-tuning.conf
systemctl daemon-reload
systemctl restart ollama
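The sed command above only matches when the current value is exactly 8192. A variant that rewrites whatever value is present could look like the sketch below. set_context_length is a hypothetical helper name, and the sketch assumes GNU sed (as on Ubuntu) for the -i flag.

```shell
# Hypothetical helper: set OLLAMA_CONTEXT_LENGTH in a drop-in file,
# regardless of its current value. Assumes GNU sed for -i.
set_context_length() {
  conf_file=$1
  new_value=$2
  # Reject empty or non-numeric values before touching the file.
  case "$new_value" in
    ''|*[!0-9]*) echo "context length must be a positive integer" >&2; return 1 ;;
  esac
  sed -i "s/^Environment=OLLAMA_CONTEXT_LENGTH=.*/Environment=OLLAMA_CONTEXT_LENGTH=${new_value}/" "$conf_file"
}

# Usage (in the CT), followed by the same reload and restart as above:
#   set_context_length /etc/systemd/system/ollama.service.d/20-memory-tuning.conf 4096
```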
Also reduce request-side settings in Open WebUI for that model:
- num_ctx: 4096 or 8192
- num_gpu: start moderate (do not force maximum layer offload)
- concurrency: 1 request at a time
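To test a request-side context limit without going through Open WebUI, the Ollama API accepts the same setting in the options field of a request. The sketch below builds such a request body. build_generate_payload is a hypothetical helper, and it does no JSON escaping, so it is only suitable for simple prompts without quotes or backslashes.

```shell
# Hypothetical helper: build a minimal /api/generate request body
# with an explicit num_ctx. No JSON escaping is performed.
#   $1 = model name, $2 = prompt, $3 = num_ctx
build_generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' "$1" "$2" "$3"
}

# Usage (from a machine that can reach the CT):
#   curl -s http://<ct-ip>:11434/api/generate \
#     -d "$(build_generate_payload qwen3:8b 'Say hello' 4096)"
```

If this direct request loads the model while Open WebUI still fails, the Open WebUI model settings are the likelier place to look.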
Symptom: Ollama only uses CPU (library=cpu, total_vram=0 B)
Checks
id llm-svc
ls -l /dev/kfd /dev/dri/renderD* 2>/dev/null || true
getent group video
getent group render
llm-svc must be in the video and render groups, and the devices must exist in the CT.
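The group requirement can be checked in one step. This is a sketch with a hypothetical helper name (missing_gpu_groups); it takes a space-separated group list, such as the output of id -nG, and prints whichever of video and render is absent.

```shell
# Hypothetical helper: print the required GPU groups that are missing
# from a space-separated list of group names.
missing_gpu_groups() {
  missing=""
  for required in video render; do
    case " $1 " in
      *" $required "*) ;;   # group present
      *) missing="${missing:+$missing }$required" ;;
    esac
  done
  echo "$missing"
}

# Usage (in the CT); empty output means both groups are present:
#   missing_gpu_groups "$(id -nG llm-svc)"
```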
Fix
usermod -aG video,render llm-svc
systemctl restart ollama
If devices are missing, re-run the host passthrough script and restart the CT:
sudo bash ./scripts/configure_gpu_passthrough.sh --ctid <ctid>
pct stop <ctid>
pct start <ctid>
Symptom: API only listening on localhost
Fix (host side helper)
sudo bash ./scripts/expose_ollama_in_ct.sh --ctid <ctid>
Custom bind/port:
sudo bash ./scripts/expose_ollama_in_ct.sh --ctid <ctid> --listen 0.0.0.0 --port 11434
Revert to defaults:
sudo bash ./scripts/close_ollama_network_in_ct.sh --ctid <ctid>
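To confirm which state the listener is in before or after running these helpers, the ss output from the triage checklist can be interpreted as follows. This is a sketch: ollama_bind_scope is a hypothetical helper, and it assumes the usual ss -ltn column layout, with the local address in the fourth column.

```shell
# Hypothetical helper: read `ss -ltn` output on stdin and report
# whether the given port (default 11434) is bound to loopback only.
ollama_bind_scope() {
  awk -v port="${1:-11434}" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      addr = $4
      sub(":" port "$", "", addr)          # strip the port, keep the address
      if (addr == "127.0.0.1" || addr == "[::1]") local_only = 1
      else exposed = 1
    }
    END {
      if (exposed) print "exposed"
      else if (local_only) print "localhost-only"
      else print "not-listening"
    }'
}

# Usage (in the CT):
#   ss -ltn | ollama_bind_scope
```

Any non-loopback address, including a wildcard bind, is reported as exposed.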
Symptom: External network still cannot reach port 11434
Checks
- Proxmox firewall rules at Datacenter/Node/CT levels
- CT firewall (ufw) if enabled
- Correct CT IP/subnet route from Open WebUI host
Minimal firewall rules
In CT (if ufw is used):
ufw allow from <LAN_CIDR> to any port 11434 proto tcp
Log line: failed to parse CPU allowed micro secs in LXC
This warning is common in containers where the cgroup CPU quota reports max. It is usually non-fatal, and it is not the root cause if ROCm is detected and only the model load fails.
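To see what the container reports, the quota can be read directly. The sketch below assumes cgroup v2, where cpu.max holds a quota and a period separated by a space, and assumes the file is visible at /sys/fs/cgroup/cpu.max inside the CT. describe_cpu_max is a hypothetical helper name.

```shell
# Hypothetical helper: interpret the contents of a cgroup v2 cpu.max
# file, which has the form "<quota> <period>" (quota may be "max").
describe_cpu_max() {
  quota=${1%% *}    # text before the first space
  period=${1##* }   # text after the last space
  if [ "$quota" = "max" ]; then
    echo "no CPU quota (unlimited)"
  else
    echo "quota ${quota}us per ${period}us period"
  fi
}

# Usage (in the CT, path is an assumption):
#   describe_cpu_max "$(cat /sys/fs/cgroup/cpu.max)"
```

An unlimited result is consistent with the warning being cosmetic: there is simply no numeric quota to parse.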
Recommended stable baseline for large models
Start conservative, then scale up:
- OLLAMA_CONTEXT_LENGTH=4096 to 8192
- OLLAMA_NUM_PARALLEL=1
- OLLAMA_MAX_LOADED_MODELS=1
- one active large model at a time
Then monitor:
journalctl -u ollama -f
Look for a successful loading model message and no ROCm error: out of memory lines.
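Before reading the logs, it can help to confirm that the baseline values actually reached the service, since a drop-in with a typo is silently ignored. The sketch below extracts one variable from the single line printed by systemctl show ollama -p Environment. service_env_value is a hypothetical helper, and it assumes values contain no spaces or glob characters, which holds for the numeric settings in this guide.

```shell
# Hypothetical helper: extract one variable from the line printed by
# `systemctl show ollama -p Environment`.
#   $1 = the "Environment=..." line, $2 = variable name
# Assumes values contain no spaces or glob characters.
service_env_value() {
  for pair in ${1#Environment=}; do
    case "$pair" in
      "$2"=*) echo "${pair#*=}"; return 0 ;;
    esac
  done
  return 1   # variable not set for the service
}

# Usage (in the CT):
#   service_env_value "$(systemctl show ollama -p Environment --no-pager)" OLLAMA_CONTEXT_LENGTH
```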
Security note
Ollama has no built-in authentication by default. If exposed beyond localhost, restrict by firewall or place behind an authenticated reverse proxy.