Update memory profile script and troubleshooting guide for multimodal models

John Doe
2026-03-01 23:41:08 -05:00
parent d6e7798d20
commit f4caadcc5c
2 changed files with 6 additions and 2 deletions
+3
@@ -34,6 +34,7 @@ Typical log pattern:
### Why this happens
Ollama can detect ROCm correctly but still OOM during model graph/KV allocation, often due to high context size or aggressive defaults.
+If logs show a multimodal architecture such as `qwen3vl` and stack frames under `multimodal.go`, a small-parameter model can still fail due to multimodal graph reservation.
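The multimodal check described above can be scripted against captured log text. This is a sketch under assumptions: `has_multimodal_markers` is a hypothetical helper, it only looks for the two markers this guide names (an architecture string, `qwen3vl` by default, and a `multimodal.go` frame), and real Ollama log wording may differ between versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads Ollama log text on stdin and succeeds only if it
# contains both markers named in this guide:
#   1. a multimodal architecture name (regex, default "qwen3vl")
#   2. a stack frame from multimodal.go
has_multimodal_markers() {
  local arch_pattern=${1:-qwen3vl} log
  log="$(cat)"
  # Architecture marker (extended regex), then a fixed-string match on the file name.
  grep -qE -- "$arch_pattern" <<<"$log" && grep -qF -- 'multimodal.go' <<<"$log"
}
```

On a systemd install, recent log lines can be piped in with something like `journalctl -u ollama --since "10 min ago" --no-pager | has_multimodal_markers && echo "multimodal graph reservation suspected"`.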
### Fix
@@ -59,6 +60,8 @@ Preset selection quick guide:
| 20B to 32B | `balanced` | Best first choice for stability on large models. |
| 30B+ with load failures/OOM | `safe` | Uses lower context, `KEEP_ALIVE=0`, and higher GPU overhead reserve. |
+For troubleshooting, test a text-only model first (for example `qwen3:8b`) before VL/multimodal models.
If a model fails to load, step down from `max` → `balanced` → `safe` before manual tuning.
Manual method:
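The step-down order can be scripted as a loop. This is a sketch under assumptions: `first_working_preset` is a hypothetical helper, and the command it runs for each preset is supplied by the caller (for example a wrapper that re-applies the tuning script with that preset and then tries `ollama run` on the model). The script's actual command-line interface is not shown in this diff, so no flags are assumed here.

```shell
#!/usr/bin/env bash
# Presets from most to least aggressive, in the order the guide recommends.
PRESET_ORDER=(max balanced safe)

# first_working_preset CMD [ARGS...]
# Runs "CMD ARGS... <preset>" for each preset in PRESET_ORDER and prints the
# first preset for which the command exits 0. CMD is caller-supplied
# (hypothetical, e.g. a wrapper that applies the preset and test-loads a model).
first_working_preset() {
  local preset
  for preset in "${PRESET_ORDER[@]}"; do
    if "$@" "$preset"; then
      printf '%s\n' "$preset"
      return 0
    fi
  done
  echo "no preset loaded the model; continue with manual tuning" >&2
  return 1
}
```

For example, `first_working_preset try_preset_with_model`, where `try_preset_with_model` is your own wrapper that returns non-zero when the model fails to load. Stopping at the first success keeps the largest context and parallelism that still fit.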
+3 -2
@@ -69,12 +69,12 @@ done
case "$PRESET" in
safe)
-PRESET_CONTEXT_LENGTH="4096"
+PRESET_CONTEXT_LENGTH="2048"
PRESET_NUM_PARALLEL="1"
PRESET_MAX_LOADED_MODELS="1"
PRESET_FLASH_ATTENTION="false"
PRESET_KEEP_ALIVE="0"
-PRESET_GPU_OVERHEAD_BYTES="2147483648"
+PRESET_GPU_OVERHEAD_BYTES="4294967296"
;;
balanced)
PRESET_CONTEXT_LENGTH="8192"
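The old and new byte values in the `safe` preset are binary units: 2147483648 = 2 × 1024³ (2 GiB) and 4294967296 = 4 GiB. So the `safe` preset now reserves twice as much GPU overhead while halving the context from 4096 to 2048 tokens, which roughly halves the per-sequence KV-cache size. A minimal sketch for deriving such values instead of hard-coding them; `gib_to_bytes` is a hypothetical helper, not part of the script:

```shell
#!/usr/bin/env bash
# gib_to_bytes N: prints N GiB expressed in bytes (N * 1024^3),
# using bash's 64-bit integer arithmetic expansion.
gib_to_bytes() {
  printf '%d\n' "$(( $1 * 1024 * 1024 * 1024 ))"
}

# Equivalent to the hard-coded value in the updated safe preset.
PRESET_GPU_OVERHEAD_BYTES="$(gib_to_bytes 4)"
```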
@@ -202,6 +202,7 @@ set -euo pipefail
install -d /etc/systemd/system/ollama.service.d
cat >/etc/systemd/system/ollama.service.d/20-memory-tuning.conf <<EOF
[Service]
+LimitMEMLOCK=infinity
Environment=OLLAMA_CONTEXT_LENGTH=${CONTEXT_LENGTH}
Environment=OLLAMA_MAX_LOADED_MODELS=${MAX_LOADED_MODELS}
Environment=OLLAMA_NUM_PARALLEL=${NUM_PARALLEL}
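`LimitMEMLOCK=infinity` lifts the locked-memory limit (`RLIMIT_MEMLOCK`) for the service process, and the `Environment=` lines pass the chosen preset values to Ollama. To confirm what a rendered drop-in actually sets, a small parser can read the `Environment=` lines back. `dropin_env_value` is a hypothetical helper, not part of this script, and it only handles the unquoted one-variable-per-line `Environment=NAME=VALUE` form written by the heredoc above; systemd also accepts quoted and multi-assignment forms that this sketch ignores.

```shell
#!/usr/bin/env bash
# dropin_env_value FILE NAME
# Prints VALUE from the first "Environment=NAME=VALUE" line in FILE.
# Returns 1 if NAME is not set in the file.
dropin_env_value() {
  local file=$1 name=$2 line
  # "|| [ -n "$line" ]" also processes a final line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      "Environment=${name}="*)
        # Strip the literal "Environment=NAME=" prefix and print the rest.
        printf '%s\n' "${line#"Environment=${name}="}"
        return 0
        ;;
    esac
  done <"$file"
  return 1
}
```

For example, `dropin_env_value /etc/systemd/system/ollama.service.d/20-memory-tuning.conf OLLAMA_CONTEXT_LENGTH`. systemd only reads new drop-ins after `systemctl daemon-reload`, and the running service sees them only after a restart (`systemctl restart ollama`); `systemctl show ollama --property=Environment` then lists the effective variables.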