Update memory profile script and troubleshooting guide for multimodal models

2026-07-01 19:54:40 -04:00 · 2026-03-01 23:41:08 -05:00
parent d6e7798d20
commit f4caadcc5c
2 changed files with 6 additions and 2 deletions
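As a quick sanity check on the `safe` preset values touched by this commit, the sketch below converts whole GiB to bytes with shell arithmetic. The helper name `gib_to_bytes` is hypothetical and is not part of the script. It only illustrates that the GPU overhead reserve moves from 2 GiB to 4 GiB.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the commit): convert whole GiB to bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

old_overhead="$(gib_to_bytes 2)"   # previous safe-preset reserve
new_overhead="$(gib_to_bytes 4)"   # new safe-preset reserve

echo "old=${old_overhead} new=${new_overhead}"
```

Writing the reserve as a GiB count and converting it makes a mistyped digit in the long byte literal easier to spot during review.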
@@ -34,6 +34,7 @@ Typical log pattern:
 ### Why this happens

 Ollama can detect ROCm correctly but still OOM during model graph/KV allocation, often due to high context size or aggressive defaults.
+If logs show a multimodal architecture such as `qwen3vl` and stack frames under `multimodal.go`, even a small-parameter model can fail to load because of the memory reserved for the multimodal graph.

 ### Fix

@@ -59,6 +60,8 @@ Preset selection quick guide:
 | 20B to 32B | `balanced` | Best first choice for stability on large models. |
 | 30B+ with load failures/OOM | `safe` | Uses lower context, `KEEP_ALIVE=0`, and higher GPU overhead reserve. |

+When troubleshooting, test a text-only model (for example `qwen3:8b`) before trying VL/multimodal models.
+
 If a model fails to load, step down from `max` → `balanced` → `safe` before manual tuning.

 Manual method:
@@ -69,12 +69,12 @@ done

 case "$PRESET" in
  safe)
-    PRESET_CONTEXT_LENGTH="4096"
+    PRESET_CONTEXT_LENGTH="2048"
    PRESET_NUM_PARALLEL="1"
    PRESET_MAX_LOADED_MODELS="1"
    PRESET_FLASH_ATTENTION="false"
    PRESET_KEEP_ALIVE="0"
-    PRESET_GPU_OVERHEAD_BYTES="2147483648"
+    PRESET_GPU_OVERHEAD_BYTES="4294967296"
    ;;
  balanced)
    PRESET_CONTEXT_LENGTH="8192"
@@ -202,6 +202,7 @@ set -euo pipefail
 install -d /etc/systemd/system/ollama.service.d
 cat >/etc/systemd/system/ollama.service.d/20-memory-tuning.conf <<EOF
 [Service]
+LimitMEMLOCK=infinity
 Environment=OLLAMA_CONTEXT_LENGTH=${CONTEXT_LENGTH}
 Environment=OLLAMA_MAX_LOADED_MODELS=${MAX_LOADED_MODELS}
 Environment=OLLAMA_NUM_PARALLEL=${NUM_PARALLEL}