Ollama

Last updated: 2024-05-07


Introduction

Ollama is software for running well-known LLMs (e.g. Llama 3, Phi 3, Mistral, Gemma) on your local machine

HomePage: https://ollama.com/

Memory requirement

 * Minimum (when running on CPU)

Model     RAM (system)
7B         8 GB
13B       16 GB
33B       32 GB

Notes

Llama 3(8B)      File Size: 4.7GB      # ollama run llama3
Llama 3(70B)     File Size: 40GB       # ollama run llama3:70b

 

 


Manual Install

 

V0.3.8 # 1.39 GB <- release size grew starting with 0.3.7 (ollama binary along with required libraries)

V0.3.0 # 558 MB

Get binary file

mkdir -p /usr/src/ollama; cd /usr/src/ollama

V=v0.3.8

LINK=https://github.com/ollama/ollama/releases/download/$V/ollama-linux-amd64.tgz

wget $LINK -O ollama-linux-amd64-$V.tgz

tar -zxf ollama-linux-amd64-$V.tgz

mkdir /opt/ollama

mv bin lib /opt/ollama
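
The chown below and the systemd unit assume a dedicated ollama user; if it does not exist yet, the official install docs create it like this:

useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama    # system account, no login shell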

chown ollama: /opt/ollama -R

ln -sf /opt/ollama/bin/ollama /usr/bin/ollama

ollama -v    # the connection warning below is expected; the service is not running yet

Warning: could not connect to a running Ollama instance
Warning: client version is 0.3.8

Startup service

/etc/systemd/system/ollama.service

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MAX_QUEUE=3"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=1"

[Install]
WantedBy=default.target

systemctl enable ollama --now

ps aux | grep ^ollama

# Uses ~8xx MB when no model is loaded

ollama    347024  102  2.7 3225988 887676 ?      Ssl  07:34   0:28 /usr/bin/ollama serve

 


CLI

 

Pull a model

# LLMs available for download: https://ollama.com/library

ollama pull llama3

# Without a size tag (e.g. :7b), the default size is downloaded

ollama pull codellama:7b

ollama pull codellama:13b-python
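
To check what has been downloaded, ollama list shows every local model with its tag and size:

ollama list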

Remove a model

ollama rm llama3

Copy a model

ollama cp llama3 my-model

Display currently loaded models

ollama ps    # when OLLAMA_KEEP_ALIVE is set to -1

NAME            ID              SIZE    PROCESSOR       UNTIL
gemma:latest    a72c7f4d0a15    6.7 GB  100% CPU        Forever

UNTIL shows how long the model will stay cached in RAM

PROCESSOR

"48%/52% CPU/GPU" means the model was loaded partially onto both the GPU and into system memory

 * The entire model is traversed for each token, so splitting it between GPU and system memory slows inference down.

Multiple GPUs

If the model will entirely fit on any single GPU, Ollama will load the model on that GPU.

 => this reduces the amount of data transferred across the PCI bus during inference
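
To restrict Ollama to specific GPUs, the standard CUDA_VISIBLE_DEVICES variable can be added to the [Service] section of ollama.service (the GPU IDs here are illustrative):

Environment="CUDA_VISIBLE_DEVICES=0,1"    # expose only GPUs 0 and 1 to Ollama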

Model memory usage

Model data is memory-mapped and shows up in the file cache

The Ollama processes still have their own VSZ and RSS
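
A quick way to observe this split (standard Linux tools; output varies):

free -m                  # mmap'd model blobs count toward "buff/cache"
ps aux | grep ollama     # VSZ / RSS of the Ollama processes themselves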

ollama show

Opts

  • --modelfile
  • --parameters
  • --template
  • --system

ollama show codellama:7b
  Model
        arch                    llama
        parameters              7B
        quantization            Q4_0
        context length          16384
        embedding length        4096

  Parameters
        rope_frequency_base     1e+06
        stop                    "[INST]"
        stop                    "[/INST]"
        stop                    "<<SYS>>"
        stop                    "<</SYS>>"

  License
        LLAMA 2 COMMUNITY LICENSE AGREEMENT
        Llama 2 Version Release Date: July 18, 2023

Find Model File Location

Default model location: ~ollama/.ollama/models

It can be changed with the OLLAMA_MODELS environment variable
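
e.g. in the [Service] section of ollama.service, to keep models on a bigger disk (the path is illustrative; it must be writable by the ollama user):

Environment="OLLAMA_MODELS=/data/ollama/models"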

ollama show --modelfile codellama:13b

FROM /usr/share/ollama/.ollama/models/blobs/sha256-...

Notes

Without an explicit tag, ":latest" is implied

phi3.5 = phi3.5:latest

tree /usr/share/ollama/.ollama/models/

/usr/share/ollama/.ollama/models/
├── blobs
│   ├── sha256-2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988
│   ├── sha256-35e261fb2c733bb8c50e83e972781ec3ee589c83944aaac582e13ba679ac3592
│   ├── sha256-590d74a5569b8a20eb2a8b0aa869d1d1d3faf6a7fdda1955ae827073c7f502fc
│   ├── sha256-7f6a57943a88ef021326428676fe749d38e82448a858433f41dae5e05ac39963
│   ├── sha256-8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b
│   └── sha256-e73cc17c718156e5ad34b119eb363e2c10389a503673f9c36144c42dfde8334c
└── manifests
    └── registry.ollama.ai
        └── library
            └── codellama
                └── 13b

# Inspect mediaType and digest

cat manifests/registry.ollama.ai/library/codellama/13b | jq
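
For example, assuming the usual OCI-style manifest layout, this prints just each layer's type and digest:

cat manifests/registry.ollama.ai/library/codellama/13b | jq '.layers[] | {mediaType, digest}'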

Run AI Bot

e.g.

ollama run llama3

ollama run codellama:7b

Multiline input

>>> """Hello,
... world!
... """

Slash commands

  • /?             Help
  • /set           Set session variables (example below)
  • /show          Show model information
  • /load <model>  Load a session or model
  • /save <model>  Save your current session
  • /clear         Clear session context
  • /bye           Exit
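
For example, /set parameter changes sampling options for the current session only:

>>> /set parameter temperature 0.2
>>> /set parameter num_ctx 8192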

/show info

Model details:
Family              qwen2
Parameter Size      8B
Quantization Level  Q4_0

 


REST API

 

 * Ollama's default API port: 11434/tcp

Change the server address/port the CLI client connects to

export OLLAMA_HOST="server:port"
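
e.g. pointing the local CLI at a remote instance (the hostname is illustrative):

export OLLAMA_HOST="gpu-box:11434"
ollama ps    # now lists models on the remote server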

Generate a response

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt":"Why is the sky blue?"
}'

Chat with a model

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'
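
Both endpoints stream one JSON object per token by default; adding "stream": false returns a single JSON body instead, which is easier to post-process with jq:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ],
  "stream": false
}' | jq -r '.message.content'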

Allow listening on all local interfaces

/etc/systemd/system/ollama.service

[Service]
...
# Environment="OLLAMA_HOST=0.0.0.0:8080"
# Omitting the port uses the default 11434
Environment="OLLAMA_HOST=0.0.0.0"

systemctl daemon-reload

systemctl restart ollama

 


Other Settings

 

ollama.service

Environment="OLLAMA_MAX_QUEUE=3"
Environment="OLLAMA_KEEP_ALIVE=-1"

systemctl daemon-reload

systemctl restart ollama

Details

OLLAMA_MAX_QUEUE

Default: 512

If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded.

OLLAMA_KEEP_ALIVE

By default models are kept in memory for 5 minutes before being unloaded.

Use the keep_alive parameter with the /api/generate and /api/chat API endpoints to control how long the model stays in memory.

e.g. "1h" / "60m" / 3600 (seconds)

# To preload a model and leave it in memory

URL=http://localhost:11434/api/generate

curl $URL  -d '{"model": "llama3", "keep_alive": -1}'

# To unload a model from memory

curl $URL -d '{"model": "llama3", "keep_alive": 0}'

Checking

# 當 "UNTIL" 倒數到 0 後 Model 就會被 unload

NAME            ID              SIZE    PROCESSOR       UNTIL
gemma2:latest   ff02c3702f32    9.0 GB  100% CPU        8 minutes from now

ollama ps    # when OLLAMA_KEEP_ALIVE=-1

NAME            ID              SIZE    PROCESSOR       UNTIL
gemma:latest    a72c7f4d0a15    6.7 GB  100% CPU        Forever

OLLAMA_MAX_LOADED_MODELS

The maximum number of models that can be loaded concurrently (ollama ps)

Default: 3 × the number of GPUs, or 3 for CPU inference

OLLAMA_NUM_PARALLEL

The maximum number of parallel requests each model will process at the same time.

The default will auto-select either 4 or 1 based on available memory.

 


Preload Model

 

# Preload "gemma2" By CLI

ollama run gemma2 ""

ollama ps

NAME            ID              SIZE    PROCESSOR       UNTIL
gemma2:latest   ff02c3702f32    6.8 GB  100% CPU        Forever

Process

ollama    347024 15.5  2.7 3226308 888348 ?      Ssl  07:34   0:28 /usr/bin/ollama serve
ollama    347093 21.5 20.6 7919628 6795216 ?     Sl   07:37   0:07 
  /tmp/ollama109889705/runners/cpu_avx2/ollama_llama_server 
  --model /usr/share/ollama/.ollama/models/blobs/sha256-...
  --ctx-size 2048 --batch-size 512 --embedding --log-disable --no-mmap --parallel 1 --port 46275

# Preload via the "generate" endpoint

curl http://localhost:11434/api/generate -d '{"model": "gemma2"}'

Settings

For "UNTIL" to show Forever:

Environment="OLLAMA_KEEP_ALIVE=-1"

Unload a model from memory (Freeing RAM / VRAM)

curl http://localhost:11434/api/generate \
  -d '{"model": "gemma2", "keep_alive": 0}'

 


Nginx Proxy (HTTPS)

 

location /ollama-api/ {
    proxy_pass http://ollama:11434/;

    # Disable buffering for the streaming responses
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding off;
    proxy_buffering off;
    proxy_cache off;
    
    # Longer timeouts (1hr)
    keepalive_timeout 3600;
    proxy_read_timeout 3600;
    proxy_connect_timeout 3600;
    proxy_send_timeout 3600;
}
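
Through this proxy the API is then reachable under the /ollama-api/ prefix (the hostname is illustrative):

curl https://example.com/ollama-api/api/generate \
  -d '{"model": "llama3", "prompt": "hi", "stream": false}'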

 
