Last updated: 2024-05-07
Table of Contents
Introduction
Ollama is software for running well-known LLMs (e.g. Llama 3, Phi 3, Mistral, Gemma) locally.
HomePage
Memory requirement
* At least (when running inference on CPU)
Models  RAM (OS)
7B      8 GB
13B     16 GB
33B     32 GB
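A quick way to check the host's available memory against these figures (a minimal sketch):
free -h                                                                    # total / used / available memory
awk '/MemAvailable/ {printf "%.1f GB available\n", $2/1024/1024}' /proc/meminfo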
Notes
Llama 3 (8B)   File Size: 4.7GB   # ollama run llama3
Llama 3 (70B)  File Size: 40GB    # ollama run llama3:70b
Manual Install
V0.3.8  # 1.39 GB <- grew significantly starting from 0.3.7 (ollama binary along with required libraries)
V0.3.0  # 558 MB
Get binary file
mkdir -p /usr/src/ollama; cd /usr/src/ollama
V=v0.3.8
LINK=https://github.com/ollama/ollama/releases/download/$V/ollama-linux-amd64.tgz
wget $LINK -O ollama-linux-amd64-$V.tgz
tar -zxf ollama-linux-amd64-$V.tgz
mkdir /opt/ollama
mv bin lib /opt/ollama
chown ollama: /opt/ollama -R
ln -sf /opt/ollama/bin/ollama /usr/bin/ollama
ollama -v
Warning: could not connect to a running Ollama instance
Warning: client version is 0.3.8
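The chown step above and the systemd unit below assume an ollama system user/group; if it does not exist yet, create it before the chown. A minimal sketch (the home directory matches the default model location used later in these notes):
useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama
id ollama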
Startup service
/etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MAX_QUEUE=3"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=1"

[Install]
WantedBy=default.target
systemctl enable ollama --now
ps aux | grep ^ollama
# ~8xx MB when no model is loaded
ollama 347024 102 2.7 3225988 887676 ? Ssl 07:34 0:28 /usr/bin/ollama serve
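Besides ps, the HTTP API gives a quick health check (assuming the default port 11434; /api/version returns the server version as JSON):
curl http://localhost:11434/api/version
# e.g. {"version":"0.3.8"}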
CLI
Pull a model
# LLMs that Ollama can download: https://ollama.com/library
ollama pull llama3
# if no size tag (e.g. :7b) is given, the default size is downloaded
ollama pull codellama:7b
ollama pull codellama:13b-python
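To confirm which models have been downloaded locally:
ollama list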
Remove a model
ollama rm llama3
Copy a model
ollama cp llama3 my-model
Display currently loaded models
ollama ps  # when OLLAMA_KEEP_ALIVE is set to -1
NAME ID SIZE PROCESSOR UNTIL
gemma:latest a72c7f4d0a15 6.7 GB 100% CPU Forever
UNTIL indicates how long the model stays cached in RAM
PROCESSOR
"48%/52% CPU/GPU" means the model was loaded partially onto both the GPU and into system memory
* The entire model is traversed for each token, so splitting it across CPU and GPU slows inference down.
Multiple GPUs
If the model will entirely fit on any single GPU, Ollama will load the model on that GPU.
=> this reduces the amount of data transferred across the PCI bus during inference
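Assuming NVIDIA GPUs, one way to restrict which GPUs Ollama can see is the standard CUDA_VISIBLE_DEVICES variable; a sketch on top of the systemd unit above (the GPU index is only an example):
# systemctl edit ollama   (drop-in override)
[Service]
Environment="CUDA_VISIBLE_DEVICES=0"   # expose only GPU 0 to Ollama

systemctl daemon-reload; systemctl restart ollama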
Model memory usage
Model data is memory mapped and shows up in file cache
The Ollama processes have their own VSZ and RSS
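One way to see those mappings (a minimal sketch; the runner process name is taken from the ps output further below and may differ between versions):
PID=$(pgrep -f ollama_llama_server | head -1)
grep blobs /proc/$PID/maps | head -3   # the mmap'd sha256-... model blob files
free -h                                # the blob data shows up under buff/cache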
ollama show
Opts
- --modelfile
- --parameters
- --template
- --system
ollama show codellama:7b
  Model
    arch                 llama
    parameters           7B
    quantization         Q4_0
    context length       16384
    embedding length     4096

  Parameters
    rope_frequency_base  1e+06
    stop                 "[INST]"
    stop                 "[/INST]"
    stop                 "<<SYS>>"
    stop                 "<</SYS>>"

  License
    LLAMA 2 COMMUNITY LICENSE AGREEMENT
    Llama 2 Version Release Date: July 18, 2023
Find Model File Location
Default location for models: ~ollama/.ollama/models
It can be changed with the OLLAMA_MODELS environment variable
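A sketch of relocating the model store via a systemd drop-in (systemctl edit ollama; the path below is only an example and must be owned by the ollama user):
[Service]
Environment="OLLAMA_MODELS=/data/ollama/models"

mkdir -p /data/ollama/models; chown ollama: /data/ollama/models -R
systemctl daemon-reload; systemctl restart ollama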
ollama show --modelfile codellama:13b
FROM /usr/share/ollama/.ollama/models/blobs/sha256-...
Notes
Without a version tag, ":latest" is implied
phi3.5 = phi3.5:latest
tree /usr/share/ollama/.ollama/models/
/usr/share/ollama/.ollama/models/
├── blobs
│   ├── sha256-2e0493f67d0c8c9c68a8aeacdf6a38a2151cb3c4c1d42accf296e19810527988
│   ├── sha256-35e261fb2c733bb8c50e83e972781ec3ee589c83944aaac582e13ba679ac3592
│   ├── sha256-590d74a5569b8a20eb2a8b0aa869d1d1d3faf6a7fdda1955ae827073c7f502fc
│   ├── sha256-7f6a57943a88ef021326428676fe749d38e82448a858433f41dae5e05ac39963
│   ├── sha256-8c17c2ebb0ea011be9981cc3922db8ca8fa61e828c5d3f44cb6ae342bf80460b
│   └── sha256-e73cc17c718156e5ad34b119eb363e2c10389a503673f9c36144c42dfde8334c
└── manifests
    └── registry.ollama.ai
        └── library
            └── codellama
                └── 13b
# Inspect mediaType and digest
cat manifests/registry.ollama.ai/library/codellama/13b | jq
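The manifest is a small JSON document; assuming the usual layers/mediaType/digest structure, something like this lists just the layer types and digests:
jq '.layers[] | {mediaType, digest}' manifests/registry.ollama.ai/library/codellama/13b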
Run AI Bot
e.g.
ollama run llama3
ollama run codellama:7b
Multiline input
>>> """Hello, ... world! ... """
/x
- /? Help
- /set Set session variables (example below)
- /show Show model information
- /load <model> Load a session or model
- /save <model> Save your current session
- /clear Clear session context
- /bye Exit
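For example, /set can adjust session variables on the fly; a sketch assuming the standard Modelfile parameter num_ctx (context window size):
>>> /set parameter num_ctx 4096
>>> /show parameters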
/show info
Model details:
  Family              qwen2
  Parameter Size      8B
  Quantization Level  Q4_0
REST API
* ollama default api port: 11434/tcp
Change the port the CLI client uses to connect to the server
export OLLAMA_HOST="server:port"
Generate a response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
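By default the endpoint streams one JSON object per generated token; adding "stream": false (a standard request option) returns the whole answer as a single JSON object:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'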
Chat with a model
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'
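The messages array carries the whole conversation, so a follow-up turn includes the earlier assistant reply; a sketch of a second turn (the assistant content is abbreviated):
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" },
    { "role": "assistant", "content": "...previous answer..." },
    { "role": "user", "content": "and why is it red at sunset?" }
  ]
}'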
Allow listening on all local interfaces
/etc/systemd/system/ollama.service
[Service]
...
# Environment="OLLAMA_HOST=0.0.0.0:8080"
# Use the default port 11434
Environment="OLLAMA_HOST=0.0.0.0"
systemctl daemon-reload
systemctl restart ollama
Modify other settings
ollama.service
Environment="OLLAMA_MAX_QUEUE=3" Environment="OLLAMA_KEEP_ALIVE=-1"
systemctl daemon-reload
systemctl restart ollama
Explanation
OLLAMA_MAX_QUEUE
Default: 512
If too many requests are sent to the server,
it will respond with a 503 error indicating the server is overloaded.
OLLAMA_KEEP_ALIVE
By default models are kept in memory for 5 minutes before being unloaded.
Use the keep_alive parameter with the /api/generate or /api/chat API endpoints to control how long the model stays in memory.
e.g. 1h / 60m / 3600
# To preload a model and leave it in memory
URL=http://localhost:11434/api/generate
curl $URL -d '{"model": "llama3", "keep_alive": -1}'
# To unload a model from memory
curl $URL -d '{"model": "llama3", "keep_alive": 0}'
Checking
# The model is unloaded once "UNTIL" counts down to 0
NAME ID SIZE PROCESSOR UNTIL
gemma2:latest ff02c3702f32 9.0 GB 100% CPU 8 minutes from now
ollama ps  # when OLLAMA_KEEP_ALIVE is set to -1
NAME ID SIZE PROCESSOR UNTIL
gemma:latest a72c7f4d0a15 6.7 GB 100% CPU Forever
OLLAMA_MAX_LOADED_MODELS
The maximum number of models that can be loaded concurrently (ollama ps)
Default: 3 × the number of GPUs, or 3 for CPU inference
OLLAMA_NUM_PARALLEL
The maximum number of parallel requests each model will process at the same time.
The default will auto-select either 4 or 1 based on available memory.
Preload Model
# Preload "gemma2" via the CLI
ollama run gemma2 ""
NAME           ID            SIZE    PROCESSOR  UNTIL
gemma2:latest  ff02c3702f32  6.8 GB  100% CPU   Forever
Process
ollama 347024 15.5 2.7 3226308 888348 ? Ssl 07:34 0:28 /usr/bin/ollama serve
ollama 347093 21.5 20.6 7919628 6795216 ? Sl 07:37 0:07
/tmp/ollama109889705/runners/cpu_avx2/ollama_llama_server
--model /usr/share/ollama/.ollama/models/blobs/sha256-...
--ctx-size 2048 --batch-size 512 --embedding --log-disable --no-mmap --parallel 1 --port 46275
# Preload via the "generate" endpoint
curl http://localhost:11434/api/generate -d '{"model": "gemma2"}'
Settings
UNTIL Forever
Environment="OLLAMA_KEEP_ALIVE=-1"
Unload a model from memory (Freeing RAM / VRAM)
curl http://localhost:11434/api/generate \
-d '{"model": "gemma2", "keep_alive": 0}'
Nginx Proxy (https)
location /ollama-api/ {
    proxy_pass http://ollama:11434/;

    # Disable buffering for the streaming responses
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding off;
    proxy_buffering off;
    proxy_cache off;

    # Longer timeouts (1hr)
    keepalive_timeout 3600;
    proxy_read_timeout 3600;
    proxy_connect_timeout 3600;
    proxy_send_timeout 3600;
}
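A quick test through the proxy (the hostname is a placeholder; /api/tags lists the locally available models):
curl https://example.com/ollama-api/api/tags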