Last updated: 2024-08-29
Table of Contents
- Capabilities
- Safetensors
- Tokens
- context
- prompt
- Quantization Methods
- LLM Format: AWQ, GGUF, and GPTQ
- LLM Parameters (Open Webui Settings)
Capabilities - vision
# .jpg or .png files
Object detection - "tell me what you see in this picture. ./pic.jpg"
Text recognition - "what does the text say? ./wordart.png"
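A minimal sketch of sending an image to a vision-capable model through the Ollama API (assumes Ollama is running locally and a multimodal model such as llava is pulled; the model name and file path are only placeholders):

# Minimal sketch: ask a vision model about a local image via the Ollama API.
import base64, json, urllib.request

with open("./pic.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llava",                       # placeholder model name
    "prompt": "tell me what you see in this picture",
    "images": [image_b64],                  # Ollama accepts base64-encoded images
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])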
Safetensors
This repository implements a new simple format for storing tensors safely (as opposed to pickle)
and that is still fast (zero-copy).
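A minimal sketch of saving and loading tensors with the safetensors library (assumes torch and safetensors are installed):

# Minimal sketch: save and load tensors with safetensors instead of pickle.
import torch
from safetensors.torch import save_file, load_file

tensors = {"embedding": torch.zeros(2, 4), "attention": torch.ones(4, 4)}
save_file(tensors, "model.safetensors")     # plain tensor data + header, no pickle involved

loaded = load_file("model.safetensors")     # loads back into a dict of tensors
print(loaded["embedding"].shape)            # torch.Size([2, 4])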
Tokens
GPT models see text in the form of tokens
a tokenizer can split the text string into a list of tokens
Different models use different encodings.
e.g.
gpt-4, gpt-3.5-turbo: cl100k_base
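A minimal sketch of tokenizing a string with the tiktoken library (assumes tiktoken is installed):

# Minimal sketch: encode text with the cl100k_base encoding used by gpt-4 / gpt-3.5-turbo.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Why is the sky blue?")
print(tokens)              # list of integer token ids
print(len(tokens))         # token count
print(enc.decode(tokens))  # back to the original string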
context
set the context or topic for the discussion
This helps the model understand the direction of the conversation.
prompt
The prompt is typically sent to ChatGPT at the beginning of a conversation or
when requesting a specific response from the model.
Quantization Methods
Speed will be closely related to the model file size.
Smaller model file, faster inference, usually lower accuracy.
Perplexity
The smaller the perplexity, the better. "ppl" = "perplexity loss"
A 7b model's ppl > a 13b model's ppl
f16 is not quantized, so it is the best in its class.
After quantization the file size gets smaller => ppl goes up.
7b-python-fp16 = 13G
13b-python-fp16 = 26G
fp16 = 16 bits = 2 bytes
13b = 13 billion parameters
RAM to load (size) = 13 billion * 2 bytes = 26 billion bytes (~26GB)
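A quick sanity check of that rule of thumb (weights only; it ignores KV cache and runtime overhead):

# Rough memory estimate: parameter count * bytes per parameter.
params = 13e9            # 13b model
bytes_per_param = 2      # fp16 = 16 bits = 2 bytes
print(params * bytes_per_param / 1e9)    # ~26.0 GB just for the weights
# The same model at 4-bit quantization: 13e9 * 0.5 / 1e9 ≈ 6.5 GB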
K_X
k-quants
"K" stands for a specific type of quantization method used.
K-quantizations should be better, at the same file size, then the other ones.
Q[Number]: This indicates the bit depth of quantization.
For example
q5 means 5-bit quantization
q4 means 4-bit quantization
S / M / L = small / medium / large (relative size within the same bit depth)
13b-python-q4_K_M = 7.9G
13b-python-q4_K_S = 7.4G
_0 & _1
llama.cpp supports two quantization types:
"type-0" (Q4_0, Q5_0)
"type-1" (Q4_1, Q5_1)
LLM Format: AWQ, GGUF, and GPTQ
AWQ, GGUF, and GPTQ are different methods of quantization used for compressing
and optimizing large language models (LLMs)
GPTQ (post-training quantization for GPT-style models)
GPTQ dynamically dequantizes its weights to float16 to improve performance
while keeping memory usage low. This method is optimized primarily for GPU inference and performance.
GGUF (GPT-Generated Unified Format)
previously known as GGML, is a quantization method that allows for running LLMs on the CPU,
with the option to offload some layers to the GPU for a speed boost.
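AWQ (Activation-aware Weight Quantization)
Quantizes weights while protecting the small fraction of them that matter most for the activations, which preserves most of the accuracy; like GPTQ, it is aimed mainly at fast GPU inference.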
Open Webui Settings
Temperature # Default: 0.8
This controls how creative the model is.
A higher temperature gives more creative and varied answers, while a lower temperature keeps things more focused and predictable.
Top K # Default: 40
This limits the model's choices when predicting the next word.
It only considers the top K most likely words.
Reduces the probability of generating nonsense.
A higher value (e.g. 100) will give more diverse answers,
while a lower value (e.g. 10) will be more conservative.
Top P # Default: 0.9
Works together with top-k.
Similar to Top K, but instead of using a fixed number, it uses a probability threshold.
It selects words until the cumulative probability reaches the set value (P).
A higher value (e.g., 0.95) will lead to more diverse text,
while a lower value (e.g., 0.5) will generate more focused and conservative text.
Min P # Default: 0.0
This sets a minimum probability (relative to the most likely token) for a word to be considered.
Words below this threshold are discarded.
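A simplified sketch of how Top K, Top P and Min P filter the candidate words, using made-up probabilities (illustration only, not any runtime's actual implementation):

# Simplified illustration of top-k / top-p / min-p filtering on a toy distribution.
probs = {"blue": 0.50, "clear": 0.20, "cloudy": 0.15, "falling": 0.10, "purple": 0.05}

top_k, top_p, min_p = 3, 0.9, 0.05
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

kept, cumulative = [], 0.0
for word, p in ranked[:top_k]:                 # Top K: only the K most likely words
    if p < min_p * ranked[0][1]:               # Min P: drop words far below the best one
        continue
    if cumulative >= top_p:                    # Top P: stop once cumulative prob reaches P
        break
    kept.append(word)
    cumulative += p

print(kept)   # ['blue', 'clear', 'cloudy'] -> the model samples the next word from these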
Frequency Penalty (repeat_penalty) # Default: 1.1
This penalizes tokens that have already appeared, discouraging the model from repeating itself.
It helps make the text more diverse and interesting.
Repeat Last N (repeat_last_n) # Default: 64
This sets how far back (in tokens) the model looks when applying the repeat penalty, so the same words or phrases are not repeated too many times in a row.
0 = disabled, -1 = num_ctx
Tfs Z # Default: 1
Tail free sampling
To reduce the impact of less probable tokens from the output
A higher value (e.g., 2.0) will reduce the impact more. 1.0 = disabled
Context Length (num_ctx) # Default: 2048
This is the maximum amount of text the model can consider when generating its response.
curl http://localhost:11434/api/generate -d '{ "model": "llama3", "prompt": "Why is the sky blue?", "options": { "num_ctx": 4096 } }'
Batch Size (num_batch)
This determines how many tokens the model processes in parallel (mainly affects prompt processing speed and memory use).
Tokens To Keep On Context Refresh (num_keep)
This controls how much of the previous conversation is retained when the context is refreshed.
Max Tokens (num_predict) # Default: 128
This limits the total number of tokens (words or parts of words) the model can generate in a single response.
seed # Default: 0
Sets the random number seed to use for generation.
Setting this to a specific number will make the model generate the same text for the same prompt.
stop
Sets the stop sequences to use.
When this pattern is encountered the LLM will stop generating text and return.
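A minimal sketch of setting several of these options on one request through the Ollama API (model name and values are just examples):

# Minimal sketch: one Ollama /api/generate call combining several of the options above.
import json, urllib.request

payload = {
    "model": "llama3",                     # example model
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {
        "temperature": 0.8,
        "top_k": 40,
        "top_p": 0.9,
        "num_predict": 128,                # max tokens to generate
        "seed": 42,                        # same seed + same prompt -> same output
        "stop": ["\n\n"],                  # stop generating at the first blank line
    },
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])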
mirostat*
mirostat # Default: 0
Enable Mirostat sampling for controlling perplexity.
(0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
mirostat_eta # Default: 0.1
Influences how quickly the algorithm responds to feedback from the generated text.
A lower learning rate will result in slower adjustments,
while a higher learning rate will make the algorithm more responsive.
mirostat_tau # Default: 5.0
Controls the balance between coherence and diversity of the output.
A lower value will result in more focused and coherent text.
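A minimal sketch of enabling Mirostat on a request (the values shown are just the defaults listed above):

# Minimal sketch: enable Mirostat 2.0 sampling via Ollama options.
options = {
    "mirostat": 2,          # 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0
    "mirostat_eta": 0.1,    # learning rate: how quickly it reacts to feedback
    "mirostat_tau": 5.0,    # balance between coherence and diversity
}
# Pass this dict as the "options" field of an /api/generate request, as in the example above.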