AI Basic Information

Last updated: 2024-08-29

Table of Contents

  • Capabilities - vision
  • Safetensors
  • Tokens
  • context
  • prompt
  • Quantization Methods
  • LLM Format: AWQ, GGUF, and GPTQ
  • Open Webui Settings

 


Capabilities - vision

 

# .jpg or .png files

Object detection - "tell me what you see in this picture? ./pic.jpg"

Text recognition - "what does the text say? ./wordart.png"
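
As a rough sketch of how an image prompt can be sent programmatically, the example below calls a locally running Ollama server with a vision-capable model. The model name "llava", the URL, and the image path are assumptions, not part of the original notes:

# Minimal sketch: send an image plus a question to a local Ollama server.
import base64
import requests

with open("./pic.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",                                  # any vision-capable model (assumed pulled)
        "prompt": "tell me what you see in this picture?",
        "images": [image_b64],                             # images are passed as base64 strings
        "stream": False,
    },
)
print(resp.json()["response"])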

 


Safetensors

 

Safetensors is a simple format for storing tensors safely (as opposed to pickle) while still being fast (zero-copy).
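
A minimal sketch of saving and loading tensors in this format, assuming PyTorch and the safetensors package are installed (pip install safetensors torch); the file and tensor names are just examples:

# Minimal sketch: write tensors to a .safetensors file and read them back.
import torch
from safetensors.torch import save_file, load_file

tensors = {"embedding": torch.zeros(10, 4), "lm_head": torch.zeros(4, 10)}
save_file(tensors, "model.safetensors")   # safe: no pickle involved
loaded = load_file("model.safetensors")   # fast: the format supports zero-copy mmap
print(loaded["embedding"].shape)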

 


Tokens

 

GPT models see text in the form of tokens.

A tokenizer splits a text string into a list of tokens.

Different models use different encodings.

e.g.

gpt-4, gpt-3.5-turbo: cl100k_base
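
A minimal sketch of tokenizing text with the cl100k_base encoding, assuming the tiktoken package is installed (pip install tiktoken):

# Minimal sketch: count and inspect tokens with cl100k_base.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by gpt-4 / gpt-3.5-turbo
tokens = enc.encode("GPT models see text in the form of tokens")
print(tokens)               # list of integer token IDs
print(len(tokens))          # how many tokens the string costs
print(enc.decode(tokens))   # back to the original string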

 


context

 

Set the context or topic for the discussion.

This helps the model understand the direction of the conversation.

 


prompt

 

The prompt is typically sent to ChatGPT at the beginning of a conversation or

when requesting a specific response from the model.
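
A minimal sketch of how a context-setting system message and a user prompt fit together in one chat request, assuming a locally running Ollama server with the llama3 model pulled (model name and question are examples):

# Minimal sketch: system message = context, user message = prompt.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [
            {"role": "system", "content": "You are an astronomy tutor. Keep answers short."},
            {"role": "user", "content": "Why is the sky blue?"},
        ],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])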

 


Quantization Methods

 

Speed is closely related to the model file size.

A smaller model file means faster inference, but usually lower accuracy.

Perplexity

The lower the perplexity, the better. "ppl" = "perplexity loss"

A 7B model's ppl is higher than a 13B model's ppl.

f16 is not quantized, so it has the best (lowest) ppl for a given model size.

After quantization the file size shrinks => ppl goes up.

7b-python-fp16 = 13G
13b-python-fp16 = 26G

fp16 = 16 bits = 2 bytes

13b = 13 billion parameters

RAM to load (weights only) = 13 billion parameters * 2 bytes = 26 billion bytes (~26 GB)
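
A rough back-of-the-envelope sketch of the same calculation in Python; the helper name is illustrative, the bits-per-weight figure for q4_K is an approximation, and real memory use is higher once the KV cache and activations are added:

# Rough sketch: RAM needed just to hold the weights, at various precisions.
def weight_ram_gb(params_billion, bits_per_weight):
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9   # decimal GB, matching the ~26 GB figure above

print(weight_ram_gb(13, 16))    # fp16 13B      -> ~26 GB
print(weight_ram_gb(7, 16))     # fp16 7B       -> ~14 GB
print(weight_ram_gb(13, 4.5))   # roughly q4_K  -> ~7.3 GB (approximation)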

K_X

k-quants

"K" stands for a specific type of quantization method used.

At the same file size, K-quantizations should be better than the other (non-K) quantization types.

Q[Number]: This indicates the bit depth of quantization.

For example

q5 means 5-bit quantization

q4 means 4-bit quantization

S / M / L = small / medium / large

13b-python-q4_K_M = 7.9G
13b-python-q4_K_S = 7.4G

_0 & _1

llama.cpp supports two basic quantization types:

"type-0" (Q4_0, Q5_0) - each weight is reconstructed from a block scale only: w = d * q
"type-1" (Q4_1, Q5_1) - each block also stores an offset: w = d * q + m

 


LLM Format: AWQ, GGUF, and GPTQ

 

AWQ, GGUF, and GPTQ are different quantization methods used for compressing
 and optimizing large language models (LLMs).
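
AWQ (Activation-aware Weight Quantization)

AWQ protects the weights that matter most (identified from activation statistics) so that they
 lose less precision during quantization. It is aimed mainly at fast GPU inference.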

GPTQ (post-training quantization for GPT-style models)

GPTQ stores weights at low precision and dynamically dequantizes them to float16 during
  inference, keeping memory usage low. This method is optimized primarily for GPU inference and performance.

GGUF (GPT-Generated Unified Format)

GGUF, the successor of GGML, is a file format for quantized models that allows running LLMs on the CPU,
  with the option to offload some layers to the GPU for a speed boost.
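
A minimal sketch of loading a GGUF file with partial GPU offload, assuming the llama-cpp-python package is installed; the model path and the number of offloaded layers are examples:

# Minimal sketch: run a GGUF model on the CPU and offload some layers to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./13b-python-q4_K_M.gguf",   # example path to a local GGUF file
    n_ctx=2048,                              # context length
    n_gpu_layers=20,                         # layers to offload to the GPU; 0 = CPU only
)
out = llm("Q: Why is the sky blue? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])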

 


Open Webui Settings

 

Temperature # Default: 0.8

This controls how creative the model is.

A higher temperature gives more random and creative answers,
while a lower temperature keeps things more focused and predictable.

Top K # Default: 40

This limits the model's choices when predicting the next word.

It only considers the top K most likely words.

Reduces the probability of generating nonsense.

A higher value (e.g. 100) will give more diverse answers,
while a lower value (e.g. 10) will be more conservative.

Top P # Default: 0.9

Works together with top-k.

Similar to Top K, but instead of using a fixed number, it uses a probability threshold.

It selects words until the cumulative probability reaches the set value (P).

A higher value (e.g., 0.95) will lead to more diverse text,
while a lower value (e.g., 0.5) will generate more focused and conservative text.

Min P # Default: 0.0

This sets a minimum probability for a word to be included in the output.

Words below this threshold are discarded.
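
The four sampling settings above (Temperature, Top K, Top P, Min P) can be pictured as successive filters on the next-token probability distribution. The sketch below is conceptual only (it is not Open WebUI's or llama.cpp's actual code); the function name is made up and NumPy is assumed to be installed:

# Conceptual sketch: how temperature, top_k, top_p and min_p reshape the distribution.
import numpy as np

def sample_next(logits, temperature=0.8, top_k=40, top_p=0.9, min_p=0.0):
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]          # most likely tokens first
    order = order[:top_k]                    # Top K: keep only the K best candidates
    keep = []
    cumulative = 0.0
    for idx in order:
        if probs[idx] < min_p:               # Min P: drop tokens below the probability floor
            break
        keep.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:              # Top P: stop once enough probability mass is covered
            break

    kept = probs[keep] / probs[keep].sum()   # renormalise over the surviving tokens
    return np.random.choice(keep, p=kept)

print(sample_next([2.0, 1.5, 0.3, -1.0, -2.0]))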

Frequency Penalty (repeat_penalty) # Default: 1.1

This penalizes tokens that have already appeared, discouraging the model from repeating itself.
 It helps make the text more diverse and interesting.

Repeat Last N (repeat_last_n) # Default: 64

This sets how far back (in tokens) the model looks when penalizing repetition.

0 = disabled, -1 = num_ctx (look back over the whole context)

Tfs Z (tfs_z) # Default: 1

Tail free sampling is used to reduce the impact of less probable tokens on the output.

A higher value (e.g., 2.0) will reduce the impact more; 1.0 disables this setting.

Context Length (num_ctx) # Default: 2048

This is the maximum amount of text the model can consider when generating its response.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'

Batch Size (num_batch)

This determines how many prompt tokens the model processes in parallel at once.

Tokens To Keep On Context Refresh (num_keep)

This controls how many tokens from the beginning of the prompt are kept when the context fills up and is refreshed.

Max Tokens (num_predict) # Default: 128

This limits the total number of tokens (words or parts of words) the model can generate in a single response.

seed # Default: 0

Sets the random number seed to use for generation.

Setting this to a specific number will make the model generate the same text for the same prompt.

stop

Sets the stop sequences to use.

When this pattern is encountered the LLM will stop generating text and return.

mirostat*

mirostat # Default: 0

Enable Mirostat sampling for controlling perplexity.
(0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)

mirostat_eta # Default: 0.1

Influences how quickly the algorithm responds to feedback from the generated text.

A lower learning rate will result in slower adjustments,
while a higher learning rate will make the algorithm more responsive.

mirostat_tau # Default: 5.0

Controls the balance between coherence and diversity of the output.
A lower value will result in more focused and coherent text.
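
To tie the settings above together, here is a minimal sketch of passing them as options to Ollama's generate API, assuming a local Ollama server with llama3 pulled (the specific values are just examples):

# Minimal sketch: the settings above map onto the "options" field of the API call.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {
            "temperature": 0.8,
            "top_k": 40,
            "top_p": 0.9,
            "min_p": 0.0,
            "repeat_penalty": 1.1,
            "repeat_last_n": 64,
            "num_ctx": 2048,
            "num_predict": 128,
            "seed": 42,          # fixed seed -> reproducible output for the same prompt
            "stop": ["\n\n"],    # example stop sequence
            "mirostat": 0,       # 1 or 2 to enable Mirostat sampling
        },
    },
)
print(resp.json()["response"])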

 

 

 
