Last updated: 2024-06-02
Table of Contents
- Installation
- Anonymous Statistics
- Upgrade
- Usage
- Configure
- Database
- Nginx
- Apache Reverse Proxy Settings
- Netdata registry
- Web Setting
- Plugins
- Disable IPv6 Metrics
- edit-config
- Health monitoring
  - Stopping notifications for individual alarms
- Central Netdata server (streaming)
  - statsd
- Database Queries
- Export and import a snapshot
- Performance Tuning
- Netdata's Own Info
- Monitoring Kernel Memory de-duplication performance
- Custom Page
- V1.37
Introduction
A real-time performance and health monitoring solution
Designed to:
- Solve the centralization problem of monitoring (Scales to infinity)
- Replace the console for performance troubleshooting
HomePage: https://www.netdata.cloud/
Features
- 1s granularity
- Zero disk I/O (by default all data is kept in RAM)
Port:
- 19999/tcp # Web panel
- 8125/TCP, 8125/UDP # statsd
How it works:
Query | Collect -> Store -> Stream -> Archive | Check
positive / negative values
- positive values: read, input, inbound, received
- negative values: write, output, outbound, sent
monitoring agent
- metrics collector
- time-series database
- metrics visualizer
- alarms notification engine
Security Design
- Netdata daemon runs as a normal system user
- plugins perform a hard coded data collection job
- communication from plugins & Netdata slaves is unidirectional: from the plugin towards the Netdata daemon
- dashboards are read-only
- data do not leave the server where they are collected
- Netdata servers do not talk to each other
- your browser connects to all the Netdata servers
Installation
There are 4 ways to install Netdata
- Linux distribution Binary Packages
- Linux 64bit pre-built static binary
- Run Netdata in a Docker container
- Install Netdata on Linux manually
Method 1: Binary Packages (Linux Distribution)
# U22 (netdata v1.33)
# Rocky8 (netdata v1.36.1)
dnf install netdata
systemctl enable netdata --now
Checking
netstat -ntlp | grep netdata
tcp 0 0 127.0.0.1:8125 0.0.0.0:* LISTEN 22515/netdata
tcp 0 0 0.0.0.0:19999 0.0.0.0:* LISTEN 22515/netdata
Method 2: Static Binary
mkdir /usr/src/netdata
cd /usr/src/netdata
# Check which versions are available
https://github.com/netdata/netdata/releases
- V=v1.46.3 #
- V=v1.45.6 # 2024-06-05
- V=v1.45.5 # 2024-05-21
LINK=https://github.com/netdata/netdata/releases/download/$V/netdata-latest.gz.run
wget $LINK -O netdata.${V}.gz.run
bash netdata.${V}.gz.run
Remark
- Installs into /opt/netdata
- Creates the user account 'netdata'
How to check Netdata version in UI
# V1.4.X
Go to the Nodes tab (at the top of the page), and click on the Info icon ("i")
Anonymous Statistics
Starting with v1.12 Netdata also collects anonymous statistics on certain events
rpm -ql netdata | grep statistics
/usr/libexec/netdata/plugins.d/anonymous-statistics.sh
Code
if [ -f "/etc/netdata/.opt-out-from-anonymous-statistics" ] || ...
    exit 0
fi
To opt-out from sending anonymous statistics
touch /etc/netdata/.opt-out-from-anonymous-statistics
log
grep anonymous /var/log/netdata/error.log
... netdata INFO : MAIN : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'OK' '-'
Upgrade
chmod 700 ./netdata-latest.gz.run
# The process stops and starts netdata automatically
./netdata-latest.gz.run --accept
Usage
Web: http://your.server.ip:19999/
Zoom the current chart: SHIFT + mouse wheel over a chart
Highlight a time-frame: ALT + select an area on a chart
Auto-detection of data collection sources
This auto-detection process happens only once, when Netdata starts.
Exceptions: Containers and VMs are auto-detected forever
Configure
Get running config:
http://127.0.0.1:19999/netdata.conf
Config File Location:
/opt/netdata/etc/netdata/netdata.conf
Get the current config file (all in one file)
wget -O netdata.conf http://localhost:19999/netdata.conf
RAM & CPU Usage
[global]
    # Enable KSM to halve Netdata's memory requirement
    history = 3600
    update every = 1
    process scheduling policy = idle
Notes
update every: the data collection interval, in seconds. Default: 1
Memory modes (DB)
- map # data are in memory mapped files (swap)
- dbengine(Default)
- ram # data are purely in memory. Data are never saved on disk.
- save # data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.
- none # without a database (collected metrics can only be streamed to another Netdata)
[global]
    memory mode = dbengine
map
Data are in memory mapped files. This works like swap.
(constant write on your disk, does not support KSM)
For each chart, Netdata maps the following files:
- chart/main.db # chart information. Every time data are collected for a chart, this is updated.
- chart/dimension_name.db # round robin database
dbengine
The data are in database files.
Files
# Debian: /var/cache/netdata/dbengine
ls /opt/netdata/var/cache/netdata/dbengine
datafile-1-0000000001.ndf
journalfile-1-0000000001.njf
datafile-1-0000000002.ndf
journalfile-1-0000000002.njf <- the higher the number, the newer the file
...
Size of each datafile is determined automatically by Netdata. (4MB ~ 512MB)
Netdata will decide a datafile size trying to maintain about 50 datafiles for the whole database
----
njf = journal file v1
holds information about the transactions in its datafile (4KB)
----
There is some amount of RAM dedicated to data caching and indexing
# Unit: MiB
page cache size = 32
The number of history entries is not fixed (depends on the configured disk space)
"dbengine" is the only mode that supports changing "update_every" without losing the previously stored metrics
"history" configuration option is meaningless for "memory mode = dbengine"
Suggest to use this mode on nodes that also run other applications
Database Engine uses direct I/O to avoid polluting the OS filesystem caches
# Unit: MiB
dbengine disk space = 256
----
The DB engine stores chart metric values in 4k pages in memory.
Each chart dimension gets its own page to store consecutive values generated from the data collectors.
When those pages fill up they are slowly compressed and flushed to disk.
=> i.e. a page is flushed roughly every 17 minutes
# Each dimension's page is 4 KiB and each record is 4 bytes, so at one sample per second the page is full after 1024 seconds
4096 / 4 = 1024 sec (dimension: 1s)
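The page-fill arithmetic above can be sketched as a small Python helper. This is only an illustration of the calculation in these notes; the function name and the `update_every` parameter are made up for the example, and the 4 KiB page and 4-byte value sizes are the figures quoted above.

```python
PAGE_SIZE_BYTES = 4096   # one page per chart dimension (4 KiB, from the notes)
VALUE_SIZE_BYTES = 4     # bytes per stored value (from the notes)

def seconds_until_page_full(update_every: int = 1) -> int:
    """Seconds of collected data that fit in one page at the given interval."""
    values_per_page = PAGE_SIZE_BYTES // VALUE_SIZE_BYTES
    return values_per_page * update_every

full_after = seconds_until_page_full(1)
print(full_after, "seconds, about", round(full_after / 60, 1), "minutes")
```

With a longer collection interval the same page lasts proportionally longer before it is flushed.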
When the disk quota is exceeded the oldest values are removed from the DB engine in real time
* When we query the DB engine for data
=> trigger disk read I/O requests that fill the Page Cache with the requested pages
* The Database Engine uses direct I/O to avoid polluting the OS filesystem caches.
----
Config
/opt/netdata/etc/netdata/netdata.conf
[global]
    memory mode = dbengine
    # Unit: MiB
    page cache size = 32
    dbengine disk space = 256
* There is one DB engine instance per Netdata host/node
* All DB engine instances, for localhost and all other streaming recipient nodes inherit their configuration from netdata.conf
* There are explicit memory requirements per DB engine instance
File descriptor
The Database Engine may keep a significant amount of files open per instance
(at least 50 file descriptors available per dbengine instance)
systemctl edit netdata
[Service]
LimitNOFILE=65536
Remark: /etc/sysctl.conf: "fs.file-max = 65536"
ram
data are purely in memory. Data are never saved on disk. (Supports KSM)
save (the default in older versions, before dbengine became the default)
Data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.
It also uses mmap() and supports KSM.
Performance
Disable data collection
disable data collection plugins that you don't need => Save both CPU and RAM
e.g.
The metrics and charts counts shown at the bottom right of the web panel will go down
Every second, Netdata collects 1,560 metrics on SERVER, presents them in 320 charts ...
OOM Score
[global]
    OOM score = 1000
A higher score means Netdata will be the first to be killed when your server runs out of memory.
Checking
cat /proc/$(pidof netdata)/oom_score
Scheduling Policy
[global]
    process scheduling policy = idle
By default Netdata runs with the idle process scheduling policy,
so that it uses CPU resources, only when there is idle CPU to spare.
Database
Ram Usage (for DB)
The default history is 3600 entries,
If the data collection frequency is set to 1 second, you will have just one hour of data.
[global]
    history = 3600
It will need 14.4KB for each chart dimension (4 bytes for the value * the entries of its history(3600))
If you need 1000 dimensions, they will occupy just 14.4MB.
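As a quick check of the figures above, the RAM needed for the round robin database can be computed directly. This is a back-of-the-envelope sketch; the helper name is hypothetical and it only models the 4 bytes per value quoted in these notes, not any per-chart overhead.

```python
BYTES_PER_VALUE = 4  # from the notes: 4 bytes for each stored value

def history_ram_bytes(dimensions: int, history: int = 3600) -> int:
    """RAM used by the stored values of `dimensions` chart dimensions."""
    return dimensions * history * BYTES_PER_VALUE

print(history_ram_bytes(1) / 1000, "KB per dimension")
print(history_ram_bytes(1000) / 1_000_000, "MB for 1000 dimensions")
```

Doubling `history` doubles both the retained time span and the RAM used.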
Netdata offers all its round robin database to kernel for deduplication
KSM is a solution that will provide 60+% memory savings to Netdata.
# by default 0; 1 for the kernel to spawn ksmd
echo 1 > /sys/kernel/mm/ksm/run
Tiers
storage tiers = 3
update every = 1
dbengine tier 1 update every iterations = 60
dbengine tier 2 update every iterations = 60
i.e.
If a metric is collected per second(update every) in Tier 0,
then we will have a data point every minute in tier 1 and every hour in tier 2
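The relationship between the tiers can be sketched as follows: each tier stores one point per N points of the tier below it. The function name is made up for illustration; the inputs correspond to "update every" and the "update every iterations" values shown above.

```python
def tier_granularity(update_every: int, iterations: list[int]) -> list[int]:
    """Seconds between stored points for tier 0, tier 1, tier 2, ..."""
    seconds = [update_every]
    for n in iterations:
        # each higher tier aggregates n points of the tier below it
        seconds.append(seconds[-1] * n)
    return seconds

print(tier_granularity(1, [60, 60]))
```

With the values above this gives one point per second, per minute and per hour respectively.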
Retention
The general rule is that Netdata needs about 1 byte per data point on disk for tier 0,
and 4 bytes per data point on disk for tier 1 and above.
* dbengine disk space MB (deprecated)
dbengine multihost disk space MB = 256
dbengine tier 1 update every iterations = 60
dbengine tier 2 multihost disk space MB = 64
cache = /var/cache/netdata
- dbengine/
- dbengine-tier1/
- dbengine-tier2/
1000 metrics/second
- 3 days
- 22 days
- 2 years
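The retention figures above can be roughly reproduced from the "bytes per data point" rule. The sketch below is an estimate only: the helper name is hypothetical, and the 128 MiB used for tier 1 is an assumption, since the config shown in these notes does not list a tier 1 disk space value.

```python
MIB = 1024 * 1024

def retention_days(disk_mib: float, bytes_per_point: int,
                   metrics: int, seconds_per_point: int) -> float:
    """Estimated days of data that fit in `disk_mib` for one tier."""
    points_per_metric = disk_mib * MIB / bytes_per_point / metrics
    return points_per_metric * seconds_per_point / 86400

tier0 = retention_days(256, 1, 1000, 1)     # 1 byte/point, one point per second
tier1 = retention_days(128, 4, 1000, 60)    # assumed 128 MiB, one point per minute
tier2 = retention_days(64, 4, 1000, 3600)   # one point per hour
print(round(tier0, 1), "days,", round(tier1, 1), "days,", round(tier2 / 365, 1), "years")
```

Under these assumptions the result is close to the 3 days / 22 days / 2 years listed above; real retention also depends on compression and metric churn.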
Cache
dbengine page cache size MB = 32
dbengine tier 2 page cache size MB = 8
dbengine tier 1 page cache size MB = 16
Memory for concurrently collected metrics
DBENGINE memory in KiB
METRICS x (TIERS - 1) x 4KiB x 2 + "dbengine page cache size MB"
* TIERS By default 3 ( -1 when using 3+ tiers)
i.e.
# 3 storage tiers & 2k metrics
2000 x (3 - 1) x 4 x 2 / 1024 MiB ~ 31 MiB
dbengine page cache size MB = 32 MiB (Default)
Total Netdata memory in MiB = "Metric cardinality factor" x "DBENGINE memory" + "dbengine page cache"
The cardinality factor is usually between 3 or 4 and depends mainly on the ephemerality of the collected metrics.
The more ephemeral the infrastructure, the higher the factor.
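The DBENGINE memory formula can be applied in a few lines of Python. This is a sketch that follows the formula exactly as written in these notes, including the (TIERS - 1) term; the function name and defaults are illustrative, not part of Netdata.

```python
def dbengine_memory_kib(metrics: int, tiers: int = 3,
                        page_cache_mib: int = 32) -> int:
    """METRICS x (TIERS - 1) x 4 KiB x 2 + page cache, in KiB."""
    collection_kib = metrics * (tiers - 1) * 4 * 2
    return collection_kib + page_cache_mib * 1024

total_kib = dbengine_memory_kib(2000)
print(total_kib, "KiB =", round(total_kib / 1024, 2), "MiB")
```

For 2000 metrics and 3 tiers the collection part is 32000 KiB; the default 32 MiB page cache is added on top of that.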
nginx
IP Level ACL
* best and the suggested way to protect Netdata
=> Expose Netdata only in a private LAN => IP Level
[web]
    bind to = 10.1.1.1:19999 localhost:19999
Login with username & password
Use web server to provide authentication (in front of all your Netdata servers)
Web Server Setting (nginx)
Nginx to forward requests to netdata
HTTP auth file: /etc/nginx/netdata.users
URL: https://your-server/netdata/
# Running netdata as a subfolder to an existing virtual host
server {
    ...
    include /etc/nginx/templates/netdata.tmpl;
}
netdata.tmpl
location = /status {
    return 301 /status/;
}
location ~ /status/(?<ndpath>.*) {
    #access_log /var/log/nginx/netdata.log;
    #error_log /var/log/nginx/netdata.log;
    proxy_redirect off;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_http_version 1.1;
    proxy_pass_request_headers on;
    proxy_set_header Connection "keep-alive";
    proxy_store off;
    proxy_pass http://127.0.0.1:19999/$ndpath$is_args$args;
    gzip on;
    gzip_proxied any;
    gzip_types *;
    auth_basic "Authentication Required";
    auth_basic_user_file /etc/nginx/netdata.users;
}
Netdata settings when running behind a proxy
[web]
    enable gzip compression = no
[logs]
    # access = /var/log/netdata/access.log
    access = off
[web]
    #bind to = localhost
    bind to = unix:/var/run/netdata/netdata.sock
When using a unix socket
nginx
#proxy_pass http://127.0.0.1:19999/$ndpath$is_args$args;
proxy_pass http://unix:/var/run/netdata/netdata.sock:/$ndpath$is_args$args;
Apache Reverse Proxy Settings
mod_proxy & mod_proxy_http
vhosts.conf
<VirtualHost *:80>
    ProxyRequests Off
    ProxyPreserveHost On
    <Proxy *>
        Require all granted
    </Proxy>
    ProxyPass "/netdata/" "http://localhost:19999/" \
        connectiontimeout=5 timeout=30 keepalive=on
    ProxyPassReverse "/netdata/" "http://localhost:19999/"
    # if the user did not give the trailing /
    RewriteEngine On
    RewriteRule ^/netdata$ http://%{HTTP_HOST}/netdata/ [L,R=301]
    # add a <Location /netdata/> section
    <Location /netdata/>
        AuthType Basic
        AuthName "Protected site"
        AuthUserFile htpasswd
        Require valid-user
        Require all denied
    </Location>
</VirtualHost>
Netdata registry
registry = node menu (on top left corner of the Netdata dashboards)
Purpose:
- enables the Netdata cloud features, such as the node view
- multiple Netdata are integrated into one distributed application (distributed monitoring)
The registry keeps track of 4 entities:
- machine_guid: a random GUID generated by each Netdata (first time it starts)
- person_guid: the web browsers accessing the Netdata installations (first time it sees a new web browser)
- URLs of Netdata installations
- accounts: i.e. the information used to sign-in via one of the available sign-in methods.
Default registry: https://registry.my-netdata.io
Who talks to the registry?
Your web browser only!
Flow
Browser --> netdata
<-URL to Registry-
Browser --> Registry
Run your own registry
# Server (registry)
* Every Netdata can be a registry
[registry]
    enabled = yes
    registry to announce = http://your.registry:19999
    allow from = 192.168.123.*
Remark
(1) The registry has its own database: /var/lib/netdata/registry/*.db
- registry-log.db, the transaction log
- registry.db, the database
(2) IPs allowed by [registry].allow from should also be allowed by [web].allow connection from.
# Client (netdata)
Advertise it to registry
[registry]
enabled = no
registry to announce = http://your.registry:19999
Web Setting
Disable Web Dashboard
[web]
    mode = none
Threads
# The default number of processor threads is min(cpu cores, 6)
[web]
    web server threads = 4
    web server max sockets = 512
Access lists
Netdata supports access lists in netdata.conf:
[web]
    allow connections from = localhost *
    allow dashboard from = localhost *
    allow badges from = *
    allow streaming from = *
    allow netdata.conf from = localhost
    allow management from = localhost
Notes
allow badges from
checks if the API request is for a badge. Badges are not matched by allow dashboard from.
allow netdata.conf from
checks the IP to allow http://netdata.host:19999/netdata.conf
IPs allowed by allow netdata.conf from should also be allowed by allow connections from
Plugins
Internal, External, Modular plugins
- Internal data collection plugins (running inside the Netdata daemon)
- External data collection plugins (independent processes, sending data to Netdata over pipes)
- Modular plugin orchestrators (external plugins that have multiple data collection modules)
netdata.conf
Disable a plug-in
In the config folder: node.d python.d ..
[plugins]
    proc = yes
    diskspace = yes
    ...
    node.d = yes
Per plug-in setting
[plugin:python.d]
    # update every = 5
Get the current complete settings
cd /etc/netdata
mv netdata.{conf,bak}
wget -O netdata.conf http://localhost:19999/netdata.conf
Disable IPv6 Metrics
Edit netdata.conf
[plugin:proc]
    /proc/net/sockstat6 = no
[plugin:proc:/proc/net/snmp6]
    filename to monitor = none
systemctl restart netdata
Disable monitoring of certain NICs
[plugin:proc:/proc/net/dev:lxcbr0]
    enabled = no
"diskspace" Settings
Disabling performance metrics
# For individual device
[plugin:proc:/proc/diskstats:sda]
    enable performance metrics = no
# Disable By path / filesystem
* disable data collection plugins that you don't need => Save both CPU and RAM
[plugin:proc:diskspace]
    exclude space metrics on paths = /tmp /dev/* /run/* /var/*
    exclude space metrics on filesystems = *sshfs fusectl autofs
edit-config
This command copies the stock config to the user config
- Stock config files at: '/opt/netdata/usr/lib/netdata/conf.d'
- User config files at: '/opt/netdata/etc/netdata'
USAGE:
./edit-config [options] [FILENAME]
Get a list of known config files
./edit-config --list
e.g.
./edit-config go.d.conf
Then make the change in the editor
modules:
  ...
  logind: no
Health monitoring
Disable monitoring
/opt/netdata/etc/netdata/netdata.conf
[health]
    # Default: yes
    enabled = no
/opt/netdata/etc/netdata/edit-config health_alarm_notify.conf
SEND_EMAIL="YES"
* Default alarms shipped with Netdata.
Alarm to multiple mailboxes
health_alarm_notify.conf
# to receive only critical alarms, set it to "root|critical"
# recipients used when the alarm has no "to:" line
DEFAULT_RECIPIENT_EMAIL="[email protected] [email protected]|critical"
...
# recipients used when the alarm has "to: sysadmin"
role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}"
Testing Notifications
# become user netdata
su -s /bin/bash netdata
# enable debugging info on the console
export NETDATA_ALARM_NOTIFY_DEBUG=1
# send test alarms to sysadmin
/opt/netdata/usr/libexec/netdata/plugins.d/alarm-notify.sh test
...
--- BEGIN sendmail command ---
/usr/sbin/sendmail -t
--- END sendmail command ---
2021-01-07 18:23:17: alarm-notify.sh: INFO: sent email notification for: hypervisor.datahunter.org test.chart.test_alarm is CLEAR to '[email protected]'
# OK
Stop notifications for individual alarms (silencing the alarm)
Step1: Find the alarm configuration file
e.g.
/opt/netdata/usr/lib/netdata/conf.d/health.d/net.conf
Step2: Edit the file to enable silencing
to: sysadmin
change to
to: silent
Example
NIC Full Loading
alarm: 5m_sent_traffic_overflow
on: net.ens192
os: linux
hosts: *
families: *
#lookup: average -5m unaligned absolute of received
lookup: average -5m unaligned absolute of sent
calc: ($interface_speed > 0) ? ($this * 100 / (100 * 1000)) : ( nan )
units: %
every: 60s
warn: $this > (($status >= $WARNING) ? (80) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (90))
delay: down 1m multiplier 1.5 max 1h
info: interface sent bandwidth usage over net device speed max
to: sysadmin
Content in the alarm email
[ $this = 94.079845 ] [ $status = 1 ] [ $CRITICAL = 4 ]
# CPU Usage
template: 10min_cpu_usage
on: system.cpu
os: linux
hosts: *
lookup: average -10m unaligned of user,system,softirq,irq,guest
units: %
every: 1m
warn: $this > (($status >= $WARNING) ? (75) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (95))
delay: down 15m multiplier 1.5 max 1h
info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
to: sysadmin
# Using a variable
template: 1m_received_traffic_overflow
on: net.net
os: linux
hosts: *
families: *
lookup: average -1m unaligned absolute of received
# $interface_speed is defined by an earlier template
calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan )
units: %
every: 10s
warn: $this > (($status >= $WARNING) ? (80) : (85))
crit: $this > (($status == $CRITICAL) ? (85) : (90))
delay: down 1m multiplier 1.5 max 1h
info: interface received bandwidth usage over net device speed max
to: sysadmin
alarm vs template
Alarms
Alarms are attached to specific charts (e.g. net.eth0) and use the alarm label.
Alarms have higher precedence and will override templates.
If an alarm and template entity have the same name and attach to the same chart, Netdata will use the alarm.
Need to find the context? Hover over the date on any given chart and look at the tooltip.
Templates
define rules that apply to all charts of a specific context(net.net), and use the template label.
Templates help you apply one entity to all disks, all network interfaces, all MySQL databases, and so on.
Explanation
on:
Which chart the entity listens to
lookup:
This line makes a database lookup to find a value. The result of this lookup is available as $this
lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS]
METHOD
one of average, min, max, sum, incremental-sum
average:
Calculate the average of all the metrics collected.
percentage:
Clarify that we're calculating a percentage of RAM usage.
of used: Specify which dimension (used) on the system.ram chart you want to monitor with this entity.
AFTER
a relative number of seconds, but it also accepts a single letter for changing the units,
like -1s = 1 second in the past, -1m = 1 minute in the past, -1h = 1 hour in the past
OPTIONS
space separated list of percentage, absolute, min2max, unaligned, match-ids, match-names
i.e.
lookup: average -10m unaligned of user,system,softirq,irq,guest
units:
The units of the value returned by "calc:"
every:
How often to perform the lookup calculation to decide whether or not to trigger this alarm.
warn/crit:
The value at which Netdata should trigger a warning or critical alarm.
warn: EXPRESSION
i.e.
warn: $this > 80
crit: $this >= 90
conditional evaluation operator "?"
The conditional evaluation operator ? is supported too.
Using this operator IF-THEN-ELSE conditional statements can be specified.
The format is: (condition) ? (true expression) : (false expression).
hysteresis(":")
warn: $this > (($status >= $WARNING) ? (75) : (85))
it will trigger a warning the first time it goes above 85,
but will remain a warning until it goes below 75 (or goes above 85).
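The hysteresis behaviour of that warn expression can be simulated outside Netdata. The sketch below is only a model: the status constants are placeholder numbers where just the ordering (CLEAR below WARNING) matters, and the function name is invented for the example.

```python
CLEAR, WARNING = 1, 3  # placeholder codes; only their ordering matters here

def evaluate_warn(value: float, status: int) -> int:
    """Model of: warn: $this > (($status >= $WARNING) ? (75) : (85))"""
    # once in WARNING the lower threshold applies, so the alarm "sticks"
    threshold = 75 if status >= WARNING else 85
    return WARNING if value > threshold else CLEAR

status = CLEAR
history = []
for value in [80, 86, 80, 76, 74, 80]:
    status = evaluate_warn(value, status)
    history.append(status)
print(history)
```

The value 80 is treated differently depending on the current status: it does not raise the alarm from CLEAR, but it keeps an existing WARNING active until the value drops to 75 or below.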
percentage
instead of returning the value, calculate the percentage of the sum of the selected dimensions,
versus the sum of all the dimensions of the chart. This also sets the units to %.
absolute or abs, turn all values positive and then sum them.
min2max, when multiple dimensions are given, do not sum them, but take their max - min
special variables
$this, which is resolved to the value of the current alarm.
$status, which is resolved to the current status of the alarm
These values can be compared with
$REMOVED, $UNINITIALIZED, $UNDEFINED, $CLEAR, $WARNING, $CRITICAL.
These values are incremental, ie. $status > $CLEAR works as expected.
$now, which is resolved to current unix timestamp.
unaligned
when data are reduced / aggregated (e.g. the request is about the average of the last minute, or hour),
Netdata by default aligns them so that the charts will have a constant shape
(so average per minute returns always XX:XX:00 - XX:XX:59).
Setting the unaligned option, Netdata will aggregate data without any alignment,
so if the request is for 60 seconds, it will aggregate the latest 60 seconds of collected data.
calc:
A calculation to apply to the value found via lookup or another variable.
green/red:
Set the green and red thresholds of a chart.
Both are available as $green and $red in expressions.
These will eventually be visualized on the dashboard.
exec:
The script to execute when the alarm changes status.
repeat:
Format: repeat: [off] [warning DURATION] [critical DURATION]
The interval for sending notifications when an alarm is in WARNING or CRITICAL mode.
This will override the default interval settings inherited from health settings in netdata.conf
(default repeat warning = DURATION and default repeat critical = DURATION)
Use 0s to turn off the repeating notification for WARNING / CRITICAL mode.
ie
repeat: warning 600s critical 600s
delay:
delay: [[[up U] [down D] multiplier M] max X]
up U
defines the delay to be applied to a notification for an alarm that raised its status (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, up 10s, the notification for this event will be sent 10 seconds after the actual event. This is used in hope the alarm will get back to its previous state within the duration given. The default U is zero.
multiplier M
multiplies U and D every time an alarm changes state while a notification is delayed.
The default multiplier is 1.0.
ie
delay: down 15m multiplier 1.5 max 1h
info:
A description of the alarm, which will appear in the dashboard and notifications.
Reload health configuration
To make any changes to your health configuration live, you must reload Netdata's health monitoring system.
To do that without restarting all of Netdata, run the following:
killall -USR2 netdata
OR
netdatacli reload-health
netdatacli
netdatacli help
reload-health
ping
Stopping notifications for individual alarms
Setup
cd /opt/netdata/etc/netdata
./edit-config health.d/btrfs.conf # call nano to create health.d/btrfs.conf
# To silence this alarm, change sysadmin to silent.
to: silent
# reload
killall -USR2 netdata
Central Netdata server (streaming)
Netdata slaves streaming metrics to upstream Netdata servers
use exactly the same protocol local plugins use.
statsd
Port: 8125/TCP, 8125/UDP
statsd is a system to collect data from any application.
Applications are sending metrics to it, usually via non-blocking UDP communication,
and statsd servers collect these metrics,
perform a few simple calculations on them and push them to backend time-series databases.
* Netdata is a fully featured statsd server.
* Netdata statsd is inside Netdata (an internal plugin, running inside the Netdata daemon)
Disable statsd
[statsd]
    enabled = no
Database Queries
API endpoint: /api/v1/data
i.e.
curl -Ss 'http://localhost:19999/api/v1/data?chart=system.cpu'
{
  "labels": ["time", "guest_nice", "guest", "steal", "softirq", "irq", "user", "system", "nice", "iowait"],
  "data": [
    [ 1707117060, 0, 0, 0, 0.359386, 0, 0.2032042, 1.1016693, 7.175965, 0.0151144],
    [ 1707117000, 0, 0, 0, 0.344185, 0, 0.1880425, 1.0241601, 6.838367, 0.0151106],
    [ 1707116940, 0, 0, 0, 0.3038085, 0, 0.191349, 0.9735301, 6.366551, 0.0117495],
    [ 1707116880, 0, 0, 0, 0.3712289, 0, 0.2166902, 1.2094336, 7.600954, 0.0100786],
    [ 1707116820, 0, 0, 0, 0.2635422, 0, 0.1846474, 0.9182012, 5.74757, 0.0016786],
    [ 1707116760, 0, 0, 0, 0.3324211, 0, 0.2115407, 1.0979971, 7.002669, 0.0184678],
    [ 1707116700, 0, 0, 0, 0.2955053, 0, 0.2115549, 1.0090834, 6.355043, 0.003358],
    [ 1707116640, 0, 0, 0, 0.2734991, 0, 0.1929595, 0.899359, 5.882748, 0.0050337],
    [ 1707116580, 0, 0, 0, 0.2802484, 0, 0.1762041, 0.9833865, 6.5464, 0.0050344],
    [ 1707116520, 0, 0, 0, 0.2601151, 0, 0.1879542, 0.8441155, 5.838326, 0.0067126]
  ]
}
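A response in this shape is easy to consume from a script: each row in "data" lines up positionally with "labels". The sketch below parses a shortened copy of the response (the first two rows only) with Python's standard json module; the variable names are illustrative.

```python
import json

# Shortened copy of the JSON response: first two rows only.
sample = """
{ "labels": ["time", "guest_nice", "guest", "steal", "softirq", "irq",
             "user", "system", "nice", "iowait"],
  "data": [
    [1707117060, 0, 0, 0, 0.359386, 0, 0.2032042, 1.1016693, 7.175965, 0.0151144],
    [1707117000, 0, 0, 0, 0.344185, 0, 0.1880425, 1.0241601, 6.838367, 0.0151106]
  ] }
"""

doc = json.loads(sample)
# Pair every value with its label so columns can be read by name.
rows = [dict(zip(doc["labels"], row)) for row in doc["data"]]
for row in rows:
    print(row["time"], row["user"], row["system"])
```

In this sample the rows are ordered newest first, so rows[0] is the most recent point.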
curl -Ss 'http://localhost:19999/api/v1/data?chart=system.cpu&format=csv&after=-600&points=1&options=percentage'
format
- json # Default
- html
- csv
- ...
after & before
define a time-frame, accepting
points
The number of points to be returned.
If not given, the result will have the same granularity as the database
group
The grouping method to use when reducing the points the database has.
If not given, it defaults to average.
options
Only 2 options are used by the query engine: unaligned and percentage.
All the other options are used by the output formatters.
The default is to return aligned data.
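When calling the API from a script, the query string can be assembled from the parameters described above rather than typed by hand. This is a minimal sketch using only the Python standard library; the helper name `data_url` and its host/port defaults are assumptions for the example, and it only builds the URL without contacting a server.

```python
from urllib.parse import urlencode

def data_url(chart: str, host: str = "localhost", port: int = 19999,
             **params) -> str:
    """Build an /api/v1/data URL for one chart plus optional query parameters."""
    query = urlencode({"chart": chart, **params})
    return f"http://{host}:{port}/api/v1/data?{query}"

url = data_url("system.cpu", format="csv", after=-600, points=1,
               options="percentage")
print(url)
```

The resulting URL can be passed to curl or any HTTP client.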
units
netdata has hard-coded units
- disk I/O is in kilobytes/s
- disk size is in GB
- memory size is in MB
- network bandwidth is in kilobits/s
- temperatures are in Celsius
Export and Import a Snapshot
Snapshots can be incredibly useful for diagnosing anomalies after they've already happened.
Let's say Netdata triggered an alarm while you were sleeping.
The generated snapshot will include all charts of this dashboard, for the visible timeframe
To export a snapshot, click on the export icon.
The snapshot will be downloaded as a file, to your computer,
that can be imported back into any netdata dashboard (no need to import it back on this server).
Performance Tuning
CPU
[web] enable gzip compression = no
Disable logs
[logs]
    debug log = /opt/netdata/var/log/netdata/debug.log
    error log = /opt/netdata/var/log/netdata/error.log
    # access log = /opt/netdata/var/log/netdata/access.log
    access log = none
Disable plugins
[plugins]
    # the User Groups / Users / Applications sections in the menu
    apps = no
[plugin:proc:/proc/stat]
    cpu utilization = yes
    per cpu core utilization = no
    core_throttle_count = no # 1
    cpu frequency = no
[plugin:cgroups]
    enable systemd services = no
Notes
1) cpu throttling
CPU throttling refers to the process of dynamically adjusting the CPU frequency or performance based on the system's workload.
Netdata's Own Info
How many metrics, on average, do your Agents collect?
This is shown at the bottom right of the dashboard
Every 5 seconds, Netdata collects 3,387 metrics on home, presents them in 654 charts and monitors them with 268 alarms. netdata v1.28.0
What is your compression savings ratio?
Search "dbengine_compression_ratio" on dashboard (Netdata Monitoring / dbengine)
Typical compression ratio of 80%
Disk Usage (1 day)
Assume
Collection interval (update every) = 1
Metrics collected each time = 2000
Space per metric = 4 bytes
Without compression => (2000 * 3600 * 24 * 4 / 1)/(1024^2) = 659 MB
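The daily disk estimate can be parameterised so that other collection intervals are easy to try. This is a rough sketch of the uncompressed calculation only; the function name is hypothetical and compression (the ratio mentioned above) is not modelled.

```python
def daily_disk_mib(metrics: int, update_every: int = 1,
                   bytes_per_point: int = 4) -> float:
    """Uncompressed MiB written per day for `metrics` collected every N seconds."""
    points_per_day = 3600 * 24 / update_every
    return metrics * points_per_day * bytes_per_point / 1024 ** 2

print(round(daily_disk_mib(2000, update_every=1)), "MB per day at 1s")
print(round(daily_disk_mib(2000, update_every=5)), "MB per day at 5s")
```

Collecting every 5 seconds instead of every second cuts the estimate to one fifth.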
Monitoring Kernel Memory de-duplication performance
Netdata will create charts for kernel memory de-duplication - deduper (ksm)
- mem.ksm,
- mem.ksm_savings (savings/offered),
- mem.ksm_ratios (%)
Config
[plugin:proc]
    /sys/kernel/mm/ksm = yes
[plugin:proc:/sys/kernel/mm/ksm]
    /sys/kernel/mm/ksm/pages_shared = /sys/kernel/mm/ksm/pages_shared
    /sys/kernel/mm/ksm/pages_sharing = /sys/kernel/mm/ksm/pages_sharing
    /sys/kernel/mm/ksm/pages_unshared = /sys/kernel/mm/ksm/pages_unshared
    /sys/kernel/mm/ksm/pages_volatile = /sys/kernel/mm/ksm/pages_volatile
Custom Page
# Where to fetch the data from
<script>
// this section has to appear before loading dashboard.js
var netdataTheme = 'slate'; // this is dark
// the default is the server that dashboard.js is downloaded from.
// var netdataServer = 'http://my.server:19999/';
</script>
# Settings
<script>
// This has to be done, after dashboard.js is loaded
// true(default) = “on focus”; false = “always”
NETDATA.options.current.stop_updates_when_focus_is_lost = false;
// lower the pressure on this browser
// controls the number of concurrent data collection threads
// Each thread is responsible for collecting data from a specific data source
// (e.g., a plugin or a system metric).
NETDATA.options.current.concurrent_refreshes = false;
// if the tv browser is too slow (a pi?) set this to false
// enables parallel data collection to improve performance
NETDATA.options.current.parallel_refresher = true;
</script>
# Layout
<div style="width: 100%; text-align: center;"> ... </div>
Dygraph
Y-Axis for Dygraph
The min and max values of the y-axis using data-dygraph-valuerange="MIN, MAX"
V1.37
OS: Rocky 8 # el8
[db]
    update every = 2
    dbengine page cache size MB = 64
    dbengine disk space MB = 256
    mode = dbengine
To minimize resource utilization and should only be considered on Parent - Child setups
mode = dbengine is only for when Parent & Child are on the same machine (single node setup)
[db]
    retention = 3600
"retention" controls the size of the database in memory (except for [db].mode = dbengine)
"[db].update every = 2" AND "[db].retention = 1800" => 1 hr data
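The relationship between "retention" and "update every" is a simple product, sketched below. The helper name is made up for illustration and it only applies to the non-dbengine modes described above.

```python
def in_memory_history_seconds(retention: int, update_every: int) -> int:
    """Time span covered by `retention` entries collected every N seconds."""
    return retention * update_every

span = in_memory_history_seconds(retention=1800, update_every=2)
print(span, "seconds =", span / 3600, "hour(s)")
```

Keeping retention at 3600 with update every = 2 would instead cover two hours.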
Settings
[web]
    enable gzip compression = no
    disconnect idle clients after seconds = 3600
    bind to = 0.0.0.0
    # allow connections from = localhost *
    allow connections from = localhost 192.168.123.0/24
[ml]
    enabled = no
[health]
    enabled = no
[logs]
    # access = /var/log/netdata/access.log
    access = none
[plugins]
    debugfs = no
[plugin:proc]
    /proc/net/sockstat6 = no
[plugin:proc:/proc/net/snmp6]
    filename to monitor = none
[plugin:proc:/proc/net/dev:lo]
    enabled = no