netdata

Posted by datahunter on Tue, 13/08/2019 - 17:03

Last updated: 2024-06-02

Installation
Anonymous Statistics
Upgrade
Usage
Configure
Database
Nginx
Apache Reverse Proxy Settings
Netdata registry
Web Setting
Plugins
Disable IPv6 Metrics
edit-config
Health monitoring
Stopping notifications for individual alarms
Central Netdata server (streaming)
statsd
Database Queries
Export and import a snapshot
Performance Tuning
Netdata's Own Info
Monitoring Kernel Memory de-duplication performance
Custom Page
V1.37

Introduction

A real-time performance and health monitoring solution

Designed to:

Solve the centralization problem of monitoring (Scales to infinity)
Replace the console for performance troubleshooting

HomePage: https://www.netdata.cloud/

Features

1s granularity
Zero disk I/O (by default all data is stored in RAM)

Port:

19999/tcp # Web panel
8125/TCP, 8125/UDP # statsd

How it works:

           Query
             |
Collect -> Store -> Stream -> Archive
             |
           Check

positive / negative values

positive values: read, input, inbound, received
negative values: write, output, outbound, sent

monitoring agent

metrics collector
time-series database
metrics visualizer
alarms notification engine

Security Design

Netdata daemon runs as a normal system user
plugins perform a hard coded data collection job
communication between plugins / Netdata slaves and the daemon is unidirectional: from the plugin towards the Netdata daemon
dashboards are read-only
data do not leave the server where they are collected
Netdata servers do not talk to each other
your browser connects all the Netdata servers

Installation

Netdata has 4 installation methods

Linux distribution Binary Packages
Linux 64bit pre-built static binary
Run Netdata in a Docker container
Install Netdata on Linux manually

Method 1: Binary Packages (Linux Distribution)

# Ubuntu 22.04 (netdata v1.33)

# Rocky8 (netdata v1.36.1)

dnf install netdata

systemctl enable netdata --now

Checking

netstat -ntlp | grep netdata

tcp        0      0 127.0.0.1:8125          0.0.0.0:*               LISTEN      22515/netdata
tcp        0      0 0.0.0.0:19999           0.0.0.0:*               LISTEN      22515/netdata

Method 2: Static Binary

mkdir /usr/src/netdata

cd /usr/src/netdata

# Check which versions are available

https://github.com/netdata/netdata/releases

V=v1.46.3 #
V=v1.45.6 # 2024-06-05
V=v1.45.5 # 2024-05-21

LINK=https://github.com/netdata/netdata/releases/download/$V/netdata-latest.gz.run

wget $LINK -O netdata.${V}.gz.run

bash netdata.${V}.gz.run

Remark

It is installed in /opt/netdata
A user account 'netdata' is created

How to check Netdata version in UI

# V1.4.X

Go to the Nodes tab (at the top of the page) and click on the Info icon ("i")

Anonymous Statistics

Starting with v1.12 Netdata also collects anonymous statistics on certain events

rpm -ql netdata | grep statistics

/usr/libexec/netdata/plugins.d/anonymous-statistics.sh

Code

if [ -f "/etc/netdata/.opt-out-from-anonymous-statistics" ] ||
  ...
  exit 0
fi

To opt-out from sending anonymous statistics

touch /etc/netdata/.opt-out-from-anonymous-statistics

log

grep anonymous /var/log/netdata/error.log

... netdata INFO  : MAIN : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'OK' '-'

Upgrade

chmod 700 ./netdata-latest.gz.run

# The process stops / starts netdata automatically

./netdata-latest.gz.run --accept

Usage

Web: http://your.server.ip:19999/

Zoom the current chart: SHIFT + mouse wheel over a chart

Highlight a time-frame: ALT + select an area on a chart

Auto-detection of data collection sources

This auto-detection process happens only once, when Netdata starts.

Exceptions: Containers and VMs are auto-detected forever

Configure

Get running config:

http://127.0.0.1:19999/netdata.conf

Config File Location:

/opt/netdata/etc/netdata/netdata.conf

Get the current config file (all in one file)

wget -O netdata.conf http://localhost:19999/netdata.conf

RAM & CPU Usage

[global]
  # Enable KSM to halve Netdata's memory requirement
  history = 3600
  update every = 1
  process scheduling policy = idle

Notes

update every: data collection frequency, in seconds. Default: 1

Memory modes (DB)

map # data are in memory mapped files (swap)
dbengine(Default)
ram # data are purely in memory. Data are never saved on disk.
save # data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.
none # without a database (collected metrics can only be streamed to another Netdata)

[global]
  memory mode = dbengine

map

Data are in memory mapped files. This works like the swap.
(constant write on your disk, does not support KSM)

For each chart, Netdata maps the following files:

chart/main.db # chart information.
# Every time data are collected for a chart, this is updated.
chart/dimension_name.db # round robin database

dbengine

The data are in database files.

Files

# Debian: /var/cache/netdata/dbengine

ls /opt/netdata/var/cache/netdata/dbengine

datafile-1-0000000001.ndf
journalfile-1-0000000001.njf
datafile-1-0000000002.ndf
journalfile-1-0000000002.njf    <- the higher the number, the newer the file
...

Size of each datafile is determined automatically by Netdata. (4MB ~ 512MB)

Netdata will decide a datafile size trying to maintain about 50 datafiles for the whole database

----

njf = journal file v1

holds information about the transactions in its datafile (4KB)

----

There is some amount of RAM dedicated to data caching and indexing

# Unit: MiB
page cache size = 32

The number of history entries is not fixed (depends on the configured disk space)

"dbengine" is the only mode that supports changing "update_every" without losing the previously stored metrics

"history" configuration option is meaningless for "memory mode = dbengine"

This mode is recommended on nodes that also run other applications

Database Engine uses direct I/O to avoid polluting the OS filesystem caches

# Unit: MiB
dbengine disk space = 256

----

The DB engine stores chart metric values in 4k pages in memory.

Each chart dimension gets its own page to store consecutive values generated from the data collectors.

When those pages fill up they are slowly compressed and flushed to disk.

=> i.e. a page is flushed about every 17 minutes

# Each dimension's page = 4 KiB and each record = 4 bytes, so when collecting once per second the page is full after 1024 seconds

4096 / 4 = 1024 sec (dimension: 1s)
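
The arithmetic above can be checked with a small sketch. It assumes only the two figures given in these notes (a 4 KiB page per dimension and 4 bytes per value); the function name is made up for illustration.

```python
PAGE_SIZE_BYTES = 4096   # one page per chart dimension (figure from the notes)
BYTES_PER_VALUE = 4      # size of one stored value (figure from the notes)


def seconds_until_page_full(update_every=1):
    """Seconds of collection needed to fill one page of a single dimension."""
    values_per_page = PAGE_SIZE_BYTES // BYTES_PER_VALUE
    return values_per_page * update_every


if __name__ == "__main__":
    secs = seconds_until_page_full(1)
    print(f"{secs} s (~{secs / 60:.1f} min)")
```

With a longer collection interval the page takes proportionally longer to fill, e.g. `seconds_until_page_full(5)` for `update every = 5`.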

When the disk quota is exceeded, the oldest values are removed from the DB engine in real time

* When we query the DB engine for data

=> trigger disk read I/O requests that fill the Page Cache with the requested pages

* The Database Engine uses direct I/O to avoid polluting the OS filesystem caches.

----

Config

/opt/netdata/etc/netdata/netdata.conf

[global]
    memory mode = dbengine
    # Unit: MiB
    page cache size = 32
    dbengine disk space = 256

* There is one DB engine instance per Netdata host/node

* All DB engine instances, for localhost and all other streaming recipient nodes inherit their configuration from netdata.conf

* There are explicit memory requirements per DB engine instance

File descriptor

The Database Engine may keep a significant amount of files open per instance

(at least 50 file descriptors available per dbengine instance)

systemctl edit netdata

[Service]
LimitNOFILE=65536

Remark: /etc/sysctl.conf: "fs.file-max = 65536"

ram

data are purely in memory. Data are never saved on disk. (Supports KSM)

save (the former default)

Data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.

It also uses mmap() and supports KSM.

Performance

KSM

Disable data collection

disable data collection plugins that you don't need => Save both CPU and RAM

e.g.

Disable IPv6 Metrics

The number of metrics and charts shown at the bottom-right corner of the web panel will decrease

Every second, Netdata collects 1,560 metrics on SERVER, presents them in 320 charts ...

OOM Score

[global]
    OOM score = 1000

Higher => Netdata will be the first to be killed when your server runs out of memory.

Checking

cat /proc/$(pidof netdata)/oom_score

Scheduling Policy

[global]
  process scheduling policy = idle

By default Netdata runs with the idle process scheduling policy,

so that it uses CPU resources, only when there is idle CPU to spare.

Database

Ram Usage (for DB)

The default history is 3600 entries.

If the data collection frequency is set to 1 second, you will have just one hour of data.

[global]
    history = 3600

It will need 14.4KB for each chart dimension (4 bytes for the value * the entries of its history(3600))

If you need 1000 dimensions, they will occupy just 14.4MB.
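
As a rough check of the figures above, here is a minimal sketch that assumes only what the notes state: 4 bytes per value and one value per history entry. The function names are hypothetical.

```python
BYTES_PER_VALUE = 4  # from the notes: 4 bytes for each stored value


def history_ram_bytes(dimensions, history=3600):
    """RAM used by the round robin database for the given number of dimensions."""
    return dimensions * history * BYTES_PER_VALUE


def history_duration_seconds(history=3600, update_every=1):
    """How much wall-clock time the configured history covers."""
    return history * update_every


if __name__ == "__main__":
    print(history_ram_bytes(1) / 1000, "KB per dimension")
    print(history_ram_bytes(1000) / 1_000_000, "MB for 1000 dimensions")
    print(history_duration_seconds() / 3600, "hour(s) of data")
```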

KSM

Netdata offers all its round robin database to kernel for deduplication

KSM is a solution that will provide 60+% memory savings to Netdata.

# by default 0; 1 for the kernel to spawn ksmd

echo 1 > /sys/kernel/mm/ksm/run

Tiers

storage tiers = 3
update every = 1
dbengine tier 1 update every iterations = 60
dbengine tier 2 update every iterations = 60

i.e.

If a metric is collected per second(update every) in Tier 0,
then we will have a data point every minute in tier 1 and every hour in tier 2
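
The relationship between the tier settings can be sketched as follows. This is only an illustration of the multiplication described above, with a made-up helper name; it does not read any Netdata configuration.

```python
def tier_intervals(update_every, iterations):
    """Seconds between data points for tier 0, tier 1, tier 2, ...

    `iterations` holds the "dbengine tier N update every iterations"
    values, starting with tier 1.
    """
    intervals = [update_every]
    for n in iterations:
        # each tier aggregates n points of the tier below it
        intervals.append(intervals[-1] * n)
    return intervals


if __name__ == "__main__":
    # update every = 1, tier 1 and tier 2 iterations = 60
    print(tier_intervals(1, [60, 60]))
```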

Retention

The general rule is that Netdata needs about 1 byte per data point on disk for tier 0,
and 4 bytes per data point on disk for tier 1 and above.

* dbengine disk space MB (deprecated)

dbengine multihost disk space MB = 256
dbengine tier 1 update every iterations = 60
dbengine tier 2 multihost disk space MB = 64

cache = /var/cache/netdata

dbengine/
dbengine-tier1/
dbengine-tier2/

1000 metrics/second

3 days
22 days
2 years
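
The three retention figures above can be roughly reproduced from the bytes-per-point rule. The sketch below makes several assumptions that are not stated in these notes: 1 MB is taken as 10^6 bytes, tier 1 is given 128 MB of disk, and the tiers hold per-second, per-minute and per-hour points.

```python
def retention_days(disk_mb, bytes_per_point, metrics, seconds_per_point):
    """Approximate retention in days for one tier."""
    points_per_metric = disk_mb * 1_000_000 / bytes_per_point / metrics
    return points_per_metric * seconds_per_point / 86_400


if __name__ == "__main__":
    metrics = 1000
    # (disk MB, bytes per point, seconds per point); the 128 MB value is an assumption
    tiers = [(256, 1, 1), (128, 4, 60), (64, 4, 3600)]
    for tier, (disk, size, step) in enumerate(tiers):
        print(f"tier {tier}: {retention_days(disk, size, metrics, step):.1f} days")
```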

Cache

dbengine page cache size MB = 32
dbengine tier 2 page cache size MB = 8
dbengine tier 1 page cache size MB = 16

Memory for concurrently collected metrics

DBENGINE memory in KiB

METRICS x (TIERS - 1) x 4KiB x 2 + "dbengine page cache size MB"

* TIERS By default 3 ( -1 when using 3+ tiers)

i.e.

# 3 storage tiers & 2k metrics

2000 x 3 x 4 x 2 / 1024 MiB ~ 47 MiB

dbengine page cache size MB = 32 MiB (Default)

Total Netdata memory in MiB = "Metric cardinality factor" x "DBENGINE memory" + "dbengine page cache"

The cardinality factor is usually between 3 and 4 and depends mainly on the ephemerality of the collected metrics.
The more ephemeral the infrastructure, the higher the factor.
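
A minimal sketch of the two formulas above. Note that the worked example in these notes multiplies by 3 rather than by (TIERS - 1), so the tier multiplier is left as an explicit parameter here instead of being derived; the function and parameter names are hypothetical.

```python
def dbengine_memory_mib(metrics, tier_factor, page_kib=4):
    """METRICS x tier factor x 4 KiB x 2, converted from KiB to MiB."""
    return metrics * tier_factor * page_kib * 2 / 1024


def total_memory_mib(metrics, tier_factor, cardinality_factor, page_cache_mib=32):
    """Cardinality factor x DBENGINE memory + dbengine page cache."""
    return cardinality_factor * dbengine_memory_mib(metrics, tier_factor) + page_cache_mib


if __name__ == "__main__":
    # same numbers as the worked example: 2k metrics, multiplier 3
    print(f"{dbengine_memory_mib(2000, 3):.1f} MiB")
    print(f"{total_memory_mib(2000, 3, cardinality_factor=3):.1f} MiB")
```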

nginx

IP Level ACL

* best and the suggested way to protect Netdata

=> Expose Netdata only in a private LAN => IP Level

[web]
    bind to = 10.1.1.1:19999 localhost:19999

Login with username & password

Use web server to provide authentication (in front of all your Netdata servers)

Web Server Setting (nginx)

Nginx to forward requests to netdata

HTTP auth file: /etc/nginx/netdata.users

URL: https://your-server/netdata/

# Running netdata as a subfolder to an existing virtual host

server {
    ...
    include /etc/nginx/templates/netdata.tmpl;
}

netdata.tmpl

location = /status {
    return 301 /status/;
}

location ~ /status/(?<ndpath>.*) {
    #access_log      /var/log/nginx/netdata.log;
    #error_log       /var/log/nginx/netdata.log;

    proxy_redirect off;
    proxy_set_header Host $host;

    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_http_version 1.1;
    proxy_pass_request_headers on;
    proxy_set_header Connection "keep-alive";
    proxy_store off;
    proxy_pass http://127.0.0.1:19999/$ndpath$is_args$args;

    gzip on;
    gzip_proxied any;
    gzip_types *;

    auth_basic "Authentication Required";
    auth_basic_user_file /etc/nginx/netdata.users;
}

Netdata settings when running behind a proxy

[web]
    enable gzip compression = no
    
[logs]
    # access = /var/log/netdata/access.log
    access = off

[web]
    #bind to = localhost
    bind to = unix:/var/run/netdata/netdata.sock

When using a unix socket

nginx

#proxy_pass http://127.0.0.1:19999/$ndpath$is_args$args;
proxy_pass http://unix:/var/run/netdata/netdata.sock:/$ndpath$is_args$args;

Apache Reverse Proxy Settings

mod_proxy & mod_proxy_http

vhosts.conf

<VirtualHost *:80>

    ProxyRequests Off
    ProxyPreserveHost On

    <Proxy *>
        Require all granted
    </Proxy>

    ProxyPass "/netdata/" "http://localhost:19999/" \
               connectiontimeout=5 timeout=30 keepalive=on
    ProxyPassReverse "/netdata/" "http://localhost:19999/"

    # if the user did not give the trailing /
    RewriteEngine On
    RewriteRule ^/netdata$ http://%{HTTP_HOST}/netdata/ [L,R=301]

    # add a <Location /netdata/> section
    <Location /netdata/>
        AuthType Basic
        AuthName "Protected site"
        AuthUserFile htpasswd
        Require valid-user
        Require all denied
    </Location>
    
</VirtualHost>

Netdata registry

registry = node menu (on top left corner of the Netdata dashboards)

Purpose:

enables the Netdata cloud features, such as the node view
multiple Netdata are integrated into one distributed application (distributed monitoring)

The registry keeps track of 4 entities:

machine_guid: a random GUID generated by each Netdata (first time it starts)
person_guid: the web browsers accessing the Netdata installations (first time it sees a new web browser)
URLs of Netdata installations
accounts: i.e. the information used to sign-in via one of the available sign-in methods.

Default registry: https://registry.my-netdata.io

Who talks to the registry?

Your web browser only!

Flow

Browser --> netdata

<-URL to Registry-

Browser --> Registry

Run your own registry

# Server (registry)

* Every Netdata can be a registry

[registry]
    enabled = yes
    registry to announce = http://your.registry:19999
    allow from = 192.168.123.*

Remark

(1) The registry has its own database: /var/lib/netdata/registry/*.db

registry-log.db, the transaction log
registry.db, the database

(2) IPs allowed by [registry].allow from should also be allowed by [web].allow connection from.

# Client (netdata)

Advertise it to registry

[registry]
    enabled = no
    registry to announce = http://your.registry:19999

Web Setting

Disable Web Dashboard

[web]
    mode = none

Threads

# The default number of processor threads is min(cpu cores, 6)

[web]
  web server threads = 4
  web server max sockets = 512

Access lists

Netdata supports access lists in netdata.conf:

[web]
    allow connections from = localhost *
    allow dashboard from = localhost *
    allow badges from = *
    allow streaming from = *
    allow netdata.conf from = localhost
    allow management from = localhost

Notes

allow badges from

checks if the API request is for a badge. Badges are not matched by allow dashboard from.

allow netdata.conf from

checks the IP to allow http://netdata.host:19999/netdata.conf

IPs allowed by allow netdata.conf from should also be allowed by allow connections from

Plugins

Internal, External, Modular plugins

Internal data collection plugins (running inside the Netdata daemon)
External data collection plugins (independent processes, sending data to Netdata over pipes)
Modular plugin orchestrators (external plugins that have multiple data collection modules)

netdata.conf

Disable a plug-in

In the config folder: node.d, python.d, ...

[plugins]
    proc = yes
    diskspace = yes
    ...
    node.d = yes

Per plug-in setting

[plugin:python.d]
        # update every = 5

Get the current complete settings

cd /etc/netdata

mv netdata.{conf,bak}

wget -O netdata.conf http://localhost:19999/netdata.conf

Disable IPv6 Metrics

Edit netdata.conf

[plugin:proc]
  /proc/net/sockstat6 = no

[plugin:proc:/proc/net/snmp6]
        filename to monitor = none

systemctl restart netdata

Disable checks for certain NICs

[plugin:proc:/proc/net/dev:lxcbr0]
        enabled = no

"diskspace" Settings

Disabling performance metrics

# For individual device

[plugin:proc:/proc/diskstats:sda]
    enable performance metrics = no

# Disable By path / filesystem

* disable data collection plugins that you don't need => Save both CPU and RAM

[plugin:proc:diskspace]
    exclude space metrics on paths = /tmp /dev/* /run/* /var/*
    exclude space metrics on filesystems = *sshfs fusectl autofs

edit-config

This command copies the stock config into the user config

Stock config files at: '/opt/netdata/usr/lib/netdata/conf.d'
User config files at: '/opt/netdata/etc/netdata'

USAGE:

./edit-config [options] [FILENAME]

Get a list of known config files

./edit-config --list

e.g.

./edit-config go.d.conf

Then make the change in the editor

modules:
...
logind: no

Health monitoring

Disable monitoring

/opt/netdata/etc/netdata/netdata.conf

[health]
    # Default: yes
    enabled = no

/opt/netdata/etc/netdata/edit-config health_alarm_notify.conf

SEND_EMAIL="YES"

* Default alarms shipped with Netdata.

Alarm to multiple mailboxes

health_alarm_notify.conf

# to receive only critical alarms, set it to "root|critical"
# Who receives the mail when an alarm has no "to:"
DEFAULT_RECIPIENT_EMAIL="[email protected] [email protected]|critical"

...

# Who receives the mail when an alert has "to: sysadmin"
role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}"

Testing Notifications

# become user netdata

su -s /bin/bash netdata

# enable debugging info on the console

export NETDATA_ALARM_NOTIFY_DEBUG=1

# send test alarms to sysadmin

/opt/netdata/usr/libexec/netdata/plugins.d/alarm-notify.sh test

...
--- BEGIN sendmail command ---
/usr/sbin/sendmail -t
--- END sendmail command ---
2021-01-07 18:23:17: alarm-notify.sh: 
 INFO: sent email notification for: 
  hypervisor.datahunter.org test.chart.test_alarm is CLEAR to '[email protected]'
# OK

Stop notifications for individual alarms (silencing the alarm)

Step1: Find the alarm configuration file

e.g.

/opt/netdata/usr/lib/netdata/conf.d/health.d/net.conf

Step2: Edit the file to enable silencing

to: sysadmin

change to

to: silent

Example

NIC Full Loading

    alarm: 5m_sent_traffic_overflow
       on: net.ens192
       os: linux
    hosts: *
 families: *
   #lookup: average -5m unaligned absolute of received
   lookup: average -5m unaligned absolute of sent
     calc: ($interface_speed > 0) ? ($this * 100 / (100 * 1000)) : ( nan )
    units: %
    every: 60s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (90))
    delay: down 1m multiplier 1.5 max 1h
     info: interface sent bandwidth usage over net device speed max
       to: sysadmin

Content in the alarm email

[ $this = 94.079845 ] [ $status = 1 ] [ $CRITICAL = 4 ]

# CPU Usage

template: 10min_cpu_usage
      on: system.cpu
      os: linux
   hosts: *
  lookup: average -10m unaligned of user,system,softirq,irq,guest
   units: %
   every: 1m
    warn: $this > (($status >= $WARNING)  ? (75) : (85))
    crit: $this > (($status == $CRITICAL) ? (85) : (95))
   delay: down 15m multiplier 1.5 max 1h
    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
      to: sysadmin

# Using a variable

 template: 1m_received_traffic_overflow
       on: net.net
       os: linux
    hosts: *
 families: *
   lookup: average -1m unaligned absolute of received
     # $interface_speed is defined by an earlier template
     calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan )
    units: %
    every: 10s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (90))
    delay: down 1m multiplier 1.5 max 1h
     info: interface received bandwidth usage over net device speed max
       to: sysadmin
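
The `calc:` line in the template above can be mirrored in a few lines. This sketch assumes the lookup value is in kilobits/s and that `$interface_speed` is in Mbit/s, which these notes do not state explicitly; the function name is made up.

```python
import math


def bandwidth_usage_percent(avg_kilobits_per_s, interface_speed_mbit):
    """Mirror of: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : (nan)"""
    if interface_speed_mbit > 0:
        return avg_kilobits_per_s * 100 / (interface_speed_mbit * 1000)
    return math.nan


if __name__ == "__main__":
    # about 94 Mbit/s of traffic on an assumed 100 Mbit/s link
    print(bandwidth_usage_percent(94_079.845, 100))
    # unknown interface speed: the calculation yields nan, so no alarm is raised
    print(bandwidth_usage_percent(94_079.845, 0))
```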

alarm vs template

Alarms

Alarms attach to specific charts (e.g. net.eth0) and use the alarm label.

Alarms have higher precedence and will override templates.

If an alarm and template entity have the same name and attach to the same chart, Netdata will use the alarm.

Need to find the context? Hover over the date on any given chart and look at the tooltip.

Templates

Templates define rules that apply to all charts of a specific context (e.g. net.net), and use the template label.

Templates help you apply one entity to all disks, all network interfaces, all MySQL databases, and so on.

Explanation

on:

Which chart the entity listens to

lookup:

This line makes a database lookup to find a value. The result of this lookup is available as $this

lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS]

METHOD

one of average, min, max, sum, incremental-sum

average:

Calculate the average of all the metrics collected.

percentage:

Clarify that we're calculating a percentage of RAM usage.

of used: Specify which dimension (used) on the system.ram chart you want to monitor with this entity.

AFTER

a relative number of seconds, but it also accepts a single letter for changing the units,

like -1s = 1 second in the past, -1m = 1 minute in the past, -1h = 1 hour in the past

OPTIONS

space separated list of percentage, absolute, min2max, unaligned, match-ids, match-names

i.e.

lookup: average -10m unaligned of user,system,softirq,irq,guest

units:

The units of the value returned by "calc:"

every:

How often to perform the lookup calculation to decide whether or not to trigger this alarm.

warn/crit:

The value at which Netdata should trigger a warning or critical alarm.

warn: EXPRESSION

i.e.

  warn: $this > 80
  crit: $this >= 90

conditional evaluation operator "?"

The conditional evaluation operator ? is supported too.

Using this operator IF-THEN-ELSE conditional statements can be specified.

The format is: (condition) ? (true expression) : (false expression).

hysteresis(":")

warn: $this > (($status >= $WARNING)  ? (75) : (85))

it will trigger a warning the first time it goes above 85,
but will remain a warning until it goes below 75 (or goes above 85).
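
The hysteresis behaviour can be illustrated with a small simulation. It is a simplified model of the single `warn:` expression above: the numeric status codes are hypothetical (only their ordering matters) and the critical state is ignored.

```python
CLEAR, WARNING = 0, 1  # hypothetical codes; only CLEAR < WARNING matters here


def next_warning_status(value, status, low=75, high=85):
    """Evaluate: $this > (($status >= $WARNING) ? (75) : (85))"""
    threshold = low if status >= WARNING else high
    return WARNING if value > threshold else CLEAR


def run(values):
    """Feed a series of values through the expression, carrying the status along."""
    status = CLEAR
    history = []
    for value in values:
        status = next_warning_status(value, status)
        history.append(status)
    return history


if __name__ == "__main__":
    # 80 alone does not raise; after 86 raised it, 80 and 76 keep it raised; 74 clears it
    print(run([80, 86, 80, 76, 74, 80]))
```

The same value (80) gives a different result depending on the previous status, which is what keeps the alarm from flapping around a single threshold.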

percentage

instead of returning the value, calculate the percentage of the sum of the selected dimensions,
versus the sum of all the dimensions of the chart. This also sets the units to %.

absolute or abs, turn all values positive and then sum them.

min2max, when multiple dimensions are given, do not sum them, but take their max - min

special variables

$this, which is resolved to the value of the current alarm.

$status, which is resolved to the current status of the alarm

This value can be compared with

$REMOVED, $UNINITIALIZED, $UNDEFINED, $CLEAR, $WARNING, $CRITICAL.

These values are incremental, i.e. $status > $CLEAR works as expected.

$now, which is resolved to current unix timestamp.

unaligned

when data are reduced / aggregated (e.g. the request is about the average of the last minute, or hour),
Netdata by default aligns them so that the charts will have a constant shape
(so average per minute returns always XX:XX:00 - XX:XX:59).
Setting the unaligned option, Netdata will aggregate data without any alignment,
so if the request is for 60 seconds, it will aggregate the latest 60 seconds of collected data.
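
The difference between aligned and unaligned lookups can be sketched as a window calculation. This is a simplified model based only on the description above (aligned windows end on a multiple of the duration); Netdata's real alignment logic may differ in detail, and the function name is made up.

```python
def lookup_window(now, duration, unaligned):
    """Return (after, before) unix timestamps spanning `duration` seconds."""
    if unaligned:
        before = now                       # the latest `duration` seconds
    else:
        before = now - (now % duration)    # snap back to a boundary, e.g. XX:XX:00
    return before - duration, before


if __name__ == "__main__":
    now = 1707117075  # 15 seconds past a minute boundary
    print("aligned:  ", lookup_window(now, 60, unaligned=False))
    print("unaligned:", lookup_window(now, 60, unaligned=True))
```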

calc:

A calculation to apply to the value found via lookup or another variable.

green/red:

Set the green and red thresholds of a chart.

Both are available as $green and $red in expressions.

These will eventually be visualized on the dashboard.

exec:

The script to execute when the alarm changes status.

repeat:

Format: repeat: [off] [warning DURATION] [critical DURATION]

The interval for sending notifications when an alarm is in WARNING or CRITICAL mode.
This will override the default interval settings inherited from health settings in netdata.conf
(default repeat warning = DURATION and default repeat critical = DURATION)
Use 0s to turn off the repeating notification for WARNING / CRITICAL mode.

repeat: warning 600s critical 600s

delay:

delay: [[[up U] [down D] multiplier M] max X]

up U

defines the delay to be applied to a notification for an alarm that raised its status (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, up 10s, the notification for this event will be sent 10 seconds after the actual event. This is used in hope the alarm will get back to its previous state within the duration given. The default U is zero.

multiplier M

multiplies U and D every time an alarm changes state while a notification is delayed.

The default multiplier is 1.0.

delay: down 15m multiplier 1.5 max 1h
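
To see how the multiplier and the maximum interact, here is a rough model. It assumes the delay is multiplied once per state change while the notification is pending and then capped at `max`; this is an interpretation of the description above, not Netdata's actual code.

```python
def delay_sequence(initial_s, multiplier, max_s, changes):
    """Delay (in seconds) applied at each consecutive state change."""
    delays = []
    delay = float(initial_s)
    for _ in range(changes):
        delays.append(min(delay, max_s))  # never exceed "max X"
        delay *= multiplier
    return delays


if __name__ == "__main__":
    # delay: down 15m multiplier 1.5 max 1h
    for d in delay_sequence(15 * 60, 1.5, 60 * 60, 5):
        print(f"{d / 60:.1f} min")
```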

info:

A description of the alarm, which will appear in the dashboard and notifications.

Reload health configuration

To make any changes to your health configuration live, you must reload Netdata's health monitoring system.

To do that without restarting all of Netdata, run the following:

killall -USR2 netdata

netdatacli reload-health

netdatacli

netdatacli help

reload-health

ping

Stopping notifications for individual alarms

設定

cd /opt/netdata/etc/netdata

./edit-config health.d/btrfs.conf # call nano to create health.d/btrfs.conf

# To silence this alarm, change sysadmin to silent.

to: silent

# reload

killall -USR2 netdata

Central Netdata server (streaming)

Netdata slaves stream metrics to upstream Netdata servers,

using exactly the same protocol that local plugins use.

statsd

Port: 8125/TCP, 8125/UDP

statsd is a system to collect data from any application.

Applications are sending metrics to it, usually via non-blocking UDP communication,

and statsd servers collect these metrics,

perform a few simple calculations on them and push them to backend time-series databases.

* Netdata is a fully featured statsd server.

* Netdata statsd is inside Netdata (an internal plugin, running inside the Netdata daemon)
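
As an illustration of how an application might push a metric to the statsd port, here is a minimal UDP client sketch. It uses the common statsd line format `name:value|type` (for example `g` for a gauge, `c` for a counter); the metric names and the helper functions are hypothetical, and the target host and port are assumed to be the local Netdata statsd listener.

```python
import socket


def format_statsd_metric(name, value, metric_type="g"):
    """Build one statsd datagram, e.g. b'myapp.temp:21.5|g'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd_metric(name, value, metric_type="g", host="127.0.0.1", port=8125):
    """Send a single metric over UDP without waiting for any reply."""
    payload = format_statsd_metric(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

For example, `send_statsd_metric("myapp.requests", 1, "c")` would increment a counter. Because UDP is fire-and-forget, the call does not report whether anything was listening.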

Disable statsd

[statsd]
  enabled = no

Database Queries

API entry point: /api/v1/data

i.e.

curl -Ss 'http://localhost:19999/api/v1/data?chart=system.cpu'

{
 "labels": ["time", "guest_nice", "guest", "steal", "softirq", "irq", "user", "system", "nice", "iowait"],
    "data":
 [
      [ 1707117060, 0, 0, 0, 0.359386, 0, 0.2032042, 1.1016693, 7.175965, 0.0151144],
      [ 1707117000, 0, 0, 0, 0.344185, 0, 0.1880425, 1.0241601, 6.838367, 0.0151106],
      [ 1707116940, 0, 0, 0, 0.3038085, 0, 0.191349, 0.9735301, 6.366551, 0.0117495],
      [ 1707116880, 0, 0, 0, 0.3712289, 0, 0.2166902, 1.2094336, 7.600954, 0.0100786],
      [ 1707116820, 0, 0, 0, 0.2635422, 0, 0.1846474, 0.9182012, 5.74757, 0.0016786],
      [ 1707116760, 0, 0, 0, 0.3324211, 0, 0.2115407, 1.0979971, 7.002669, 0.0184678],
      [ 1707116700, 0, 0, 0, 0.2955053, 0, 0.2115549, 1.0090834, 6.355043, 0.003358],
      [ 1707116640, 0, 0, 0, 0.2734991, 0, 0.1929595, 0.899359, 5.882748, 0.0050337],
      [ 1707116580, 0, 0, 0, 0.2802484, 0, 0.1762041, 0.9833865, 6.5464, 0.0050344],
      [ 1707116520, 0, 0, 0, 0.2601151, 0, 0.1879542, 0.8441155, 5.838326, 0.0067126]
  ]
}
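
A response like the one above pairs a `labels` list with rows of `data`. A small sketch of turning it into one dictionary per row follows; the sample string is a trimmed copy of the output above with only three of the columns, and the helper name is made up.

```python
import json

# trimmed sample: only the time, user and system columns are kept
SAMPLE = """{
  "labels": ["time", "user", "system"],
  "data": [
    [1707117060, 0.2032042, 1.1016693],
    [1707117000, 0.1880425, 1.0241601]
  ]
}"""


def rows_as_dicts(response_text):
    """Map every data row onto the label names."""
    doc = json.loads(response_text)
    return [dict(zip(doc["labels"], row)) for row in doc["data"]]


if __name__ == "__main__":
    for row in rows_as_dicts(SAMPLE):
        print(row["time"], row["user"], row["system"])
```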

curl -Ss 'http://localhost:19999/api/v1/data?chart=system.cpu&format=csv&after=-600&points=1&options=percentage'

format

json # Default
html
csv
...

after & before

define a time-frame, accepting

points

The number of points to be returned.

If not given, the result will have the same granularity as the database

group

The grouping method to use when reducing the points the database has.

If not given, it defaults to average.

options

Only 2 options are used by the query engine: unaligned and percentage.

All the other options are used by the output formatters.

The default is to return aligned data.
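
The query parameters described above can be assembled programmatically. This sketch only builds the URL and does not send a request; the base address is the localhost default used in these notes and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def build_data_url(chart, base="http://localhost:19999", **params):
    """Build an /api/v1/data URL; extra keyword arguments become query parameters."""
    query = urlencode({"chart": chart, **params})
    return f"{base}/api/v1/data?{query}"


if __name__ == "__main__":
    print(build_data_url("system.cpu", format="csv", after=-600,
                         points=1, options="percentage"))
```

The resulting string can be passed to `curl` or to `urllib.request.urlopen` on a host where Netdata is running.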

units

netdata has hard-coded units

disk I/O is in kilobytes/s
disk size is in GB
memory size is in MB
network bandwidth is in kilobits/s
temperatures are in Celsius

Export and Import a Snapshot

Snapshots can be incredibly useful for diagnosing anomalies after they've already happened.

Let's say Netdata triggered an alarm while you were sleeping.

The generated snapshot will include all charts of this dashboard, for the visible timeframe

To export a snapshot, click on the export icon.

The snapshot will be downloaded as a file, to your computer,

that can be imported back into any netdata dashboard (no need to import it back on this server).

Performance Tuning

CPU

[web]
    enable gzip compression = no

Disable logs

[logs]
    debug log = /opt/netdata/var/log/netdata/debug.log
    error log = /opt/netdata/var/log/netdata/error.log
    # access log = /opt/netdata/var/log/netdata/access.log
    access log = none

Disable plugins

[plugins]
    # User Groups / Users / Applications in the menu
    apps = no

[plugin:proc:/proc/stat]
    cpu utilization = yes
    per cpu core utilization = no
    core_throttle_count = no            # 1
    cpu frequency = no
    
[plugin:cgroups]
    enable systemd services = no

Notes

1) cpu throttling

CPU throttling refers to the process of dynamically adjusting the CPU frequency or performance based on the system's workload.

Netdata's Own Info

How many metrics, on average, do your Agents collect?

It is shown at the bottom-right corner of the dashboard

Every 5 seconds, Netdata collects 3,387 metrics on home, 
presents them in 654 charts and monitors them with 268 alarms.
 
netdata
v1.28.0

What is your compression savings ratio?

Search "dbengine_compression_ratio" on dashboard (Netdata Monitoring / dbengine)

Typical compression ratio of 80%

Disk Usage (1 day)

Assumptions

How often metrics are collected (update every) = 1
Number of metrics collected each time = 2000
Space used per metric = 4 bytes
Without compression
 => (2000 * 3600 * 24 * 4 / 1)/(1024^2) = 659 MiB
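
The estimate above generalises to other collection intervals. In this sketch the divisor in the formula is treated as `update every`, and compression is ignored as in the notes; the function name is made up.

```python
def daily_disk_mib(metrics, update_every=1, bytes_per_value=4):
    """Uncompressed disk usage for one day, in MiB."""
    samples_per_day = 3600 * 24 / update_every
    return metrics * samples_per_day * bytes_per_value / 1024 ** 2


if __name__ == "__main__":
    print(f"{daily_disk_mib(2000, update_every=1):.0f} MiB per day at 1 s")
    print(f"{daily_disk_mib(2000, update_every=5):.0f} MiB per day at 5 s")
```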

Monitoring Kernel Memory de-duplication performance

Netdata will create charts for kernel memory de-duplication - deduper (ksm)

mem.ksm,
mem.ksm_savings (savings/offered),
mem.ksm_ratios (%)

Config

[plugin:proc]
    /sys/kernel/mm/ksm = yes

[plugin:proc:/sys/kernel/mm/ksm]
    /sys/kernel/mm/ksm/pages_shared = /sys/kernel/mm/ksm/pages_shared
    /sys/kernel/mm/ksm/pages_sharing = /sys/kernel/mm/ksm/pages_sharing
    /sys/kernel/mm/ksm/pages_unshared = /sys/kernel/mm/ksm/pages_unshared
    /sys/kernel/mm/ksm/pages_volatile = /sys/kernel/mm/ksm/pages_volatile

Custom Page

# Where to fetch the data from

<script>
    // this section has to appear before loading dashboard.js
    var netdataTheme = 'slate'; // this is dark
    // the default is the server that dashboard.js is downloaded from.
    // var netdataServer = 'http://my.server:19999/';
</script>

# Settings

<script>
    // This has to be done, after dashboard.js is loaded
    
    // true(default) = “on focus”; false = “always”
    NETDATA.options.current.stop_updates_when_focus_is_lost = false;

    // lower the pressure on this browser
    // controls the number of concurrent data collection threads
    // Each thread is responsible for collecting data from a specific data source
    // (e.g., a plugin or a system metric).
    NETDATA.options.current.concurrent_refreshes = false;

    // if the tv browser is too slow (a pi?) set this to false
    // enables parallel data collection to improve performance
    NETDATA.options.current.parallel_refresher = true;
</script>

# Layout

<div style="width: 100%; text-align: center;">
    ...
</div>

Dygraph

Y-Axis for Dygraph

The min and max values of the y-axis using data-dygraph-valuerange="MIN, MAX"

V1.37

OS: Rocky 8 # el8

[db]
update every = 2
dbengine page cache size MB = 64
dbengine disk space MB = 256
mode = dbengine

To minimize resource utilization and should only be considered on Parent - Child setups
mode = dbengine is only for setups where Parent & Child are on the same machine (single node setup)

[db]
retention = 3600

"retention" controls the size of the database in memory (except for [db].mode = dbengine)

"[db].update every = 2" AND "[db].retention = 1800" => 1 hr data

Settings

[web]
    enable gzip compression = no
    disconnect idle clients after seconds = 3600
    bind to = 0.0.0.0
    # allow connections from = localhost *
    allow connections from = localhost 192.168.123.0/24

[ml]
    enabled = no

[health]
    enabled = no

[logs]
    # access = /var/log/netdata/access.log
    access = none

[plugins]
    debugfs = no

[plugin:proc]
    /proc/net/sockstat6 = no

[plugin:proc:/proc/net/snmp6]
    filename to monitor = none

[plugin:proc:/proc/net/dev:lo]
    enabled = no
