netdata

Last updated: 2019-08-13

Introduction

real-time performance and health monitoring solution

Designed to:

  • Solve the centralization problem of monitoring (Scales to infinity)
  • Replace the console for performance troubleshooting

HomePage: http://my-netdata.io

Features

  • 1s granularity
  • Zero disk I/O (by default all data is kept in RAM)

Port:

19999/tcp                  # Web

8125/TCP, 8125/UDP  # statsd

How it works:

           Query
             |
Collect -> Store -> Stream -> Archive
             |
           Check

positive / negative values

  • positive values: read, input, inbound, received
  • negative values: write, output, outbound, sent

monitoring agent

  • metrics collector
  • time-series database
  • metrics visualizer
  • alarms notification engine

Security Design

  • Netdata daemon runs as a normal system user
  • plugins perform a hard coded data collection job
  • communication between plugins / Netdata slaves and the Netdata daemon is unidirectional: from the plugin towards the Netdata daemon
  • dashboards are read-only
  • data do not leave the server where they are collected
  • Netdata servers do not talk to each other
  • your browser connects to all the Netdata servers


 


Installation

 

There are 4 ways to install Netdata:

  1. Binary Packages
  2. Linux 64bit pre-built static binary
  3. Run Netdata in a Docker container
  4. Install Netdata on Linux manually

1. Binary Packages

# Rocky8 (netdata v1.36.1)

dnf install netdata

systemctl enable netdata --now

Checking

netstat -ntlp | grep netdata

tcp        0      0 127.0.0.1:8125          0.0.0.0:*               LISTEN      22515/netdata
tcp        0      0 0.0.0.0:19999           0.0.0.0:*               LISTEN      22515/netdata

2. Static Binary

mkdir /usr/src/netdata

cd /usr/src/netdata

LINK=https://github.com/netdata/netdata/releases/download/v1.45.3/netdata-latest.gz.run

wget $LINK

chmod 700 ./netdata-latest.gz.run

./netdata-latest.gz.run

Remark

  • Installs into /opt/netdata
  • Creates user 'netdata' and group 'netdata'

 


Anonymous Statistics

 

Starting with v1.12 Netdata also collects anonymous statistics on certain events

rpm -ql netdata | grep statistics

/usr/libexec/netdata/plugins.d/anonymous-statistics.sh

Code

if [ -f "/etc/netdata/.opt-out-from-anonymous-statistics" ] ||
  ...
  exit 0
fi

To opt-out from sending anonymous statistics

touch /etc/netdata/.opt-out-from-anonymous-statistics

log

grep anonymous /var/log/netdata/error.log

... netdata INFO  : MAIN : /usr/libexec/netdata/plugins.d/anonymous-statistics.sh 'EXIT' 'OK' '-'

 


Upgrade

 

chmod 700 ./netdata-latest.gz.run

# netdata is stopped and started automatically during the upgrade

./netdata-latest.gz.run --accept

 


Usage

 

Web: http://your.server.ip:19999/

Zoom the current chart: SHIFT + mouse wheel over a chart

Highlight a time-frame: ALT + select an area on a chart

Auto-detection of data collection sources

This auto-detection process happens only once, when Netdata starts.

Exception: containers and VMs are auto-detected continuously

 


Configure

 

Get running config:

http://127.0.0.1:19999/netdata.conf

Config File Location:

/opt/netdata/etc/netdata/netdata.conf

Get the running config (all in one file)

wget -O netdata.conf http://localhost:19999/netdata.conf

RAM & CPU Usage

[global]
  # Enable KSM to halve Netdata's memory requirement
  history = 3600
  update every = 1
  process scheduling policy = idle

Notes

update every          Data collection frequency in seconds. Default: 1

Memory modes (DB)

  • map        # data are in memory mapped files (swap)
  • dbengine   # the default
  • ram        # data are purely in memory. Data are never saved on disk.
  • save       # data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.
  • none       # without a database (collected metrics can only be streamed to another Netdata)

[global]
  memory mode = dbengine

map

Data are in memory mapped files. This works like the swap.
(constant write on your disk, does not support KSM)

For each chart, Netdata maps the following files:

  • chart/main.db                    # chart information.
                                            # Every time data are collected for a chart, this is updated.
  • chart/dimension_name.db  # round robin database

dbengine

The data are in database files.

Files

# Debian: /var/cache/netdata/dbengine

ls /opt/netdata/var/cache/netdata/dbengine

datafile-1-0000000001.ndf
journalfile-1-0000000001.njf
datafile-1-0000000002.ndf
journalfile-1-0000000002.njf    <- the higher the number, the newer the file
...

Size of each datafile is determined automatically by Netdata. (4MB ~ 512MB)

Netdata will decide a datafile size trying to maintain about 50 datafiles for the whole database

----

njf = journal file v1

holds information about the transactions in its datafile (4KB)

----

There is some amount of RAM dedicated to data caching and indexing

# Unit: MiB
page cache size = 32

The number of history entries is not fixed (depends on the configured disk space)

"dbengine" is the only mode that supports changing "update_every" without losing the previously stored metrics

"history" configuration option is meaningless for "memory mode = dbengine"

Recommended for nodes that also run other applications

Database Engine uses direct I/O to avoid polluting the OS filesystem caches

# Unit: MiB
dbengine disk space = 256

----

The DB engine stores chart metric values in 4k pages in memory.

Each chart dimension gets its own page to store consecutive values generated from the data collectors.

When those pages fill up they are slowly compressed and flushed to disk.

 => i.e. a page is flushed roughly every 17 minutes

    # Each chart dimension has a 4 KB page and each record takes 4 bytes, so when collecting once per second the page is full after 1024 seconds

    4096 / 4 = 1024 sec (dimension: 1s)
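The same arithmetic can be repeated for other collection intervals. This is a minimal sketch that assumes the 4 KiB page size and 4-byte value size stated in these notes; the function name is made up for illustration.

```python
PAGE_SIZE_BYTES = 4096   # one page per chart dimension (figure taken from the notes)
VALUE_SIZE_BYTES = 4     # size of one collected value (figure taken from the notes)


def seconds_until_page_full(update_every: int = 1) -> int:
    """Seconds of collected data that fit in one page before it is flushed."""
    values_per_page = PAGE_SIZE_BYTES // VALUE_SIZE_BYTES
    return values_per_page * update_every


print(seconds_until_page_full(1))        # 1024 seconds
print(seconds_until_page_full(1) / 60)   # a little over 17 minutes
```

With "update every = 5" the same page would take five times longer to fill.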

When the disk quota is exceeded the oldest values are removed from the DB engine in real time

 * When we query the DB engine for data

    => trigger disk read I/O requests that fill the Page Cache with the requested pages


----

Config

/opt/netdata/etc/netdata/netdata.conf

[global]
    memory mode = dbengine
    # Unit: MiB
    page cache size = 32
    dbengine disk space = 256

 * There is one DB engine instance per Netdata host/node

 * All DB engine instances, for localhost and all other streaming recipient nodes inherit their configuration from netdata.conf

 * There are explicit memory requirements per DB engine instance

File descriptor

The Database Engine may keep a significant amount of files open per instance

(at least 50 file descriptors available per dbengine instance)

systemctl edit netdata

[Service]
LimitNOFILE=65536

Remark: /etc/sysctl.conf: "fs.file-max = 65536"

ram

data are purely in memory. Data are never saved on disk. (Supports KSM)

save (the default in older versions, before dbengine)

Data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart.

It also uses mmap() and supports KSM.

 

Performance

KSM

Disable data collection

disable data collection plugins that you don't need => Save both CPU and RAM

e.g.

Disable IPv6 Metrics

Afterwards the number of metrics and charts shown in the bottom-right corner of the web panel becomes smaller:

Every second, Netdata collects 1,560 metrics on SERVER, presents them in 320 charts ...

OOM Score

[global]
    OOM score = 1000

A higher score means Netdata will be the first to be killed when your server runs out of memory.

Checking

cat /proc/$(pidof netdata)/oom_score

Scheduling Policy

[global]
  process scheduling policy = idle

By default Netdata runs with the idle process scheduling policy,

so that it uses CPU resources, only when there is idle CPU to spare.

 


Database

 

Ram Usage (for DB)

The default history is 3600 entries.

If the data collection frequency is set to 1 second, you will have just one hour of data.

[global]
    history = 3600

It will need 14.4KB for each chart dimension (4 bytes for the value * the entries of its history(3600))

If you need 1000 dimensions, they will occupy just 14.4MB.
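The figures above follow from a single multiplication. The sketch below reproduces them, assuming the 4 bytes per value quoted in these notes; the function name is hypothetical.

```python
BYTES_PER_VALUE = 4  # per collected value, as stated in the notes


def history_ram_bytes(dimensions: int, history: int = 3600) -> int:
    """RAM needed by the round robin database for the given number of dimensions."""
    return dimensions * history * BYTES_PER_VALUE


print(history_ram_bytes(1) / 1000)          # KB for one dimension
print(history_ram_bytes(1000) / 1_000_000)  # MB for 1000 dimensions
```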

KSM

Netdata offers all its round robin database to kernel for deduplication

KSM is a solution that will provide 60+% memory savings to Netdata.

# by default 0; 1 for the kernel to spawn ksmd

echo 1 > /sys/kernel/mm/ksm/run

Tiers

storage tiers = 3
update every = 1
dbengine tier 1 update every iterations = 60
dbengine tier 2 update every iterations = 60

i.e.

If a metric is collected per second(update every) in Tier 0,
then we will have a data point every minute in tier 1 and every hour in tier 2
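A small sketch of how the per-tier granularity follows from the settings above: each tier multiplies the granularity of the previous tier by its "update every iterations" value. The function name is an illustration only, not part of Netdata.

```python
def tier_granularity_seconds(update_every: int, iterations: list[int]) -> list[int]:
    """Seconds between data points for tier 0, tier 1, tier 2, ..."""
    granularity = [update_every]
    for n in iterations:
        # every higher tier stores one point per n points of the tier below it
        granularity.append(granularity[-1] * n)
    return granularity


print(tier_granularity_seconds(1, [60, 60]))  # [1, 60, 3600]
```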

Retention

The general rule is that Netdata needs about 1 byte per data point on disk for tier 0,
 and 4 bytes per data point on disk for tier 1 and above.

* dbengine disk space MB (deprecated)

dbengine multihost disk space MB = 256
dbengine tier 1 multihost disk space MB = 128
dbengine tier 2 multihost disk space MB = 64

cache = /var/cache/netdata

  • dbengine/
  • dbengine-tier1/
  • dbengine-tier2/

With 1000 metrics collected per second, these settings retain roughly:

  • tier 0: 3 days
  • tier 1: 22 days
  • tier 2: 2 years
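The retention figures can be approximated from the disk space per tier, the bytes per data point, and the tier granularity. The sketch below is a rough estimate under stated assumptions: 1 MB is taken as 10^6 bytes, tier 1 is assumed to have 128 MB (the value that reproduces the 22-day figure), and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86400


def retention_days(disk_mb: float, bytes_per_point: int,
                   metrics: int, seconds_per_point: int) -> float:
    """Approximate days of data that fit in the given disk space."""
    points_total = disk_mb * 1_000_000 / bytes_per_point   # assumes 1 MB = 10^6 bytes
    points_per_metric = points_total / metrics
    return points_per_metric * seconds_per_point / SECONDS_PER_DAY


tier0 = retention_days(256, 1, 1000, 1)      # 1 byte per point, one point per second
tier1 = retention_days(128, 4, 1000, 60)     # assumed 128 MB, one point per minute
tier2 = retention_days(64, 4, 1000, 3600)    # one point per hour
print(round(tier0, 1), round(tier1, 1), round(tier2 / 365, 1))
```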

Cache

dbengine page cache size MB = 32
dbengine tier 1 page cache size MB = 16
dbengine tier 2 page cache size MB = 8

Memory for concurrently collected metrics

DBENGINE memory in KiB

    METRICS x (TIERS - 1) x 4KiB x 2 + "dbengine page cache size MB"

* TIERS: 3 by default (subtract 1 when using 3 or more tiers)

i.e.

# 3 storage tiers & 2k metrics

2000 x (3 - 1) x 4 KiB x 2 / 1024 ~ 31 MiB

plus dbengine page cache size MB = 32 MiB (Default) => ~ 63 MiB

Total Netdata memory in MiB = "Metric cardinality factor" x "DBENGINE memory" + "dbengine page cache"

The cardinality factor is usually between 3 and 4 and depends mainly on the ephemerality of the collected metrics.
The more ephemeral the infrastructure, the higher the factor.
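The DBENGINE memory formula can be written as a small function. This is only a sketch of the formula as noted above: it applies the "TIERS - 1" rule when 3 or more tiers are used, and assumes the tier count is used unchanged otherwise (that second case is an assumption, the notes do not spell it out). The function name is hypothetical.

```python
PAGE_KIB = 4  # page size in KiB, from the formula in the notes


def dbengine_memory_mib(metrics: int, tiers: int = 3, page_cache_mib: int = 32) -> float:
    """METRICS x (TIERS - 1) x 4KiB x 2, converted to MiB, plus the page cache."""
    effective_tiers = tiers - 1 if tiers >= 3 else tiers  # assumption for fewer than 3 tiers
    pages_kib = metrics * effective_tiers * PAGE_KIB * 2
    return pages_kib / 1024 + page_cache_mib


print(dbengine_memory_mib(2000))  # 2000 metrics, 3 tiers, 32 MiB page cache
```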

 


nginx

 

IP Level ACL

 * best and the suggested way to protect Netdata

   => Expose Netdata only in a private LAN => IP Level

[web]
    bind to = 10.1.1.1:19999 localhost:19999

username & password

Use web server to provide authentication (in front of all your Netdata servers)

Web Server Setting (nginx)

Nginx to forward requests to netdata

HTTP auth file: /etc/nginx/netdata.users

URL: https://your-server/status/

# Running netdata as a subfolder to an existing virtual host

server {
    ...
    include /etc/nginx/templates/netdata.tmpl;
}

netdata.tmpl

location = /status {
    return 301 /status/;
}

location ~ /status/(?<ndpath>.*) {
    proxy_redirect off;
    proxy_set_header Host $host;

    proxy_set_header X-Forwarded-Host $host;
    proxy_set_header X-Forwarded-Server $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_http_version 1.1;
    proxy_pass_request_headers on;
    proxy_set_header Connection "keep-alive";
    proxy_store off;
    proxy_pass http://127.0.0.1:19999/$ndpath$is_args$args;

    gzip on;
    gzip_proxied any;
    gzip_types *;

    auth_basic "Authentication Required";
    auth_basic_user_file /etc/nginx/netdata.users;
}

 


Apache Reverse Proxy Settings

 

Required modules: mod_proxy & mod_proxy_http (plus mod_rewrite for the RewriteRule below)

vhosts.conf

<VirtualHost *:80>

    ProxyRequests Off
    ProxyPreserveHost On

    <Proxy *>
        Require all granted
    </Proxy>

    ProxyPass "/netdata/" "http://localhost:19999/" \
               connectiontimeout=5 timeout=30 keepalive=on
    ProxyPassReverse "/netdata/" "http://localhost:19999/"

    # if the user did not give the trailing /
    RewriteEngine On
    RewriteRule ^/netdata$ http://%{HTTP_HOST}/netdata/ [L,R=301]

    # add a <Location /netdata/> section
    <Location /netdata/>
        AuthType Basic
        AuthName "Protected site"
        AuthUserFile htpasswd
        Require valid-user
        Require all denied
    </Location>
    
</VirtualHost>

 

 


Netdata registry

 

registry = node menu  (on top left corner of the Netdata dashboards)

Purpose:

  • enables the Netdata cloud features, such as the node view
  • multiple Netdata are integrated into one distributed application (distributed monitoring)

The registry keeps track of 4 entities:

  • machine_guid: a random GUID generated by each Netdata  (first time it starts)
  • person_guid: the web browsers accessing the Netdata installations (first time it sees a new web browser)
  • URLs of Netdata installations
  • accounts: i.e. the information used to sign-in via one of the available sign-in methods.

Default registry: https://registry.my-netdata.io

Who talks to the registry?

Your web browser only!

Flow

Browser --> netdata

 <-URL to Registry-

Browser --> Registry

Run your own registry

# Server (registry)

* Every Netdata can be a registry

[registry]
    enabled = yes
    registry to announce = http://your.registry:19999
    allow from = 192.168.123.*

Remark

(1) The registry keeps its database in /var/lib/netdata/registry/*.db

  • registry-log.db, the transaction log
  • registry.db, the database

(2) IPs allowed by [registry].allow from should also be allowed by [web].allow connection from.

# Client (netdata)

Advertise it to registry

[registry]
    enabled = no
    registry to announce = http://your.registry:19999

 

 


Web Setting

 

Disable Web Dashboard

[web]
    mode = none

Threads

# The default number of processor threads is min(cpu cores, 6)

[web]
  web server threads = 4
  web server max sockets = 512

Access lists

Netdata supports access lists in netdata.conf:

[web]
    allow connections from = localhost *
    allow dashboard from = localhost *
    allow badges from = *
    allow streaming from = *
    allow netdata.conf from = localhost
    allow management from = localhost

Notes

allow badges from

checks if the API request is for a badge. Badges are not matched by allow dashboard from.

allow netdata.conf from

checks the IP to allow http://netdata.host:19999/netdata.conf

IPs allowed by allow netdata.conf from should also be allowed by allow connections from

 


Plugins

 

Internal, External, Modular plugins

  • Internal data collection plugins (running inside the Netdata daemon)
  • External data collection plugins (independent processes, sending data to Netdata over pipes)
  • Modular plugin orchestrators (external plugins that have multiple data collection modules)

netdata.conf

Disable a plug-in

The orchestrators correspond to folders in the config directory: node.d, python.d, ...

[plugins]
    proc = yes
    diskspace = yes
    ...
    node.d = yes

Per plug-in setting

[plugin:python.d]
        # update every = 5

 


Disable IPv6 Metrics

 

mv netdata.conf netdata.conf.bak

wget -O netdata.conf http://localhost:19999/netdata.conf

[plugin:proc]
  /proc/net/sockstat6 = no
  /proc/net/snmp6 = no

systemctl restart netdata

If that has no effect, also add:

[plugin:proc:/proc/net/snmp6]
        filename to monitor = none

 


Disable monitoring of specific NICs

 

[plugin:proc:/proc/net/dev:lxcbr0]
        enabled = no

 


"diskspace" Settings

 

Disabling performance metrics

# For individual device

[plugin:proc:/proc/diskstats:sda]
    enable performance metrics = no

# Disable By path / filesystem

 * disable data collection plugins that you don't need => Save both CPU and RAM

[plugin:proc:diskspace]
    exclude space metrics on paths = /tmp /dev/* /run/* /var/*
    exclude space metrics on filesystems = *sshfs fusectl autofs

 


Health monitoring

 

# Enable monitoring

/opt/netdata/etc/netdata/netdata.conf

[health]
    enabled = yes

/opt/netdata/etc/netdata/edit-config health_alarm_notify.conf

SEND_EMAIL="YES"

 * Default alarms shipped with Netdata.

Alarm to multiple mailboxes

health_alarm_notify.conf

# to receive only critical alarms, set it to "root|critical"
# recipients used when an alarm has no "to:" line
DEFAULT_RECIPIENT_EMAIL="[email protected] [email protected]|critical"

...

# recipients used for alarms with "to: sysadmin"
role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}"

Testing Notifications

# become user netdata

su -s /bin/bash netdata

# enable debugging info on the console

export NETDATA_ALARM_NOTIFY_DEBUG=1

# send test alarms to sysadmin

/opt/netdata/usr/libexec/netdata/plugins.d/alarm-notify.sh test

...
--- BEGIN sendmail command ---
/usr/sbin/sendmail -t
--- END sendmail command ---
2021-01-07 18:23:17: alarm-notify.sh: 
 INFO: sent email notification for: 
  hypervisor.datahunter.org test.chart.test_alarm is CLEAR to '[email protected]'
# OK

Stop notifications for individual alarms (silencing the alarm)

Step1: Find the alarm configuration file

ie.

/opt/netdata/usr/lib/netdata/conf.d/health.d/net.conf

Step2: Edit the file to enable silencing

to: sysadmin

change it to

to: silent

Example

# NIC Full Loading

    alarm: 5m_sent_traffic_overflow
       on: net.ens192
       os: linux
    hosts: *
 families: *
   #lookup: average -5m unaligned absolute of received
   lookup: average -5m unaligned absolute of sent
     calc: ($interface_speed > 0) ? ($this * 100 / (100 * 1000)) : ( nan )
    units: %
    every: 60s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (90))
    delay: down 1m multiplier 1.5 max 1h
     info: interface sent bandwidth usage over net device speed max
       to: sysadmin

Content of the alarm email

[ $this = 94.079845 ] [ $status = 1 ] [ $CRITICAL = 4 ]

# CPU Usage

template: 10min_cpu_usage
      on: system.cpu
      os: linux
   hosts: *
  lookup: average -10m unaligned of user,system,softirq,irq,guest
   units: %
   every: 1m
    warn: $this > (($status >= $WARNING)  ? (75) : (85))
    crit: $this > (($status == $CRITICAL) ? (85) : (95))
   delay: down 15m multiplier 1.5 max 1h
    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
      to: sysadmin

# Using a variable

 template: 1m_received_traffic_overflow
       on: net.net
       os: linux
    hosts: *
 families: *
   lookup: average -1m unaligned absolute of received
     # $interface_speed is defined by an earlier template
     calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan )
    units: %
    every: 10s
     warn: $this > (($status >= $WARNING)  ? (80) : (85))
     crit: $this > (($status == $CRITICAL) ? (85) : (90))
    delay: down 1m multiplier 1.5 max 1h
     info: interface received bandwidth usage over net device speed max
       to: sysadmin

alarm vs template

Alarms

Alarms are attached to a specific chart (e.g. net.eth0) and use the alarm label.

Alarms have higher precedence and will override templates.

If an alarm and template entity have the same name and attach to the same chart, Netdata will use the alarm.

Need to find the context? Hover over the date on any given chart and look at the tooltip.

Templates

Templates define rules that apply to all charts of a specific context (e.g. net.net) and use the template label.

Templates help you apply one entity to all disks, all network interfaces, all MySQL databases, and so on.

Explanation

on:

Which chart the entity listens to

lookup:

This line makes a database lookup to find a value. The result of this lookup is available as $this.

lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS]

METHOD

  one of average, min, max, sum, incremental-sum

average:

    Calculate the average of all the metrics collected.

percentage:

    Clarify that we're calculating a percentage of RAM usage.

    of used: Specify which dimension (used) on the system.ram chart you want to monitor with this entity.

AFTER

a relative number of seconds, but it also accepts a single letter for changing the units,

like -1s = 1 second in the past, -1m = 1 minute in the past, -1h = 1 hour in the past

OPTIONS

space separated list of percentage, absolute, min2max, unaligned, match-ids, match-names

i.e.

lookup: average -10m unaligned of user,system,softirq,irq,guest

units:

The units of the value returned by "calc:"

every:

How often to perform the lookup calculation to decide whether or not to trigger this alarm.

warn/crit:

The value at which Netdata should trigger a warning or critical alarm.

warn: EXPRESSION

i.e.

  warn: $this > 80
  crit: $this >= 90

conditional evaluation operator "?"

The conditional evaluation operator ? is supported too.

Using this operator IF-THEN-ELSE conditional statements can be specified.

The format is: (condition) ? (true expression) : (false expression).

hysteresis(":")

warn: $this > (($status >= $WARNING)  ? (75) : (85))

it will trigger a warning the first time it goes above 85,
but will remain a warning until it goes below 75 (or goes above 85).
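The hysteresis behaviour of the warn expression can be traced with a short simulation. This is a sketch only: the status constants are illustrative stand-ins whose ordering matters, not Netdata's real values, and the function name is made up.

```python
CLEAR, WARNING = 0, 1  # illustrative values; only CLEAR < WARNING matters here


def warn_status(value: float, status: int) -> int:
    """Evaluate: $this > (($status >= $WARNING) ? (75) : (85))."""
    threshold = 75 if status >= WARNING else 85
    return WARNING if value > threshold else CLEAR


status = CLEAR
trace = []
for value in [70, 80, 86, 80, 76, 74, 80]:
    status = warn_status(value, status)
    trace.append(status)

print(trace)  # [0, 0, 1, 1, 1, 0, 0]
```

The value 80 does not raise a warning on the way up, but keeps an existing warning active on the way down.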

percentage

instead of returning the value, calculate the percentage of the sum of the selected dimensions,
versus the sum of all the dimensions of the chart. This also sets the units to %.

absolute or abs, turn all values positive and then sum them.

min2max, when multiple dimensions are given, do not sum them, but take their max - min
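To make the three options concrete, the sketch below applies each of them to a list of dimension values, following the descriptions above. It is a simplified model, not Netdata's query engine, and the function and parameter names are hypothetical.

```python
def reduce_dimensions(selected, all_values=None, option=None):
    """Combine the selected dimension values according to a lookup option."""
    if option == "percentage":
        # share of the selected dimensions versus all dimensions of the chart
        return sum(selected) * 100 / sum(all_values)
    if option == "absolute":
        # turn all values positive, then sum them
        return sum(abs(v) for v in selected)
    if option == "min2max":
        # do not sum, take max - min
        return max(selected) - min(selected)
    return sum(selected)


print(reduce_dimensions([30, 20], [30, 20, 50], "percentage"))  # 50.0
print(reduce_dimensions([-3, 4], option="absolute"))            # 7
print(reduce_dimensions([2, 9, 5], option="min2max"))           # 7
```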

special variables

$this, which is resolved to the value of the current alarm.

$status, which is resolved to the current status of the alarm

  These values can be compared with

  $REMOVED, $UNINITIALIZED, $UNDEFINED, $CLEAR, $WARNING, $CRITICAL.

  These values are incremental, ie. $status > $CLEAR works as expected.

$now, which is resolved to current unix timestamp.

unaligned

when data are reduced / aggregated (e.g. the request is about the average of the last minute, or hour),
Netdata by default aligns them so that the charts will have a constant shape
(so average per minute returns always XX:XX:00 - XX:XX:59).
Setting the unaligned option, Netdata will aggregate data without any alignment,
so if the request is for 60 seconds, it will aggregate the latest 60 seconds of collected data.
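The difference between aligned and unaligned windows can be sketched with plain timestamps. This is a simplified model that assumes alignment means ending the window on the last multiple of the duration; the real query engine may differ in details, and the function name is hypothetical.

```python
def lookup_window(now: int, duration: int, unaligned: bool) -> tuple[int, int]:
    """Return (start, end) of the time window, in seconds, that gets aggregated."""
    if unaligned:
        end = now                    # the latest `duration` seconds of data
    else:
        end = now - now % duration   # snap to the last boundary, e.g. XX:XX:00
    return end - duration, end


print(lookup_window(125, 60, unaligned=True))   # (65, 125)
print(lookup_window(125, 60, unaligned=False))  # (60, 120)
```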

calc:

A calculation to apply to the value found via lookup or another variable.

green/red:

Set the green and red thresholds of a chart.

Both are available as $green and $red in expressions.

These will eventually be visualized on the dashboard.

exec:

The script to execute when the alarm changes status.

repeat:

Format: repeat: [off] [warning DURATION] [critical DURATION]

The interval for sending notifications when an alarm is in WARNING or CRITICAL mode.
This will override the default interval settings inherited from health settings in netdata.conf
(default repeat warning = DURATION and default repeat critical = DURATION)
Use 0s to turn off the repeating notification for WARNING / CRITICAL mode.

ie

repeat: warning 600s critical 600s

delay:

delay: [[[up U] [down D] multiplier M] max X]

up U

defines the delay to be applied to a notification for an alarm that raised its status (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, up 10s, the notification for this event will be sent 10 seconds after the actual event. This is used in hope the alarm will get back to its previous state within the duration given. The default U is zero.

multiplier M

multiplies U and D every time an alarm changes state while a notification is delayed.

The default multiplier is 1.0.

ie

delay: down 15m multiplier 1.5 max 1h
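How the example delay grows can be sketched as follows. This is a simplified reading of the rule, assuming the delay is multiplied on each state change while a notification is pending and that "max X" caps the result; the function name is made up.

```python
def notification_delays(base_seconds, multiplier, max_seconds, state_changes):
    """Delay applied at each successive state change while a notification is pending."""
    delays = []
    delay = float(base_seconds)
    for _ in range(state_changes):
        delays.append(min(delay, max_seconds))  # assumed: max X is an upper bound
        delay *= multiplier
    return delays


# delay: down 15m multiplier 1.5 max 1h
for seconds in notification_delays(15 * 60, 1.5, 60 * 60, 6):
    print(seconds / 60, "minutes")
```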

info:

A description of the alarm, which will appear in the dashboard and notifications.

Reload health configuration

To make any changes to your health configuration live, you must reload Netdata's health monitoring system.

To do that without restarting all of Netdata, run the following:

killall -USR2 netdata

OR

netdatacli reload-health
 


netdatacli

 

netdatacli help

reload-health

ping

 


Stopping notifications for individual alarms

 

Setup

cd /opt/netdata/etc/netdata

./edit-config health.d/btrfs.conf        # call nano to create  health.d/btrfs.conf

# To silence this alarm, change sysadmin to silent.

to: silent

# reload

killall -USR2 netdata

 


Central Netdata server (streaming)

 

Netdata slaves stream metrics to upstream Netdata servers,

using exactly the same protocol that local plugins use.

 

statsd

Port: 8125/TCP, 8125/UDP

statsd is a system to collect data from any application.

Applications are sending metrics to it, usually via non-blocking UDP communication,

and statsd servers collect these metrics,

perform a few simple calculations on them and push them to backend time-series databases.

 * Netdata is a fully featured statsd server.

 * Netdata statsd is inside Netdata (an internal plugin, running inside the Netdata daemon)
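An application can push a metric to the statsd port with one UDP datagram. The sketch below uses the common statsd line format "name:value|type" (g = gauge); the metric name and helper names are hypothetical, and the host and port are the defaults from these notes.

```python
import socket


def format_statsd_metric(name: str, value, metric_type: str = "g") -> bytes:
    """Build one statsd line, e.g. b'myapp.queue_length:42|g'."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_statsd_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; nothing is reported back if no server is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


print(format_statsd_metric("myapp.queue_length", 42))
```

Calling send_statsd_metric(format_statsd_metric("myapp.queue_length", 42)) on a host running Netdata should make the gauge show up under the statsd charts.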

Disable statsd

[statsd]
  enabled = no

 


Database Queries

 

API endpoint: /api/v1/data

i.e.

curl -Ss 'http://localhost:19999/api/v1/data?chart=system.cpu'

{
 "labels": ["time", "guest_nice", "guest", "steal", "softirq", "irq", "user", "system", "nice", "iowait"],
    "data":
 [
      [ 1707117060, 0, 0, 0, 0.359386, 0, 0.2032042, 1.1016693, 7.175965, 0.0151144],
      [ 1707117000, 0, 0, 0, 0.344185, 0, 0.1880425, 1.0241601, 6.838367, 0.0151106],
      [ 1707116940, 0, 0, 0, 0.3038085, 0, 0.191349, 0.9735301, 6.366551, 0.0117495],
      [ 1707116880, 0, 0, 0, 0.3712289, 0, 0.2166902, 1.2094336, 7.600954, 0.0100786],
      [ 1707116820, 0, 0, 0, 0.2635422, 0, 0.1846474, 0.9182012, 5.74757, 0.0016786],
      [ 1707116760, 0, 0, 0, 0.3324211, 0, 0.2115407, 1.0979971, 7.002669, 0.0184678],
      [ 1707116700, 0, 0, 0, 0.2955053, 0, 0.2115549, 1.0090834, 6.355043, 0.003358],
      [ 1707116640, 0, 0, 0, 0.2734991, 0, 0.1929595, 0.899359, 5.882748, 0.0050337],
      [ 1707116580, 0, 0, 0, 0.2802484, 0, 0.1762041, 0.9833865, 6.5464, 0.0050344],
      [ 1707116520, 0, 0, 0, 0.2601151, 0, 0.1879542, 0.8441155, 5.838326, 0.0067126]
  ]
}

curl -Ss 'http://localhost:19999/api/v1/data?chart=system.cpu&format=csv&after=-600&points=1&options=percentage'

format

  • json    # Default
  • html
  • csv
  • ...

after & before

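Define the time-frame of the query. Both accept absolute Unix timestamps or relative (negative) numbers of seconds, e.g. after=-600 for the last 10 minutes.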

points

The number of points to be returned.

If not given, the result will have the same granularity as the database

group

The grouping method to use when reducing the points the database has.

If not given, it defaults to average.

options

Only 2 options are used by the query engine: unaligned and percentage.

All the other options are used by the output formatters.

The default is to return aligned data.
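The query parameters above can be assembled programmatically, and the default JSON answer turned into one dictionary per row. This is a minimal sketch using only the standard library; the helper names are hypothetical and the base URL is the localhost address used in these notes.

```python
from urllib.parse import urlencode


def build_data_url(chart: str, base: str = "http://localhost:19999", **params) -> str:
    """Build an /api/v1/data URL from a chart id and optional query parameters."""
    query = urlencode({"chart": chart, **params})
    return f"{base}/api/v1/data?{query}"


def rows_as_dicts(response: dict) -> list[dict]:
    """Pair every data row of a JSON response with the 'labels' list."""
    labels = response["labels"]
    return [dict(zip(labels, row)) for row in response["data"]]


url = build_data_url("system.cpu", format="csv", after=-600, points=1, options="percentage")
print(url)

sample = {"labels": ["time", "user", "system"], "data": [[1707117060, 0.2, 1.1]]}
print(rows_as_dicts(sample))
```

The URL can then be fetched with curl or urllib.request.urlopen against a running agent.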

units

netdata has hard-coded units

  • disk I/O is in kilobytes/s
  • disk size is in GB
  • memory size is in MB
  • network bandwidth is in kilobits/s
  • temperatures are in Celsius

 


Export and import a snapshot

 

Snapshots can be incredibly useful for diagnosing anomalies after they've already happened.

Let's say Netdata triggered an alarm while you were sleeping.

The generated snapshot will include all charts of this dashboard, for the visible timeframe

To export a snapshot, click on the export icon.

The snapshot will be downloaded as a file, to your computer,

that can be imported back into any netdata dashboard (no need to import it back on this server).

 


Performance Tuning

 

# CPU

[web]
    enable gzip compression = no

# Disable logs

If you're not actively auditing Netdata's logs, disable them in netdata.conf

[global]
    # debug log = /opt/netdata/var/log/netdata/debug.log
    # error log = /opt/netdata/var/log/netdata/error.log
    # access log = /opt/netdata/var/log/netdata/access.log
    debug log = none
    error log = none
    access log = none

 


Netdata's own info

 

How many metrics, on average, do your Agents collect?

Shown in the bottom-right corner of the dashboard:

Every 5 seconds, Netdata collects 3,387 metrics on home, 
presents them in 654 charts and monitors them with 268 alarms.
 
netdata
v1.28.0

What is your compression savings ratio?

Search "dbengine_compression_ratio" on dashboard (Netdata Monitoring / dbengine)

Typical compression ratio of 80%

Disk Usage (1 day)

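Assumptions

Collection interval (update every) = 1
Metrics collected each time = 2000
Space per metric value = 4 bytes
Without compression
 => (2000 * 3600 * 24 * 4 / 1)/(1024^2) = 659 MB

With update every = 5 the divisor becomes 5, i.e. about 132 MB per day.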
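The daily estimate can be parametrised by the collection interval. The sketch below assumes 4 bytes per value and no compression, as in the calculation above; the function name is hypothetical.

```python
SECONDS_PER_DAY = 3600 * 24


def uncompressed_mib_per_day(metrics: int, update_every: int = 1,
                             bytes_per_value: int = 4) -> float:
    """Disk usage for one day of data, before compression, in MiB (1024^2 bytes)."""
    samples_per_day = SECONDS_PER_DAY / update_every
    return metrics * samples_per_day * bytes_per_value / 1024 ** 2


print(round(uncompressed_mib_per_day(2000, update_every=1)))  # 659
print(round(uncompressed_mib_per_day(2000, update_every=5)))  # 132
```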

 


Monitoring Kernel Memory de-duplication performance

 

Netdata will create charts for kernel memory de-duplication - deduper (ksm)

  • mem.ksm,
  • mem.ksm_savings (savings/offered),
  • mem.ksm_ratios (%)

Config

[plugin:proc]
    /sys/kernel/mm/ksm = yes

[plugin:proc:/sys/kernel/mm/ksm]
    /sys/kernel/mm/ksm/pages_shared = /sys/kernel/mm/ksm/pages_shared
    /sys/kernel/mm/ksm/pages_sharing = /sys/kernel/mm/ksm/pages_sharing
    /sys/kernel/mm/ksm/pages_unshared = /sys/kernel/mm/ksm/pages_unshared
    /sys/kernel/mm/ksm/pages_volatile = /sys/kernel/mm/ksm/pages_volatile

 


V1.37

 

OS: Rocky 8 # el8

[db]
mode = dbengine

Modes other than dbengine minimize resource utilization and should only be considered on Parent - Child setups.
Use mode = dbengine when Parent & Child are on the same machine (single node setup).

[db]
retention = 3600

"retention" controls the size of the database in memory (except for [db].mode = dbengine)

"[db].update every = 2" AND "[db].retention = 1800" => 1 hr data

Settings

[web]
    enable gzip compression = no
    disconnect idle clients after seconds = 3600

[ml]
   enabled = no

[logs]
    # access = /var/log/netdata/access.log
    access = none
    
[plugin:proc]
    /proc/net/sockstat6 = no

[plugin:proc:/proc/net/snmp6]
    filename to monitor = none

[plugin:proc:/proc/net/dev:lo]
    enabled = no