awk

Last updated: 2022-03-21

Introduction

pattern scanning and processing language


 


Contents of test.txt

 

# Name     Ages     Salary

Beth    18    9000
Dan     23
Kathy   32    6400
Mark    28    7030
Mary    25

 


Basic Usage

 

awk 'Program' input-file1 input-file2 ...

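Program format

  • '{action}'           # no filter: the action runs on every line
  • 'filter {action}'    # the action runs only on lines that pass the filter
  • 'filter'             # no action: lines that pass the filter are printed

filter

/regex/          # same as $0 ~ /regex/
$2 !~ /regex/    # $2 does not match regex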

print

# The values of $2 and $4 are joined together (no space in between)

awk '$2 !~ /th/ {print $2 $4}' log.txt        

# With a comma (","), the values are separated by OFS (a single space by default)

awk '$2 !~ /th/ {print $2, $4}' log.txt
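To see the two print forms side by side, here is a small sketch. The sample line and the "-" value for OFS are made up for illustration.

```shell
line='a b c d'

# Juxtaposition concatenates the fields with nothing in between.
joined=$(echo "$line" | awk '{ print $2 $4 }')

# A comma inserts the output field separator OFS (one space by default).
spaced=$(echo "$line" | awk '{ print $2, $4 }')

# OFS can be changed, for example in a BEGIN block.
dashed=$(echo "$line" | awk 'BEGIN { OFS = "-" } { print $2, $4 }')

printf '%s\n' "$joined" "$spaced" "$dashed"
```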

new line

Each print statement makes at least one line of output.

awk 'BEGIN { print "line one\nline two\nline three" }'

 


Multi-line program

 

awk '/This program is too long, so continue it\
 on the next line/ { print $1 }'

 

Run program(-f)

# -f program-file      # Read the AWK program source from the file program-file

prog.txt

{    # the braces "{}" are still required
  print NF, $1
  ... 
}

NF: The number of fields in the current input record.

awk -f prog.txt test.txt

3 Beth
2 Dan
3 Kathy
3 Mark
2 Mary

 

Call by filter

awk-filter.awk

#!/usr/bin/awk -f
BEGIN { print "Name\tSalary" }
{ print $1 "\t" $3 }
END { print " - DONE -" }

chmod u+x awk-filter.awk

cat test.txt | ./awk-filter.awk


 

awk as script

chmod u+x myscript.awk

myscript.awk:

#!/usr/bin/awk -f
BEGIN{ 
  mystr="abcdefgh"
  print length(mystr)
}

A BEGIN block is needed when awk gets no input from a file or from standard input: a program that consists only of BEGIN rules exits without reading anything.

The BEGIN block runs at the very start of execution, before the first file is opened.

 


Print starting from column

 

# Print all but the first column:

awk '{$1=""; print $0}' somefile

# Print all but the first two columns:

awk '{$1=$2=""; print $0}' somefile

# Loop form: for (i = start; i <= stop; i++). Here it prints all but the last column:

awk '{for(i=1;i<=NF-1;i++) printf "%s ", $i; print ""}' somefile
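Blanking $1 keeps its separator, so the printed line still begins with a space. A loop that starts at a chosen column avoids that. This is a minimal sketch; the variable name start and the sample line are assumptions for illustration.

```shell
# Print columns start..NF, joined by OFS, with no leading or trailing separator.
out=$(echo 'a b c d e' | awk -v start=3 '{
    for (i = start; i <= NF; i++)
        printf "%s%s", $i, (i < NF ? OFS : ORS)
}')
echo "$out"
```

The format string is always a literal here, so a "%" inside a field cannot be misread as a format directive.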

 


BEGIN & END

 

BEGIN: runs before any row is read

END: runs after all rows have been processed

Example: sum the sizes of the files dated Jan

ls -l | awk '$6 == "Jan" { sum += $5 } END { print sum }'

Example: run a program directly, without any input (stdin / file)

awk 'BEGIN { print "line one\nline two\nline three" }'
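A common END pattern is to accumulate values in the main rule and report once at the end. The sketch below assumes data laid out like test.txt (name, age, optional salary) and skips rows that have no salary.

```shell
# Sample data in the same layout as test.txt.
data='Beth 18 9000
Dan 23
Kathy 32 6400
Mark 28 7030
Mary 25'

# The main rule accumulates; END reports after the last row.
out=$(printf '%s\n' "$data" | awk '
    NF >= 3 { sum += $3; n++ }
    END     { if (n > 0) printf "%d %d %.2f\n", sum, n, sum / n }
')
echo "$out"    # total, row count, average
```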

 


Select specific lines (FNR)

 

# FNR: The input record number in the current input file

awk 'FNR == 2 {print}' test.txt

 * FNR counts from 1, not from 0
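FNR only differs from NR when several files are given. The following sketch shows this with two throwaway files; the file names and contents are made up.

```shell
tmp=$(mktemp -d)
printf 'a1\na2\n' > "$tmp/a.txt"
printf 'b1\nb2\n' > "$tmp/b.txt"

# FNR restarts at 1 for every file; NR keeps counting across files.
out=$(awk 'FNR == 2 { print NR, FNR, $0 }' "$tmp/a.txt" "$tmp/b.txt")
echo "$out"

rm -r "$tmp"
```

The second line of each file is selected, while NR shows 2 for the first file and 4 for the second.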

 


Column Select($N)

 

As each input record is read, gawk splits the record into fields ($1, $2 ...),

    using the value of the FS variable (or "-F fs" on the command line) as the field separator. (The default is spaces/tabs.)

e.g. 1

awk '{ print $1,$3 }' test.txt

e.g. 2

awk '$3 > 0 {print $1, $2 * $3}' test.txt

 * $0 is the whole line

 * $NF is the last field

e.g. 3

awk '$1 == "findtext" {print $3}' filename

 * String values must be written inside double quotes ("")

 

Set the column separator (FS)

 

awk 'BEGIN { FS = "," } ; { print $2 }'

Default FS: a single space, which makes runs of spaces and tabs act as field separators
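The separator can be set either in a BEGIN block or with -F on the command line. A short sketch with made-up input records:

```shell
# FS set in a BEGIN block.
a=$(echo 'x,y,z' | awk 'BEGIN { FS = "," } { print $2 }')

# The same separator given on the command line with -F.
b=$(echo 'x,y,z' | awk -F, '{ print $2 }')

# A colon-separated record, in the style of /etc/passwd.
c=$(echo 'root:x:0:0' | awk -F: '{ print $1, $3 }')

printf '%s\n' "$a" "$b" "$c"
```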

 


regexp

 

# Find the rows that match a pattern

ls -l | awk '/-rw-------/ {print $9}'

# This method uses regexp, it should work:

awk '$2 ~ /findtext/ {print $3}' filename
  • \          # This is used to suppress the special meaning of a character
  • ^          # This matches the beginning
  • $          # end of a string.
  • [...]      # This is called a character list.( [A-Za-z0-9] )
  • [^ ...]    # It matches any characters except those in the square brackets
  • |          # This is the alternation operator
  • .          # matches any single character (including the newline character) (exactly 1)
  • *          # the preceding item zero or more times (any)
  • +          # the preceding item at least once (>=1)
  • ?          # the preceding item once or not at all (0,1)

To put a literal ] inside [...], escape it:

\]

Store a regexp in a variable (digits_regexp)

BEGIN { digits_regexp = "[[:digit:]]+" }
$0 ~ digits_regexp    { print }

Character classes (used inside [...], e.g. [[:digit:]])

  • '[0-9]' is equivalent to '[0123456789]'
  • [:alnum:]    Alphanumeric characters
  • [:alpha:]    Alphabetic characters
  • [:blank:]    Space and TAB characters
  • [:digit:]      Numeric characters
  • [:xdigit:]    Characters that are hexadecimal digits
  • [:space:]    Space characters (these are: space, TAB, newline, carriage return, formfeed and vertical tab)

Example

The following condition prints the line if it contains the word "whole" or columns 1 and 2 contain "part1" and "part2" respectively.

($0 ~ /whole/) || (($1 ~ /part1/) && ($2 ~ /part2/)) {print}

This can be shortened to

/whole/ || $1 ~ /part1/ && $2 ~ /part2/ {print}

The following prints all lines between 20 and 40:

(NR==20),(NR==40) {print}
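A range pattern "start, end" switches on at the first line matching start and off after the next line matching end. The sketch below uses short generated input instead of a real file; the marker words are made up.

```shell
# Range by line number: lines 3 through 5.
by_number=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | awk 'NR == 3, NR == 5')

# Range by regexp: from the line matching /start/ to the line matching /stop/.
by_regex=$(printf '%s\n' a start b stop c | awk '/start/, /stop/')

printf '%s\n' "$by_number" "$by_regex"
```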

 


OR and AND in filters

 

AND --> &&
OR  --> ||

test.txt

Y       1
N       2
X       3
Y       4
N       5
X       6

Two filters (OR)

# Same idea as: ls -l | grep -e Nov -e Dec

#                       filter1          filter2
cat test.txt | awk '$1~/Y/{print $2}; $1~/N/{print $2}'

Can be written as (a line that matches both filters is then printed only once)

($1~/Y/ || $1~/N/){print $2}
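The two forms differ when a single line satisfies both filters: separate rules each fire, while one rule with || fires once. A sketch with a made-up line whose first field contains both letters:

```shell
# "YN" matches both /Y/ and /N/.
two_rules=$(echo 'YN 7' | awk '$1 ~ /Y/ { print $2 }; $1 ~ /N/ { print $2 }')
one_rule=$(echo 'YN 7' | awk '$1 ~ /Y/ || $1 ~ /N/ { print $2 }')

echo "two rules:"; echo "$two_rules"
echo "one rule:";  echo "$one_rule"
```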

awk multiple condition (AND)

awk 'BEGIN {FS=":"}; $3>500 && $3<65534 { print $1 }' /etc/passwd

"&&" and "||". There is also the unary not operator: "!".

 

 


if ... else ...

 

Syntax:

if (conditional-expression)
{
    action1;
    action2;
}

"{" and "}" are optional when the body is a single statement

if(conditional-expression1)
    action1;
else if(conditional-expression2)
    action2;
else
    action n;

e.g.

if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"
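As an illustration of an if / else if / else chain on real fields, the sketch below labels each row of data shaped like test.txt. The 7000 threshold and the label texts are arbitrary choices for the example.

```shell
data='Beth 18 9000
Dan 23
Kathy 32 6400
Mark 28 7030
Mary 25'

out=$(printf '%s\n' "$data" | awk '{
    if (NF < 3)
        print $1, "no salary"
    else if ($3 > 7000)
        print $1, "high"
    else
        print $1, "normal"
}')
echo "$out"
```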

 

Ternary operator ( ?: )

Syntax

(conditional-expression) ? expr1 : expr2 ;

Example: find the larger of two values

# Wrong

awk 'BEGIN { a = 10; b = 20; (a > b) ? print a : print b;}'

Reason: both branches of the ternary operator must be expressions, and print is a statement.

expression vs. statement

Statements are actions that are carried out, while expressions are computations that result in a value.

# Correct

awk 'BEGIN { a = 10; b = 20; max = (a > b) ? a : b; print max}'
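Because the ternary operator yields a value, it can also be used directly as an argument of print. A minimal sketch with made-up names and ages; the cut-off of 18 is an assumption for the example.

```shell
# The parenthesised ternary is just another value in the print list.
out=$(printf 'Ann 20\nBob 9\n' | awk '{ print $1, ($2 >= 18 ? "adult" : "minor") }')
echo "$out"
```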

 


Field count (NF)

 

awk '{ print NF, $1 }' test.txt

Output:

3 Beth
2 Dan
3 Kathy
3 Mark
2 Mary

Remove blank lines

awk 'NF > 0' input.txt > output.txt          # like sed '/^$/d', but lines holding only spaces/tabs are removed as well

 


Maths: + - * / == != < > >= <= ~ !~

 

awk '{ $6 = ($5 + $4 + $3 + $2) ; print $6 }' inventory-shipped

 


Assign Variable

 

Variable set inside the program

cat test.txt

1
2
3
4

awk '$1>2{MyVar=$1; print MyVar}' test.txt

3
4

Outside Variable

-v var=value    # assigns value to program variable var

 * Such variable values are available to the BEGIN block of an AWK program

/root/test.txt

1       a
2       b
3       c

awk -v myval=mymsg '{print myval ": " $2;}' /root/test.txt

output

mymsg: a
mymsg: b
mymsg: c

 * Do not put "$" in front of the variable name !!
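The usual reason for -v is to hand a shell variable to awk without splicing it into the program text. A sketch, where min_age and the sample rows are assumptions for illustration:

```shell
min_age=25

# The shell expands $min_age; inside awk the value is the plain variable min.
out=$(printf '%s\n' 'Beth 18' 'Dan 23' 'Kathy 32' 'Mark 28' 'Mary 25' |
      awk -v min="$min_age" '$2 > min { print $1 }')
echo "$out"
```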

 


Special variables

 

FNR        # The current record number in the current file

NF          # The number of fields in the current input record

NR          # The number of input records awk has processed since the beginning of the program's execution

BEGIN     # runs before the first row is read

END       # runs after all rows have been processed

 


String Functions

 

  • length(string)
  • index(string,search)
  • match(str, regex)
  • substr(string, position [, length])
  • sub(regexp, replacement [, target])
  • gsub(r, s [, t])
  • split(string, array, separator)
  • strtonum(str)      # 12, 012, 0x12 (gawk only)
  • tolower(str)
  • toupper(str)

length(str)

test.awk

#!/usr/bin/awk -f
BEGIN{ 
  mystr="abcdefgh"
  print length(mystr)
}

index(str, search)

 * Positions start at 1; 0 means not found

test.awk

#!/usr/bin/awk -f
BEGIN{
  sentence = "This is a short, meaningless sentence.";
  if (index(sentence, ",") > 0) {
   printf("Found a comma in position %d \n", index(sentence, ","));
  }
}

match(str, regex)

It returns the index of the leftmost longest match of regex in string str.

It returns 0 if no match found.

The regexp argument may be either a regexp constant (/…/) or a string constant ("…").

In the latter case, the string is treated as a regexp to be matched.
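match() also sets the variables RSTART and RLENGTH, which combine well with substr() to pull out the matched text. A minimal sketch on a made-up string:

```shell
out=$(awk 'BEGIN {
    s = "foo123bar"
    # match() returns the start position (0 if there is no match).
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}')
echo "$out"    # start position, match length, matched text
```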


substr(string, position [, length])

# Extract part of a string. When length is not given, the result runs from position to the end of the string

#!/bin/awk -f
{ print substr($0, 40, 10) }

sub(regexp, replacement [, target])

sub = substitution

This function performs single substitution. (It replaces the first occurrence)

(for the leftmost, longest substring matched by the regular expression regexp)

The third parameter(target) is optional. If it is omitted, $0 is used.

str = "water, water"
sub(/at/, "ith", str)
print str               # wither, water

e.g. Remove a string

# When $3 is "%sasl_username=U@D",

# the output becomes U@D

awk '{ sub(/%sasl_username=/, "", $3); print $3 }'

gsub(r, s [, t])

g = global

For each substring matching the regex r in the string t,

substitute the string s, and return the number of substitutions.
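Applying gsub() to the same "water, water" string used for sub() shows both the global replacement and the return value. A minimal sketch:

```shell
out=$(awk 'BEGIN {
    s = "water, water"
    n = gsub(/at/, "ith", s)   # s is changed in place; n is the number of replacements
    print n, s
}')
echo "$out"
```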

split(string, array, search)

# This script breaks up the sentence into words, using a space as the character separating the words

 * The array is indexed starting from 1

#!/usr/bin/awk -f
BEGIN {
    string = "This is a string, is it not?";
    search = " ";
    n = split(string, array, search);
    for (i=1; i<=n; i++) {
        printf("Word[%d]=%s\n", i, array[i]);
    }
    exit;
}

Result:

Word[1]=This
Word[2]=is
Word[3]=a
Word[4]=string,
Word[5]=is
Word[6]=it
Word[7]=not?

 


printf

 

With printf, we can print without a trailing newline.

print always appends a newline => use printf instead

{printf "%s", $n}

Usage

printf format, item1, item2, …

Format-Control Letters

  • %c              # Print a number as a character; thus, ‘printf "%c", 65’ outputs the letter ‘A’.
  • %d, %i        # Print a decimal integer.
  • %f               # Print a number in floating-point notation. (printf "%4.3f", 1950)
  • %s              # Print a string.

Modifiers

width

This is a number specifying the desired minimum width of a field. (pad with spaces)
( Inserting any number between the ‘%’ sign and the format-control character)

- (Minus)

The minus sign, used before the width modifier, left-justifies the value within the given width (padding goes on the right).

.prec

A period followed by an integer constant specifies the precision to use when printing.

e.g.

ps -axl | awk '{printf "%3d  %s\n", $6, $13}'
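The effect of the width, minus and precision modifiers is easiest to see with fixed values. In this sketch the square brackets are only there to make the padding visible; the values are arbitrary.

```shell
out=$(awk 'BEGIN {
    # width 5 | left-justified width 5 | 2 decimals | left-justified width 6
    printf "[%5d] [%-5d] [%.2f] [%-6s]\n", 42, 42, 3.14159, "ab"
}')
echo "$out"
```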

 


Casting

 

int()

test.txt

'1'
'12'
'123'

cat test.txt  | awk '{gsub(/[^0-9]/,"",$0); print $0}'

1         # The 1, 12 and 123 here are still strings, not numbers.
12
123

Casting

cat test.txt  | awk '{gsub(/[^0-9]/,"",$0); if (int($0) > 11){print $0} }'
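Why the conversion matters: two string values are compared character by character, not by magnitude. Besides int(), adding 0 is a common way to force a numeric value. A sketch with two made-up string values:

```shell
out=$(awk 'BEGIN {
    a = "9"; b = "10"
    as_string = (a < b)           # string comparison: "9" sorts after "10"
    as_number = (a + 0 < b + 0)   # numeric comparison: 9 < 10
    print as_string, as_number
}')
echo "$out"    # 0 means false, 1 means true
```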

 


Associative Arrays

 

Set

array[index-expression] = value

Get

if (a["foo"] != "") …

Example: record every field value that has been seen

{
    for (i = 1; i <= NF; i++)
        used[$i] = 1
}
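A typical use of an associative array is counting how often each key appears. The sketch below also uses the in operator, which tests for a key without creating it (a comparison such as a["foo"] != "" creates the element as a side effect). The input rows and the key "X" are made up; sort is applied because the order of for (k in array) is unspecified.

```shell
out=$(printf 'Y 1\nN 2\nY 4\n' | awk '
    { count[$1]++ }
    END {
        # "in" checks for the key without adding it to the array.
        if (!("X" in count)) print "X 0"
        for (k in count) print k, count[k]
    }
' | sort)
echo "$out"
```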

 


Practical examples

 

#1 Find each sasl user's login and its time

grep sasl_username /var/log/maillog > report.tmp

awk -f sasl_login.awk report.tmp > report.txt

sasl_login.awk

{
# remove the tag's name
sub(/sasl_username=/,"",$9)

sp = index($7, "[")
np = index($7, "]")
len = np - sp - 1
ip = substr($7, sp + 1, len)

print $1" "$2" "$3" "ip"\t"$9
}

Notes

 * IP addresses vary in length, so separate the columns with "\t" rather than a space

report.tmp

Aug 14 04:03:00 <HOSTNAME> <Daemon> MID: client=unknown[R.R.R.R], sasl_method=LOGIN, sasl_username=U@D

#2 Types of current TCP connections

netstat -ant | awk 'FNR>2{print $6}' | sort | uniq -c | sort -n

      2 TIME_WAIT
      4 FIN_WAIT2
      7 ESTABLISHED
      8 LISTEN

P.S. FNR>2 skips the two header lines of the netstat output:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:9000          0.0.0.0:*               LISTEN
...

 


Multiple filters

 

get_outgoing_mail_statistics.sh

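#!/bin/bash

quota=1

# Pass the shell variable in with -v instead of splicing it into the program.
/usr/sbin/postfwd --dumpcache |
awk -v quota="$quota" '
$7 == "$count" {
        sub(/%sasl_username=/, "", $3)
        gsub(/[^0-9]/, "", $9)
        # No newline here: the "$maxcount" rule finishes the line.
        if (int($9) >= quota) printf "%s %s", $3, $9
}
$7 == "$maxcount" {
        gsub(/[^0-9]/, "", $9)
        print "/" $9
}'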

 


Capturing Groups

 

match(string, regexp [, array])

arr[0]                   # entire portion of string matched by regexp

arr[1], arr[2] ...    # contain the portion of string matching the corresponding parenthesized subexpression

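 * If the array already exists, it is cleared first.

 * The third argument (array) is a gawk extension.

Application

A Postfix smtp mail flow ends in one of two statuses:

  • status=bounced
  • status=sent

# "status=bounced" is found in /var/log/mail

# $1 " " $2 " " $3 " - " $7    => date and time, then to=<r@R>, followed by bounced|sent

echo "status=bounced" | awk 'match($0, /status=(bounced|sent)/, arr) {print $1 " " $2 " " $3 " - " $7 " " arr[1]}'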

 


awk - system()

 

 

system(cmd) runs cmd through the shell and returns its exit status

ipcs -s | awk -v user=apache '$3==user {system("ipcrm -s "$2)}'
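The return value of system() can be checked to see whether the command succeeded. The sketch below uses the harmless commands true and false as stand-ins for a real command.

```shell
out=$(awk 'BEGIN {
    ok  = system("true")    # exit status 0
    bad = system("false")   # non-zero exit status
    print ok, (bad != 0)
}')
echo "$out"
```

Since the command string is passed to the shell, field values pasted into it should come from trusted input only.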