AWS - Athena

最後更新: 2024-03-04

介紹

Athena queries data directly from Amazon S3.

支援 Querying

  • SQL
  • Apache Spark

應用例子

Analysis aws waflogs from S3

目錄

  • Pricing
  • Analysis aws waflogs from S3
  • 建立 Data Source
  • 建立 Workgroup
  • Usage
  • WAF log 常用的 fields
  • Athena create saved query

 


Pricing

 

SQL: $5.00 per TB of data scanned

  • Rounded up to the nearest megabyte, with a 10 MB minimum per query
  • no charges for Data Definition Language (DDL) statements like CREATE, ALTER, or DROP TABLE, or failed queries.

Apache Spark: $0.35 per DPU-hour billed per minute.

Additional costs

You are charged standard S3 rates for storage, requests, and data transfer.

If you use the AWS Glue Data Catalog with Athena, you are charged standard Data Catalog rates. (Data Source)

省錢方式

Partitioning

限制 scan data 的數量

Gzip log file

Athena 支援 scan gz file => size 細左會平左

Parquet format

Parquet is columnar

The query in question only references X column, Athena reads only that column and can avoid reading other file.

 


Analysis aws waflogs from S3

 

Create an AWS Athena table to query the AWS WAF logs stored in S3.

This allows you to perform SQL-like queries on the log data.

log file path in S3

S3 > Buckets > "Name-Of-Bucket" > AWSLogs/AWS_AC_ID/WAFLogs/cloudfront/Web-ACLs-Name/YYYY/MM/DD/hh/mm/files

  • hh: 00~23
  • mm: 00, 05, 10 ...
  • files: ?.log.gz    # 一行一 json

 

 


Data Source

 

data source = metadata (schema) +  dataset (CSV, JSON, Parquet ...)

Athena natively supports the "AWS Glue Data Catalog"

$1.00 per 100,000 objects over a million, per month.
An object in the Data Catalog is a table, table version, partition, partition indexes, statistics or database.

$1.00 per million requests over the first million.
Some of the common requests are CreateTable, CreatePartition, GetTable , GetPartitions, and GetColumnStatisticsForTable.

建立 "Workgroups"

左邊 menu > Administration section > Data sources

  1. S3 - AWS Glue Data Catalog
  2.  

 


建立 Workgroup

By default, each account has a primary workgroup and
the default permissions allow all authenticated users access to this workgroup.

The primary workgroup cannot be deleted.

Each workgroup that you create shows saved queries and query history only for queries that ran in it.

Disabling a workgroup prevents queries from running in it, until you enable it.

Queries sent to a disabled workgroup fail, until you enable it again.

Query result configuration (optional)

將 Run 完的 result save 到 S3

i.e.

建立 "Workgroups"

左邊 menu > Administration section > Workgroups

Analytics engine: Athena SQL / Apache Spark

Data usage controls(Per query data usage control)

If the query exceeds the limit, it will be cancelled (Minimum limit is 10 MB)

Partition projection

Automatically adds new partitions as new data is added.

This removes the need for you to manually add partitions by using ALTER TABLE ADD PARTITION.

 


Usage

 

0. 選: us-east-1 

1. Create DB

Query editor >

CREATE DATABASE waf_logs

2. Create Tables

Define the table schema based on the log format and structure.

Tables DDL statement

https://docs.aws.amazon.com/athena/latest/ug/waf-logs.html

3. 建立並執行 SQL

行完會有 Result

Completed                 Time in queue: 60 ms Run time: 3.118 sec Data scanned: 52.11 MB

i.e.

# Return records from the last 24 hours

SELECT to_iso8601(from_unixtime(timestamp/1000)) AS time_ISO_8601,
       httpsourcename,
       httpsourceid,
       httprequest
FROM waf_logs
WHERE from_unixtime(timestamp/1000) > now() - interval '1' day
LIMIT 10;

# Search by URL

SELECT to_iso8601(from_unixtime(timestamp/1000)) AS time_ISO_8601,
       httpsourceid as "CF Distribution",
       terminatingruleid as "Last issued rule",
       action as "Action",
       httprequest.clientip as "Client IP",
       httprequest.country as "Country",
       httprequest.uri as "Request URI",
       httprequest.httpmethod as "Method",
       httprequest.headers as "Full request header"
FROM waf_logs
WHERE httprequest.uri like '/stripe%'

 


WAF log 常用的 fields

 

i.e.

  • "timestamp": 1700694004928
  • "terminatingRuleId": "AWS-AWSManagedRulesAnonymousIpList"
  • "terminatingRuleType": "MANAGED_RULE_GROUP"
  • "action": "BLOCK"
  • "httpSourceId": "CF 的 Distributions 名"

httpRequest # client info 的 json

  • clientIp
  • country
  • headers
  • uri
  • httpMethod
  • requestId              # httprequest.requestId

 


Athena create saved query

 

 * You cannot change the workgroup or database to which the query belongs.

Above the query editor window, on the tab for the query, choose the three vertical dots, and then choose Save as.

 


 

 

 

Creative Commons license icon Creative Commons license icon