Querying AWS ELB access logs with EMR

On March of 2014 Amazon announced a new feature for ELB (Elastic Load Balancer): Access logs.

Those logs contain access information, and from there you can extract things like the average response time of 3 involved parts:

  • Request processing time: Total time elapsed (in seconds) from the time the load balancer receives the request and sends the request to a registered instance

  • Backend processing time: Total time elapsed (in seconds) from the time the load balancer sends the request to a registered instance and the instance begins sending the response headers.

  • Response processing time: Total time elapsed (in seconds) from the time the load balancer receives the response header from the registered instance and starts sending the response to the client. This processing time includes both queuing time at the load balancer and the connection acquisition time from the load balancer to the backend.

See the full AWS ELB access log documentation.

This blog post is about querying ELB (Elastic Load  Balancer) access logs with EMR (Elastic Map Reduce).

1 – Initializating EMR instance

elastic-mapreduce --create --alive --name "Pau Gay - AWS ELB Logging analysis" --hive-interactive --num-instances 4 --master-instance-type c3.xlarge --slave-instance-type c3.4xlarge
elastic-mapreduce --ssh {jobId}

# once logged into the box
sudo apt-get install tmux -y
tmux
hive

2 – Running the hive command

add jar file:/home/hadoop/.versions/hive-0.8.1/lib/hive-builtins-0.8.1.jar;
add jar file:/home/hadoop/.versions/hive-0.8.1/lib/hive-serde-0.8.1.jar;
add jar file:/home/hadoop/.versions/hive-0.8.1/lib/hive-contrib-0.8.1.jar;

SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;

DROP TABLE elb_raw_access_logs;
CREATE EXTERNAL TABLE elb_raw_access_logs (
  `Timestamp` STRING,
  ELBName STRING,
  RequestIP STRING,
  RequestPort STRING,
  BackendIP STRING,
  BackendPort STRING,
  RequestProcessingTime STRING,
  BackendProcessingTime STRING,
  ClientResponseTime STRING,
  ELBResponseCode STRING,
  BackendResponseCode STRING,
  ReceivedBytes STRING,
  SentBytes STRING,
  RequestVerb STRING,
  URL STRING,
  Protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\"$"
)
LOCATION '{YOUR_ELB_ACCESS_LOGS_S3_LOCATION};

--
# you can test all works with the following query
select * from elb_raw_access_logs limit 10;

# and this will give you the average response time metrics
select
  avg(RequestProcessingTime) as average_request_processing_time,
  avg(BackendProcessingTime) as average_request_processing_time ,
  avg(ClientResponseTime) as average_request_processing_time
from elb_raw_access_logs;

This is all.

I’m sharing this hive query because I ran in lots of troubles with the RegexSerDe, so I figure out myself the addition of the above listed jars. See that you can only map a regular expression to a String, but yo can do casts afterwards to convert data into your preferred type.

Nb. Remember to terminate your EMR instance:

root@pau-vm:~# elastic-mapreduce --list --active
j-ABCD     WAITING        xxxx         Pau Gay - AWS ELB Logging analysis
   COMPLETED      Setup Hive
root@pau-vm:~# elastic-mapreduce --terminate j-ABCD
Terminated job flow j-ABCD