Dan Fitch's posterous account

Archive for

January 2011

Hive parse_url

I have recently started working with Hive, an Apache project which sits on top of Hadoop and handles all of the management of turning SQL like commands into Map Reduce statements.  So far it is great, and I am using it to parse gigs worth of log data and make it available for queries, and reports.  

So this comes to a string manipulations method that works really well for this.  I am most interested in the query parameters on the url and parse_url has become my friend.  Parse_url takes 2 parameters that I use, the url, and the part of the url that you want the function to return.  

If your select statement had a field named url the you would use the method below to extract the query data from the url.

url = "http://danfitch.net/stuff?a=1&b=42&x=abc"

parse_url(url,"QUERY")

The value that is returned is "a=1&b=42&x=abc".  And because we can take this one step further and convert the query into a hash map using another great method "str_to_map".  str_to_map takes 3 parameters, the string you want to convert, the character separating the key value pairs, and the character separating the key from the value.

params = "a=1&b=42&x=abc"

str_to_map(str_to_map,"&","=")

The value that is returned is a Hash with the keys and values {"a"=>"1","b"=>"42","x"=>"abc"}.  And because Hive supports a MAP type you take the values returned from str_to_map and store them into a MAP field.  So if your url changes and you add a parameter to it you won't have to update your Hive table it will just accept the new key and value into its MAP

 

Example

 

 

CREATE TABLE AKAMAI_LOGS (
ip STRING,
day STRING,
time STRING,
request STRING,
result STRING,
bytes STRING,
referrer STRING,
user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\S+)\\s\\S+\\s\\S+\\s\\[([^:\\]]*):([^\\]]*)\\]\\s\"([^\"]*)\"\\s(\\d*)\\s(\\d*)\\s\"([^\"]*)\"\\s\"([^\"]*)\"\\s\"([^\"]*)\"$"
);

 

CREATE TABLE REQUESTS (
dt STRING,
ip STRING,
params MAP<STRING,STRING>)
PARTITIONED BY(day STRING)
STORED AS SEQUENCEFILE;

#Below I get the raw data from the logs and prep it before putting it into my table.

INSERT OVERWRITE TABLE REQUESTS PARTITION(day)
select akama_logs_output.dt, akama_logs_output.ip, akama_logs_output.params, akama_logs_output.day
from 
(select concat(ads.day," ",split(ads.time," ")[0]) as dt, ads.ip as ip,str_to_map(parse_url(split(ads.request[1]),"QUERY"),"&","=") as params, ads.day
FROM AKAMAI_LOGS ads 
) akama_logs_output;

 

 

 

 

 

Filed under  //   apache hadoop   apache hive   hadoop   hive   log parsing   map reduce