Solution:

Three fields from the original log are needed to answer the questions: IP, URL, and timestamp.

One way to sessionize the log is to reorganize it into a collection of <K,V> pairs, where K is the IP address and V is a list of sessions. Each session is a 3-tuple of last_access_time, Set(URL, URL,...), and total_session_time.

Step 1 - Load data into an RDD

Load the original log into initRDD, using the IP (without the port number) as the key. The value is initially a tuple (Long, String), in which the first element is the timestamp converted to milliseconds since the epoch and the second element is the URL.
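As a minimal sketch of the per-record conversion in this step, the helper below turns three already-extracted fields into the <K,V> pair described above. The names `LogRecord` and `toKeyValue` are hypothetical, and the sketch assumes the timestamp is an ISO-8601 instant and the client field has the form `ip:port` with an IPv4 address; the source does not state the log format, so adjust the parsing to the actual data.

```scala
import java.time.Instant

object LogRecord {
  // Hypothetical helper: builds (IP, (epochMillis, URL)) from raw fields.
  // Assumes `timestamp` is ISO-8601 (e.g. "2015-07-22T09:00:28.019Z")
  // and `client` looks like "1.2.3.4:5678" (IPv4 plus port).
  def toKeyValue(timestamp: String, client: String, url: String): (String, (Long, String)) = {
    val ip = client.takeWhile(_ != ':')                  // drop the ":port" suffix
    val millis = Instant.parse(timestamp).toEpochMilli   // milliseconds since the epoch
    (ip, (millis, url))
  }
}
```

On an RDD of raw lines, a function like this would typically be applied inside `map` after the line has been split into fields.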

Step 2 - Aggregate data by IP

Run aggregateByKey to combine the tuples produced in the previous step into a List[(Long, String)] per IP. The list elements are then sorted by milliseconds since the epoch. By the end of this step, each <K,V> pair has the form <IP, List[(Long, String)]>.
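The aggregation can be sketched on plain Scala collections, with no Spark dependency, to show the shape of the two functions that aggregateByKey expects. The names `AggregateByIp`, `seqOp`, `combOp`, and `aggregate` are illustrative, and `groupBy` plus `foldLeft` stands in for the distributed shuffle; this is an approximation of the step, not the original implementation.

```scala
object AggregateByIp {
  type Hit = (Long, String) // (epochMillis, URL)

  // seqOp: add one hit to a partial list
  def seqOp(acc: List[Hit], hit: Hit): List[Hit] = hit :: acc

  // combOp: merge two partial lists (used by Spark across partitions)
  def combOp(a: List[Hit], b: List[Hit]): List[Hit] = a ::: b

  // Local stand-in for the RDD step: group by IP, fold with seqOp,
  // then sort each list by timestamp.
  def aggregate(records: Seq[(String, Hit)]): Map[String, List[Hit]] =
    records.groupBy(_._1).map { case (ip, kvs) =>
      ip -> kvs.map(_._2).foldLeft(List.empty[Hit])(seqOp).sortBy(_._1)
    }
}
```

With Spark, the same two functions could be passed as `initRDD.aggregateByKey(List.empty[Hit])(seqOp, combOp)`, followed by `mapValues(_.sortBy(_._1))`. The sort is a separate pass because aggregateByKey itself gives no ordering guarantee.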

Step 3 - Sessionize weblog by IP

Run mapValues over the list of each key. Tuples in the list are merged into one session if the difference in timestamp between two neighbouring tuples is less than 15 minutes; within a session, the URLs are unioned into a set and the total session time is accumulated. The final format of the value V is ListBuffer[(Long, Set[String], Int)], where the elements of the tuple (Long, Set[String], Int) are last_access_time, Set(URL, URL,...), and total_session_time, as described earlier.
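The merge rule can be sketched as a function over one IP's sorted list, suitable for passing to mapValues. The names `Sessionizer`, `TimeoutMs`, and `sessionize` are hypothetical, and the sketch assumes total_session_time is measured in milliseconds, which the text does not specify.

```scala
import scala.collection.mutable.ListBuffer

object Sessionizer {
  // (last_access_time, Set(URL, ...), total_session_time)
  type Session = (Long, Set[String], Int)

  val TimeoutMs: Long = 15 * 60 * 1000L // 15-minute inactivity window

  // `hits` must already be sorted by timestamp (see Step 2).
  def sessionize(hits: List[(Long, String)]): ListBuffer[Session] = {
    val sessions = ListBuffer.empty[Session]
    hits.foreach { case (ts, url) =>
      sessions.lastOption match {
        // Gap below the window: extend the current session in place.
        case Some((last, urls, total)) if ts - last < TimeoutMs =>
          sessions(sessions.length - 1) = (ts, urls + url, total + (ts - last).toInt)
        // First hit, or gap of 15 minutes or more: start a new session.
        case _ =>
          sessions += ((ts, Set(url), 0))
      }
    }
    sessions
  }
}
```

Under these assumptions a session containing a single hit has a total time of 0, and a gap of exactly 15 minutes starts a new session because the comparison is strict.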

Step 4 - Determine the Average Session Time

Step 5 - Determine Unique URL Visits per Session

Step 6 - Find the Most Engaged Users
