Solution:
Three fields are needed from the original log to answer the questions: IP, URL, and timestamp.
One way to sessionize the log is to reorganize it into a collection of <K,V> pairs, where K is the IP address and V is a list of sessions. Each session is a 3-tuple of last_access_time, Set(URL, URL,...), and
total_session_time.
Step 1 Loading data into RDD
Load the original log into initRDD, using the IP address without the port number as the key. The value is
initially a tuple (Long, String), in which the first element is the timestamp converted to milliseconds
since the epoch, and the second element is the URL.
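A minimal sketch of the per-record conversion in this step, assuming the three fields have already been split out of the log line, that the timestamp is in ISO-8601 form, and that the client field looks like ip:port with an IPv4 address. The object LoadStep and the helper toKeyValue are hypothetical names used only for illustration; the actual log layout is not specified here.

```scala
import java.time.Instant

object LoadStep {
  // Hypothetical helper: builds the <K,V> pair of Step 1 from the three extracted fields.
  // Assumes an ISO-8601 timestamp and a client field of the form "ip:port" (IPv4).
  def toKeyValue(timestamp: String, client: String, url: String): (String, (Long, String)) = {
    val ip = client.takeWhile(_ != ':')                 // drop the ":port" suffix
    val millis = Instant.parse(timestamp).toEpochMilli  // milliseconds since the epoch
    (ip, (millis, url))
  }
}
```

Mapping such a function over the parsed lines would yield pairs of the shape <IP, (Long, String)> described above.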
Step 2 Aggregate data by IP
Run aggregateByKey to combine the tuples produced in the previous step into a List[(Long, String)] per IP.
The list elements are sorted by milliseconds since the epoch. By the end of this step, each <K,V> pair has
the form <IP, List[(Long, String)]>.
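One possible shape for this aggregation is sketched below, assuming Spark is on the classpath and that the input is an RDD of <IP, (Long, String)> pairs. The object AggregateStep, the alias Hit, and the helpers seqOp, combOp, and aggregateByIp are hypothetical names. Sorting once after the aggregation, rather than keeping the list sorted on every insert, is a design choice of this sketch and not necessarily what the original job did.

```scala
import org.apache.spark.rdd.RDD

object AggregateStep {
  type Hit = (Long, String) // (milliseconds since the epoch, URL)

  // Adds one hit to the list being built within a partition.
  def seqOp(acc: List[Hit], hit: Hit): List[Hit] = hit :: acc

  // Merges the lists built for the same IP on two partitions.
  def combOp(a: List[Hit], b: List[Hit]): List[Hit] = a ::: b

  // Collects all hits per IP, then sorts each list by timestamp.
  def aggregateByIp(initRDD: RDD[(String, Hit)]): RDD[(String, List[Hit])] =
    initRDD
      .aggregateByKey(List.empty[Hit])(seqOp, combOp)
      .mapValues(_.sortBy(_._1))
}
```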
Step 3 Sessionize weblog by IP
Run mapValues over the value of each key K. Neighbouring tuples in the list are merged into one session if
their timestamps differ by less than 15 minutes; their URLs are unioned into a set, and the total session
time is accumulated. The final format of the value V is ListBuffer[(Long, Set[String], Int)]. The
elements of the tuple (Long, Set[String], Int) are last_access_time, Set(URL, URL,...), and
total_session_time, as described earlier.
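The merging rule can be sketched in plain Scala as follows, assuming the input list is already sorted by timestamp and that total_session_time is kept in milliseconds (the unit is an assumption; it is not stated above). The object SessionizeStep, the alias Session, the constant TimeoutMs, and the function sessionize are hypothetical names.

```scala
import scala.collection.mutable.ListBuffer

object SessionizeStep {
  // (last_access_time, set of URLs, total_session_time in ms, an assumed unit)
  type Session = (Long, Set[String], Int)

  val TimeoutMs: Long = 15 * 60 * 1000L // 15-minute inactivity window

  // Expects hits sorted by timestamp, as produced by Step 2.
  def sessionize(hits: List[(Long, String)]): ListBuffer[Session] = {
    val sessions = ListBuffer.empty[Session]
    for ((ts, url) <- hits) {
      sessions.lastOption match {
        case Some((last, urls, total)) if ts - last < TimeoutMs =>
          // Gap under 15 minutes: extend the current session in place.
          sessions(sessions.length - 1) = (ts, urls + url, total + (ts - last).toInt)
        case _ =>
          // First hit for this IP, or a gap of 15 minutes or more: start a new session.
          sessions += ((ts, Set(url), 0))
      }
    }
    sessions
  }
}
```

In the Spark job, a function like this would be passed to mapValues on the <IP, List[(Long, String)]> pairs from Step 2. For example, hits at 0 ms, 60,000 ms, and 960,000 ms produce two sessions, because the last gap is exactly 15 minutes and therefore not less than the timeout.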