Loading data

Step 1. Download

Download the text format of the Bible from https://sites.google.com/site/ruwach/bibletext

Step 2. Data extraction

Extract the New Testament.

Step 3. Data consolidation

Concatenate all sections into one Peter.txt file:

cat *.txt > Peter.txt

Step 4. Load data into HDFS

hdfs dfs -put /root/Downloads/NewTestament/1_2Peter/Peter.txt hdfs://localhost:9000/Peter

Step 5. Data verification

hdfs dfs -ls hdfs://localhost:9000/Peter

hdfs dfs -cat hdfs://localhost:9000/Peter

Step 6. Analysis in Spark

val input = sc.textFile("hdfs://localhost:9000/Peter")

val counts = input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

counts.saveAsTextFile("Peter_output")

hdfs dfs -tail hdfs://localhost:9000/user/root/Peter_output/part-00000
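To make clear what the flatMap / map / reduceByKey chain computes, here is a minimal sketch that does the same counting on an ordinary Scala collection, without Spark. The object name LocalWordCount and the sample lines are made up for illustration. Plain collections have no reduceByKey, so groupBy followed by a sum stands in for it.

```scala
object LocalWordCount {
  // Count words in a handful of lines, mirroring the Spark chain above.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))  // one element per space-separated token
      .map(word => (word, 1))            // pair each token with a count of 1
      .groupBy(_._1)                     // local stand-in for reduceByKey: group pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s per word
}
```

For example, LocalWordCount.count(Seq("grace and peace", "peace be with you")) maps peace to 2 and every other word to 1. Note that splitting on a single space keeps empty tokens when a line contains consecutive spaces, and punctuation stays attached to words, which is why the final solution cleans the text first.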

val file = sc.textFile("hdfs://localhost:9000/Peter")

val counts = file.flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  sortByKey(true, 1)

counts.saveAsTextFile("Peter_SortedOutput2")

This produced the same result as the very first practice run on input.txt.

Run counts.

Reset Scala session

Bible word counting

val file = sc.textFile("hdfs://localhost:9000/Peter")

val counts = file.flatMap(line => line.split(" ")).map(p => (p, 1)).reduceByKey(_ + _).sortByKey(false, 1)

counts.saveAsTextFile("Peter_SortedOutput5")

hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
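Because the pairs are (word, count), sortByKey orders the output by word, not by frequency: true gives alphabetical order and false gives reverse alphabetical order. The sketch below shows the difference on a local collection. LocalSort and the sample pairs are hypothetical names and data, and sortBy on a Seq is used here as a local stand-in for Spark's sorting.

```scala
object LocalSort {
  // Sort (word, count) pairs by word, descending: the local analogue of sortByKey(false).
  def byWordDescending(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy(_._1)(Ordering[String].reverse)

  // Sort by count, descending, to put the most frequent words first.
  def byCountDescending(counts: Seq[(String, Int)]): Seq[(String, Int)] =
    counts.sortBy(pair => -pair._2)
}
```

With Seq(("peace", 5), ("grace", 1), ("you", 3)), byWordDescending puts you first, while byCountDescending puts peace first. To rank by frequency in Spark, the count has to become the key before sortByKey is called.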

Final solution

(provided by Rockie Yang on Stack Overflow)

import org.apache.spark.ml.feature.StopWordsRemover

import org.apache.spark.sql.functions.split

val lines = sc.textFile("hdfs://localhost:9000/Peter").map(_.replaceAll(raw"[^A-Za-z0-9\s]+", "").trim.toLowerCase).toDF("line")

val words = lines.select(split($"line", " ").alias("words"))

val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")

val noStopWords = remover.transform(words)

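// .rdd turns the DataFrame into an RDD[Row], so reduceByKey is available on the (word, 1) pairs
val counts = noStopWords.select(explode($"filtered")).rdd.map(row => (row.getString(0), 1)).reduceByKey(_ + _)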

// from word -> num to num -> word

val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)

//save result to hdfs

//mostCommon.saveAsTextFile("Peter_Final")

//show top 5

mostCommon.take(5)
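The steps of the final solution (strip punctuation, lower-case, split, drop stop words, count, sort by count, take the top few) can be sketched on a plain Scala collection as follows. This is an illustration under stated assumptions, not the Spark implementation: LocalTopWords is a hypothetical name, and the six-word stop list is a stand-in, since StopWordsRemover ships with a much longer default English list.

```scala
object LocalTopWords {
  // Tiny illustrative stop-word list (an assumption; Spark's default list is far longer).
  val stopWords: Set[String] = Set("the", "and", "of", "to", "in", "a")

  // Return the n most frequent non-stop words as (count, word) pairs.
  def top(lines: Seq[String], n: Int): Seq[(Int, String)] =
    lines
      .map(_.replaceAll("[^A-Za-z0-9\\s]+", "").trim.toLowerCase)  // strip punctuation, normalise case
      .flatMap(_.split("\\s+"))                                     // split on runs of whitespace
      .filter(word => word.nonEmpty && !stopWords.contains(word))  // drop blanks and stop words
      .groupBy(identity)                                            // group identical words
      .toSeq                                                        // leave Map so equal counts do not collide as keys
      .map { case (word, occurrences) => (occurrences.size, word) } // (count, word), as in mostCommon
      .sortBy { case (count, word) => (-count, word) }              // most frequent first, ties alphabetical
      .take(n)
}
```

For example, LocalTopWords.top(Seq("Grace and peace, to you.", "Peace in the Lord; peace!"), 2) returns (3, peace) followed by (1, grace). Swapping to (count, word) before sorting mirrors the mostCommon step above, where the count becomes the key for sortByKey.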
