Loading data
Step 1. Download
Download the Bible in text format from https://sites.google.com/site/ruwach/bibletext.
Step 2. Data extract
Extract the New Testament text (the 1 and 2 Peter books are used below).
Step 3. Data consolidation
Concatenate all sections into a single Peter.txt file:
cat *.txt > Peter.txt
Step 4. Load data into HDFS
hdfs dfs -put /root/Downloads/NewTestament/1_2Peter/Peter.txt hdfs://localhost:9000/Peter
Step 5. Data verification:
hdfs dfs -ls hdfs://localhost:9000/Peter
hdfs dfs -cat hdfs://localhost:9000/Peter
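Once the Spark shell used in Step 6 is up, the same file can also be sanity-checked from Scala; this is an optional cross-check, not one of the original steps:
val check = sc.textFile("hdfs://localhost:9000/Peter")
check.count()     // number of lines loaded from HDFS
check.first()     // the first line of the file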
Step 6. Analysis in Spark
val input = sc.textFile("hdfs://localhost:9000/Peter")
val counts = input.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("Peter_output")
hdfs dfs -tail hdfs://localhost:9000/user/root/Peter_output/part-00000
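Because saveAsTextFile was given the relative path "Peter_output", HDFS resolves it against the user's home directory, which is why the tail command above reads /user/root/Peter_output. A few pairs can also be previewed straight from the shell without touching the part file (an optional sketch):
counts.take(10).foreach(println)     // some (word, count) pairs, in no particular order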
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" "))
.map(word =>(word, 1))
.reduceByKey(_ + _)
.sortByKey(true, 1)
counts.saveAsTextFile("Peter_SortedOutput2")
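Here sortByKey(true, 1) orders the (word, count) pairs ascending by the word itself and writes everything into a single partition, so the result lands in one part-00000 file. A single word's count can also be looked up directly; "Jesus" below is just an illustrative key (the text has not been lowercased at this stage):
counts.lookup("Jesus")     // Seq of counts recorded under this exact key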
Running counts produced the same result as the very first practice on input.txt.
Reset Scala session
Bible word counting
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" ")).map(p => (p, 1)).reduceByKey(_ + _).sortByKey(false, 1)
counts.saveAsTextFile("Peter_SortedOutput5")
hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
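Note that sortByKey(false, 1) still sorts by the key, i.e. reverse-alphabetically by word, not by frequency. To rank words by how often they occur, the pair has to be swapped so the count becomes the key, which is what the final solution below does; a minimal sketch on the counts RDD from above (byCount is just an illustrative name):
val byCount = counts.map(p => (p._2, p._1)).sortByKey(false, 1)     // (count, word), most frequent first
byCount.take(5)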
Final solution
(provided by Rockie Yang on Stack Overflow)
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{split, explode}
// strip non-alphanumeric characters, trim, and lowercase each line before splitting into words
val lines = sc.textFile("hdfs://localhost:9000/Peter").map(_.replaceAll(raw"[^A-Za-z0-9\s]+", "").trim.toLowerCase).toDF("line")
val words = lines.select(split($"line", " ").alias("words"))
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val noStopWords = remover.transform(words)
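// Optional check, not part of the original answer: peek at the stop words the remover will drop.
// getStopWords returns the configured list; the default English list varies slightly by Spark version.
remover.getStopWords.take(10).foreach(println)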
val counts = noStopWords.select(explode($"filtered")).map(word => (word, 1)).reduceByKey(_ + _)
// from word -> num to num -> word
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
// save the result to HDFS
//mostCommon.saveAsTextFile("Peter_Final")
// show the top 5
mostCommon.take(5)
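take(5) brings the top entries back to the driver as a local Array; to print them one per line (a small convenience, not part of the original answer):
mostCommon.take(5).foreach(println)     // each entry is a (count, word) pair, most frequent first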