Step by step:
1. Import StopWordsRemover and the SQL functions used below
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{split, explode}
2. Load, transform and cleanse the data (strip everything except letters, digits and whitespace)
val lines = sc.textFile("hdfs://localhost:9000/Peter").map(_.replaceAll(raw"[^A-Za-z0-9\s]+", "").trim.toLowerCase).toDF("line")
2.1 Split each line into an array of words
val words = lines.select(split($"line", " ").alias("words"))
2.2 Create a remover that strips all stop words from the words column
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
2.3 Apply the remover, then count each remaining word
val noStopWords = remover.transform(words)
val counts = noStopWords.select(explode($"filtered").alias("word")).rdd.map(row => (row.getString(0), 1)).reduceByKey(_ + _)
2.4 Sort the result by count, descending
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
2.5 Save the result to HDFS (a relative path resolves under /user/root)
mostCommon.saveAsTextFile("Peter_Final")
2.6 Copy the result from HDFS to a local text file:
hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_Final/part-00000 > Peter_Final.txt
2.7 Show the top 100 entries of the sorted result:
head -n 100 Peter_Final.txt
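The steps above, gathered into one spark-shell sketch (assuming `sc` and `spark` are predefined, as in a spark-shell session, and the same HDFS paths):

```scala
// Consolidated word-count pipeline; paths are the ones used in the steps above.
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{split, explode}

val lines = sc.textFile("hdfs://localhost:9000/Peter")
  .map(_.replaceAll(raw"[^A-Za-z0-9\s]+", "").trim.toLowerCase)
  .toDF("line")
val words = lines.select(split($"line", " ").alias("words"))
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
val noStopWords = remover.transform(words)
val counts = noStopWords.select(explode($"filtered").alias("word"))
  .rdd.map(row => (row.getString(0), 1))
  .reduceByKey(_ + _)
// Swap (word, count) to (count, word) so sortByKey orders by frequency.
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(ascending = false, numPartitions = 1)
mostCommon.saveAsTextFile("Peter_Final")
```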
To fix: numbers are still counted as words.
To do: loop over all the Bible books, merge them into a single text file, and run the solution again.
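One way to address the numbers issue is to drop digits from the character class the cleansing regex keeps, so verse and chapter numbers never become tokens. A sketch (the sample string is illustrative, not from the data):

```scala
// Keep only letters and whitespace; digits such as "1", "2", "3" are removed.
val cleaned = "1 peter 2 3 Grace and Peace".replaceAll(raw"[^A-Za-z\s]+", "").trim.toLowerCase
// Removed digits leave extra spaces, so split on runs of whitespace
// and drop any empty tokens.
val tokens = cleaned.split(raw"\s+").filter(_.nonEmpty)
// tokens: Array("peter", "grace", "and", "peace")
```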
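Merging the books by hand may not be necessary: `sc.textFile` accepts comma-separated paths and glob patterns, so all books can be read in one pass. A sketch, assuming a hypothetical layout of one file per book under an HDFS directory named `bible`:

```scala
// Hypothetical path: every book stored as a separate file under /bible.
// The rest of the pipeline can then run on allBooks unchanged.
val allBooks = sc.textFile("hdfs://localhost:9000/bible/*")
```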