Step by step:

1. Import StopWordsRemover and the SQL functions used below

import org.apache.spark.ml.feature.StopWordsRemover

import org.apache.spark.sql.functions.{split, explode}

2. Transform and cleanse the data: strip every character that is not a letter, digit, or whitespace, then trim and lowercase each line

val lines = sc.textFile("hdfs://localhost:9000/Peter").map(_.replaceAll(raw"[^A-Za-z0-9\s]+", "").trim.toLowerCase).toDF("line")
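The cleansing rule can be checked on a plain Scala string, without Spark; the sample verse below is only illustrative:

```scala
// Illustrative check of the cleansing regex on a plain string.
// The verse text here is a made-up sample, not part of the dataset.
val sample = "In the beginning, God created... (Genesis 1:1)"
val cleaned = sample.replaceAll(raw"[^A-Za-z0-9\s]+", "").trim.toLowerCase
// cleaned: "in the beginning god created genesis 11"
```

Note the side effect visible above: stripping punctuation fuses verse references like "1:1" into bare numbers, which is one reason numbers show up in the counts later.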

2.1 Split each cleansed line into an array of words

val words = lines.select(split($"line", " ").alias("words"))

2.2 Create a remover to strip all stop words from the words column

val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")

2.3 Apply the remover, then count the occurrences of each remaining word

val noStopWords = remover.transform(words)

val counts = noStopWords.select(explode($"filtered")).rdd.map(row => (row.getString(0), 1)).reduceByKey(_ + _)
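The pair-and-reduce step mirrors the classic word-count pattern; a minimal sketch on a plain Scala collection (the sample words are made up) shows the same logic without a cluster:

```scala
// Sketch of .map(word => (word, 1)).reduceByKey(_ + _) on a local Seq:
// pair each word with 1, group by the word, and sum the 1s per group.
val filtered = Seq("god", "created", "heaven", "god")
val counts = filtered
  .map(w => (w, 1))
  .groupBy(_._1)
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) }
// counts("god") == 2, counts("heaven") == 1
```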

2.4 Sort the result by count in descending order

val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
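Since sortByKey orders by the key, the pairs are swapped so the count comes first; a local sketch of the same swap-and-sort (with made-up counts):

```scala
// Put the count first, then sort descending, as sortByKey(false, 1)
// does on the RDD. The input pairs here are illustrative only.
val counts = Seq(("god", 2), ("heaven", 1), ("earth", 3))
val mostCommon = counts.map(p => (p._2, p._1)).sortBy(-_._1)
// mostCommon == Seq((3, "earth"), (2, "god"), (1, "heaven"))
```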

2.5 Save the result to HDFS

mostCommon.saveAsTextFile("Peter_Final")

2.6 Copy the result from HDFS to a local text file:

hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_Final/part-00000 > Peter_Final.txt

2.7 Show the top of the final sorted result:

head -n 100 Peter_Final.txt

To fix: numbers are still counted as words.

To do: loop over all the Bible books, merge them into a single text file, and run the solution again.
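For the first item, one possible fix (a sketch, not the tutorial's code) is to drop tokens that consist only of digits before counting; the sample tokens are made up:

```scala
// Hedged sketch for the "numbers are still counted" item:
// filter out tokens made up entirely of digits.
val tokens = Seq("god", "42", "heaven", "3")
val noNumbers = tokens.filterNot(t => t.nonEmpty && t.forall(_.isDigit))
// noNumbers == Seq("god", "heaven")
```

The same predicate could be applied as a filter on the exploded column before the counting step.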
