
Remove words from a column if they exist in a list

I have a DataFrame with a 'text' column in which many rows contain English sentences.

text

It is evening
Good morning
Hello everyone
What is your name
I'll see you tomorrow
I also have a variable of type List that holds some words, such as

val removeList = List("Hello", "evening", "because", "is")
I want to remove every word in the text column that also appears in removeList.

So my output should be

It
Good morning
everyone
What your name
I'll see you tomorrow
How can I do this with Spark Scala?

I wrote code like this:

val stopWordsList = List("Hello", "evening", "because", "is");
val df3 = sqlContext.sql("SELECT text FROM table");
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));

def cleanText(x:String, stopWordsList:List[String]):Any = {
  for(str <- stopWordsList) {
    if(x.contains(str)) {
      x.replaceAll(str, "")
    }
  }
}
But I get these errors:

Error:(44, 12) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val df4 = df3.map(x => cleanText(x.mkString, stopWordsList));

Error:(44, 12) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[String])org.apache.spark.sql.Dataset[String].
Unspecified value parameter evidence$6.
val df4 = df3.map(x => cleanText(x.mkString,stopWordsList));

社区小助手 2018-12-12 14:01:53 1800 0
1 answer
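  • 社区小助手 is an administrator of the Spark China community. I regularly post livestream recaps and useful articles, and I also collect the Spark questions and answers raised in the DingTalk group.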

    Try this approach, which goes through the DataFrame's underlying RDD.

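    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField}
    import spark.implicits._  // needed for toDF
    val df = Seq(("It is evening"),("Good morning"),("Hello everyone"),("What is your name"),("I'll see you tomorrow")).toDF("data")
    val removeList = List("Hello", "evening", "because", "is")
    // Remove each listed word as a whole word; "\\b" is a regex word boundary
    val rdd2 = df.rdd.map { x => val p = x.getAs[String](0); val k = removeList.foldLeft(p)((acc, t) => acc.replaceAll("\\b" + t + "\\b", "")); Row(x(0), k) }
    spark.createDataFrame(rdd2, df.schema.add(StructField("new1", StringType))).show(false)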
    Output:

    data                  | new1
    It is evening         | It
    Good morning          | Good morning
    Hello everyone        | everyone
    What is your name     | What your name
    I'll see you tomorrow | I'll see you tomorrow
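
    As for why the code in the question fails to compile: cleanText is declared to return Any, and Spark has no Encoder for Any, so Dataset.map cannot find one. The for loop also discards every replaceAll result, because a String is immutable and replaceAll returns a new one. Below is a minimal sketch of a version that returns a String. The object name CleanTextSketch, the use of Pattern.quote and the final whitespace cleanup are my own additions for illustration, not part of the question or of the answer above.

```scala
import java.util.regex.Pattern

object CleanTextSketch {
  // Hypothetical rewrite of the question's cleanText: it returns String instead of Any.
  def cleanText(text: String, stopWords: List[String]): String = {
    // foldLeft carries the partially cleaned string from one stop word to the next,
    // so no replaceAll result is thrown away.
    val stripped = stopWords.foldLeft(text) { (acc, word) =>
      // \b matches whole words only; Pattern.quote escapes regex metacharacters in the word.
      acc.replaceAll("\\b" + Pattern.quote(word) + "\\b", "")
    }
    // Assumption: collapse the gaps left behind by the removed words.
    stripped.trim.replaceAll("\\s+", " ")
  }

  def main(args: Array[String]): Unit = {
    val stopWordsList = List("Hello", "evening", "because", "is")
    val rows = List("It is evening", "Good morning", "Hello everyone",
      "What is your name", "I'll see you tomorrow")
    rows.foreach(r => println(cleanText(r, stopWordsList)))
  }
}
```

    With import spark.implicits._ in scope, a function like this should be usable as df3.map(row => cleanText(row.getString(0), stopWordsList)), since the result type is then String and an Encoder[String] is available.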
    2019-07-17 23:20:09