开发者社区> 问答> 正文

如何将数据集拆分为两个具有唯一和重复行的数据集?

我想在Spark scala Dataframe中获取重复记录。例如,我想根据3列(例如“id”,“name”,“age”)获取重复值。条件部分包含任何列数(动态输入)。基于列值,我想采取重复记录。

以下代码我试过了。我试过的只有一个属性。如果不止一列,我不知道该怎么做。

我的代码:

var s= "age|id|name " // Note- This is dynamic input. so it will increase or decrease
var columnNames= s.replace('|', ',')

val findDuplicateRecordsDF= spark.sql("SELECT * FROM " + dbname + "." + tablename)
findDuplicateRecordsDF.show()
findDuplicateRecordsDF.withColumn("count", count("*")

  .over(Window.partitionBy($"id"))) // here how to add more than one column?(Dynamic input) 
  .where($"count">1)
  .show()

输入数据帧:(findDuplicateRecordsDF.show())

   --------------------------------------------------------
   |  id   |  name | age |  phone      | email_id          |
   |-------------------------------------------------------|  
   |  3    | sam   | 23  |  9876543210 | sam@yahoo.com     | 
   |  7    | ram   | 27  |  8765432190 | ram@gmail.com     |
   |  3    | sam   | 28  |  9876543210 | sam@yahoo.com     | 
   |  6    | haris | 30  |  6543210777 | haris@gmail.com   |
   |  9    | ram   | 27  |  8765432130 | ram94@gmail.com   |
   |  6    | haris | 24  |  6543210777 | haris@gmail.com   | 
   |  4    | karthi| 26  |  4321066666 | karthi@gmail.com  | 
   --------------------------------------------------------

在这里,我将基于4列(身份证,姓名,电话,电子邮件)获取重复记录。以上是示例数据帧。原始数据框包含任何列数。

输出数据帧应该是

重复记录输出

   --------------------------------------------------------
   |  id   |  name | age |  phone      | email_id          |
   |-------------------------------------------------------|  
   |  3    | sam   | 23  |  9876543210 | sam@yahoo.com     | 
   |  3    | sam   | 28  |  9876543210 | sam@yahoo.com     | 
   |  6    | haris | 30  |  6543210777 | haris@gmail.com   |
   |  6    | haris | 24  |  6543210777 | haris@gmail.com   | 
    --------------------------------------------------------

唯一记录数据框输出:

  --------------------------------------------------------
 |  id   |  name | age |  phone      | email_id          |
 |-------------------------------------------------------|  
 |  7    | ram   | 27  |  8765432190 | ram@gmail.com     |
 |  9    | ram   | 27  |  8765432130 | ram94@gmail.com   |
 |  4    | karthi| 26  |  4321066666 | karthi@gmail.com  | 
  --------------------------------------------------------

展开
收起
社区小助手 2018-12-12 10:57:37 2868 0
1 条回答
写回答
取消 提交回答
  • 社区小助手是spark中国社区的管理员,我会定期更新直播回顾等资料和文章干货,还整合了大家在钉群提出的有关spark的问题及回答。

    可以使用窗口功能。看一下这个

    scala> val df = Seq((3,"sam",23,"9876543210","sam@yahoo.com"),(7,"ram",27,"8765432190","ram@gmail.com"),(3,"sam",28,"9876543210","sam@yahoo.com"),(6,"haris",30,"6543210777","haris@gmail.com"),(9,"ram",27,"8765432130","ram94@gmail.com"),(6,"haris",24,"6543210777","haris@gmail.com"),(4,"karthi",26,"4321066666","karthi@gmail.com")).toDF("id","name","age","phone","email_id")
    df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]

    scala> val dup_cols = List("id","name","phone","email_id");
    dup_cols: List[String] = List(id, name, phone, email_id)

    scala> df.createOrReplaceTempView("contact")

    scala> val dup_cols_qry = dup_cols.mkString(" count(*) over(partition by ", "," , " ) as cnt ")
    dup_cols_qry: String = " count(*) over(partition by id,name,phone,email_id ) as cnt "

    scala> val df2 = spark.sql("select *,"+ dup_cols_qry + " from contact ")
    df2: org.apache.spark.sql.DataFrame = [id: int, name: string ... 4 more fields]

    scala> df2.show(false)
    id name age phone email_id cnt
    4 karthi 26 4321066666 karthi@gmail.com 1
    7 ram 27 8765432190 ram@gmail.com 1
    9 ram 27 8765432130 ram94@gmail.com 1
    3 sam 23 9876543210 sam@yahoo.com 2
    3 sam 28 9876543210 sam@yahoo.com 2
    6 haris 30 6543210777 haris@gmail.com 2
    6 haris 24 6543210777 haris@gmail.com 2

    scala> df2.createOrReplaceTempView("contact2")
    //重复

    scala> spark.sql("select " + dup_cols.mkString(",") + " from contact2 where cnt = 2").show
    id name phone email_id
    3 sam 9876543210 sam@yahoo.com
    3 sam 9876543210 sam@yahoo.com
    6 haris 6543210777 haris@gmail.com
    6 haris 6543210777 haris@gmail.com

    // 独特

    scala> spark.sql("select " + dup_cols.mkString(",") + " from contact2 where cnt = 1").show
    id name phone email_id
    4 karthi 4321066666 karthi@gmail.com
    7 ram 8765432190 ram@gmail.com
    9 ram 8765432130 ram94@gmail.com
    2019-07-17 23:20:00
    赞同 展开评论 打赏
问答分类:
问答地址:
问答排行榜
最热
最新

相关电子书

更多
低代码开发师(初级)实战教程 立即下载
冬季实战营第三期:MySQL数据库进阶实战 立即下载
阿里巴巴DevOps 最佳实践手册 立即下载