Correlated subquery on a grouped expression - TreeNodeException: Binding attribute, tree: count(1)#382L

Suppose I am trying to compute statistics on some sample data consisting of pairs (values of a and b). Some pairs occur multiple times, others do not.

from pyspark.sql import Row

spark.createDataFrame([
    Row(a=5, b=10), Row(a=5, b=10), Row(a=5, b=10),
    Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10), Row(a=6, b=10),
    Row(a=5, b=11), Row(a=5, b=11),
    Row(a=6, b=12), Row(a=6, b=12), Row(a=6, b=12), Row(a=6, b=12),
    Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5), Row(a=5, b=5),
]).registerTempTable('mydata')
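
As a quick sanity check (not part of the original question), the per-a totals that the expected percentages below rely on can be verified directly:

spark.sql('SELECT a, COUNT(*) AS total FROM mydata GROUP BY a').show()
# expected: total=12 for a=5, total=10 for a=6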
First, I simply count how often each pair occurs:

spark.sql('''
SELECT a, b,
       COUNT(*) as count
FROM mydata AS o
GROUP BY a, b
''').show()
Output:

a  b  count
6  12 4
5  5  7
6  10 6
5  10 3
5  11 2

Now I want to add an extra column containing, for each pair, the percentage of its occurrences relative to the total number of pairs with the same a value. To do this, I tried adding a correlated subquery that computes the total:

spark.sql('''
SELECT a, b,
       COUNT(*) as count,
       (COUNT(*) / (
           SELECT COUNT(*) FROM mydata AS i WHERE o.a = i.a
       )) as percentage
FROM mydata AS o
GROUP BY a, b
''').show()
What I expected:

a  b  count percentage
6  12 4     0.4    --> 10 pairs exist with a=6 --> 4/10 = 0.4
5  5  7     0.5833 --> 12 pairs exist with a=5 --> 7/12 = 0.5833
6  10 6     0.6    --> ...
5  10 3     0.25
5  11 2     0.1666

What I got instead:

py4j.protocol.Py4JJavaError: An error occurred while calling o371.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: count(1)#382L

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
[...]
Caused by: java.lang.RuntimeException: Couldn't find count(1)#382L in [a#305L,b#306L,count(1)#379L]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:97)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:91)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 80 more

This sounds somewhat confusing - somehow pyspark wants to access the count from the inner query?

Is there something wrong with my subquery syntax?

社区小助手 2018-12-05 14:58:19
1 Answer
  • 社区小助手 is the administrator of the Spark China community. I regularly post livestream recaps and other useful material, and compile the Spark questions and answers raised in the DingTalk group.

    On the first table, you can compute the percentage with a window function: sum(count) over (partition by a) sums count per value of a without reducing the number of rows in the result, which lets you divide another column by it directly:

    spark.sql('''
    SELECT a, b,
           COUNT(*) as count
    FROM mydata AS o
    GROUP BY a, b
    ''').registerTempTable('count')

    spark.sql('''
    SELECT *,
           count / sum(count) over (partition by a) as percentage
    FROM count
    ''').show()
    a  b  count percentage
    6  12 4     0.4
    6  10 6     0.6
    5  5  7     0.5833333333333334
    5  10 3     0.25
    5  11 2     0.16666666666666666
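
    For reference, the same result can be produced in a single statement, since Spark SQL evaluates window functions after GROUP BY aggregation and allows an aggregate inside a window function; this avoids the intermediate temp table (a minimal sketch, not from the original answer):

    spark.sql('''
    SELECT a, b,
           COUNT(*) as count,
           COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY a) as percentage
    FROM mydata
    GROUP BY a, b
    ''').show()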
    2019-07-17 23:18:24