hdfs

游客izljdlkgbdwfc

Does Flink SQL support checkpoints?

I am using Flink SQL for real-time computation (deployment mode is on YARN). I want to use checkpoints and configured the following in flink-conf.yaml:
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/flink-checkpoints
state.savepoints.dir: hdfs:///flink/flink-savepoints
state.checkpoints.num-retained: 10
There are now two problems: 1. On HDFS, the checkpoint directory only gets a single folder created at startup. 2. Savepoints do not capture the data state: for example, a simple word count cannot restore its accumulated counts. Could someone tell me whether the SQL Client's checkpoint support is not as fully exposed as it is in code, and whether implementing this requires custom development? Thanks.
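
For reference, a minimal sketch of how checkpointing is normally switched on in the DataStream/Scala API: no periodic checkpoints are taken until an interval is enabled, even with the state backend and directories above configured (assumes the Flink 1.x Scala API; the interval value is illustrative). Newer Flink releases also expose an execution.checkpointing.interval option in flink-conf.yaml, which, depending on the version, is the route available to SQL Client jobs.

    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Without an enabled interval, no periodic checkpoints are triggered,
    // regardless of the state.backend / state.checkpoints.dir settings.
    env.enableCheckpointing(60000L, CheckpointingMode.EXACTLY_ONCE)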

游客iwhrjhvjoyqts

How do I set the number of tasks when reading Hive tables and running SQL through spark-thriftserver?

When I use spark-thriftserver and run SQL through beeline, the Thrift Server scans every partition of the queried Hive table (every path on HDFS) and automatically creates one task per path. Can this task count be adjusted, or how can the process be optimized?
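
A hedged sketch of the session options that usually control split (and therefore task) counts for file-based scans. The keys are standard Spark SQL options, but whether they take effect on a Hive table depends on the Spark version and on whether the table is read through Spark's native Parquet/ORC reader; through beeline the same keys can be set with SET before running the query. The table name is hypothetical.

    import org.apache.spark.sql.SparkSession

    object ScanSplitTuning {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("scan-split-tuning")
          .enableHiveSupport()
          .getOrCreate()

        // Larger target split size -> fewer tasks; small files are coalesced
        // according to the per-file open cost.
        spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
        spark.conf.set("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)

        spark.sql("SELECT COUNT(*) FROM some_db.some_partitioned_table").show()
      }
    }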

八戒八戒2333

Flink can read HDFS data when I run it locally from IDEA, but after packaging the project and submitting it to the Flink cluster it cannot read HDFS data and fails with the error below. Why is that?

The program finished with the following exception: org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: 74a2d820909fee963c4dea371b5c236c) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483) at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66) at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654) at org.myflink.quickstart.WordCount$.main(WordCount.scala:20) at org.myflink.quickstart.WordCount.main(WordCount.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:423) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813) at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050) at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126) Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265) ... 19 more Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'hdfs'. The scheme is not directly supported by Flink and no Hadoop file system to support this scheme could be loaded. at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:403) at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318) at org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction.run(ContinuousFileMonitoringFunction.java:196) at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93) at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57) at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Hadoop is not in the classpath/dependencies. at org.apache.flink.core.fs.UnsupportedSchemeFactory.create(UnsupportedSchemeFactory.java:64) at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:399) ... 8 more The local bashrc already sets HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop, and flink-conf.yaml is also configured with env.hadoop.conf.dir=/usr/local/hadoop/etc/hadoop. What is causing this?

社区小助手

Spark [Q&A Collection]

How to use Spark to write writeStream data from a Kafka topic to HDFS? https://yq.aliyun.com/ask/493211
What happens during the "downtime" when Spark reads a large dataset on S3? https://yq.aliyun.com/ask/493212
Reading from Redshift into a Spark Dataframe (the Spark-Redshift module) https://yq.aliyun.com/ask/493215
Changing the AWS credentials in pyspark's Hadoop configuration at runtime, after the Spark context has been initialized https://yq.aliyun.com/ask/493217
Window.rowsBetween - only consider rows that satisfy a certain condition (e.g. not null) https://yq.aliyun.com/ask/493220
Saving an RDD to HDFS directly with saveAsTextFile garbles Chinese characters, but printing the same RDD with foreach in the console displays them correctly. How can this be fixed? https://yq.aliyun.com/ask/494418
How can I check the memory usage of Spark Structured Streaming? https://yq.aliyun.com/ask/494417
With Spark 2.3 Structured Streaming, checkpointing keeps writing small files to HDFS and the block count has reached the millions. How can this be optimized? https://yq.aliyun.com/ask/494415
Spark Streaming with Kafka: on the web UI's Streaming tab, many batches show as queued. Have those batches already read their data from Kafka into RDDs and are waiting to be processed, or have they not yet read the data from Kafka into RDDs? https://yq.aliyun.com/ask/493702
Why does dropDuplicates() fail with Caused by: java.lang.NoSuchMethodError: org.codehaus.commons.compiler.Location.(Ljava/lang/String;II)V ? https://yq.aliyun.com/ask/493700
My data in Hive is 16 GB; after generating HFiles with importtsv and importing them into HBase, the data grows to over 130 GB. Is there a better approach? https://yq.aliyun.com/ask/493698
How to get logs when connecting to the Spark Thrift Server over JDBC? https://yq.aliyun.com/ask/493582
How can Spark extract only the JSON data from a line? https://yq.aliyun.com/ask/493581
pyspark - finding max and min in JSON streaming data using createDataFrame https://yq.aliyun.com/ask/493234
How to compute the sum of values per unique ID in a Spark Dataframe? https://yq.aliyun.com/ask/493231
How to load a directory of CSV files into HDFS as Parquet? https://yq.aliyun.com/ask/493224
Unable to initialize a graph on Datastax using Spark https://yq.aliyun.com/ask/493222
Counting users per window with PySpark https://yq.aliyun.com/ask/493221
The SQL statement does not support the delete operation; what should I do if I want to perform a delete? https://yq.aliyun.com/ask/494420
Spark Streaming with Kafka: after packaging into a jar (with the third-party dependencies included) and running it on the cluster, it keeps reporting that the StringDecoder class cannot be found https://yq.aliyun.com/ask/494421
A JSON string has keys with the same name but different case; parsing with play.api.libs.json.Json.parse raises no error, but when spark-sql uses org.openx.data.jsonserde.JsonSerDe it lowercases the keys and the putOnce function then fails with Duplicate key https://yq.aliyun.com/ask/494423
How to compress a Spark DataFrame when writing it to HDFS? https://yq.aliyun.com/ask/495552
When using Spark on Hive to insert data into Hive dynamically, the Hive table ends up with many files; what settings can be adjusted? https://yq.aliyun.com/ask/495927

游客4c3lpvjn33j5i

雅拓

Flink reports a checkpoint error while running a job

Flink version is 1.8, standalone cluster mode with 3 machines. When the job runs it keeps hitting HDFS permission errors, and the HDFS directory permissions look fine to me. AsynchronousException{java.lang.Exception: Could not materialize checkpoint 27 for operator Source: Custom Source (2/2).} at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153) at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947) at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.Exception: Could not materialize checkpoint 27 for operator Source: Custom Source (2/2). at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942) ... 6 more Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to null in order to obtain the stream state handle at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:394) at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53) at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853) ... 5 more Caused by: java.io.IOException: Could not flush and close the file system output stream to null in order to obtain the stream state handle at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:326) at org.apache.flink.runtime.state.DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackendSnapshotStrategy.java:179) at org.apache.flink.runtime.state.DefaultOperatorStateBackendSnapshotStrategy$1.callInternal(DefaultOperatorStateBackendSnapshotStrategy.java:108) at org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:75) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:391) ... 7 more Caused by: java.io.IOException: Could not open output stream for state backend at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:359) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.flush(FsCheckpointStreamFactory.java:226) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:301) ... 
12 more Caused by: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/flink1.8/flink-checkpoints/e6db6b009944904d030f4ac253fcb59a/chk-27/ba5ce9a7-fb30-4a87-8898-2cc6b767efab":hdfs:hbase:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1745) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1729) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1712) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2602) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2537) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2421) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:621) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1940) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043) at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1841) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1698) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1633) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448) at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:891) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:788) at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141) at 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37) at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.create(SafetyNetWrapperFileSystem.java:126) at org.apache.flink.core.fs.EntropyInjector.createEntropyAware(EntropyInjector.java:61) at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.createStream(FsCheckpointStreamFactory.java:348) ... 14 more Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/flink1.8/flink-checkpoints/e6db6b009944904d030f4ac253fcb59a/chk-27/ba5ce9a7-fb30-4a87-8898-2cc6b767efab":hdfs:hbase:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1745) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1729) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1712) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2602) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2537) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2421) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:621) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:397) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1940) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043) at org.apache.hadoop.ipc.Client.call(Client.java:1476) at org.apache.hadoop.ipc.Client.call(Client.java:1413) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) at com.sun.proxy.$Proxy10.create(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy11.create(Unknown Source) at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1836) ... 29 more

开源大数据EMR

Can Delta be understood as a row-level database built on HDFS, and does it provide a merge mechanism to deal with the small files that updates produce on HDFS?

Can Delta be understood as a row-level database built on HDFS? And for the small files that updates produce on HDFS, is the solution that it provides a merge mechanism?
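
For the small-file part of the question, the pattern usually cited for Delta is to compact a table (or a partition of it) by rewriting it with fewer files. A hedged sketch, assuming the delta-core dependency is on the classpath, an existing SparkSession named spark, and a hypothetical table path:

    val path = "hdfs:///delta/events"   // hypothetical Delta table location

    spark.read
      .format("delta")
      .load(path)
      .repartition(16)                  // target number of output files
      .write
      .format("delta")
      .mode("overwrite")
      .option("dataChange", "false")    // marks the rewrite as compaction-only
      .save(path)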

开源大数据EMR

Hive/Impala jobs fail when reading a Parquet table written by SparkSQL

A Hive/Impala job fails when reading a Parquet table imported with SparkSQL (the table contains Decimal columns): Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://…/…/part-00000-xxx.snappy.parquet
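
This error is commonly associated with how Spark lays out Decimal columns in Parquet. A frequently cited workaround, hedged because it requires rewriting the data, is to enable the legacy Parquet format before re-importing; the table names below are hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Write Decimal columns in the legacy Parquet layout that older Hive/Impala readers expect.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

    spark.table("src_db.src_table")       // hypothetical source
      .write
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("dst_db.dst_table")    // hypothetical re-imported table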

mz111

Flink on YARN starts only one TaskManager

In Flink on YARN mode, when I submit a single job, the submission succeeds but only one TaskManager starts; the other TaskManagers never come up, so their resources cannot be used. Why is that? My launch command is: flink -m yarn-cluster -yn 3 sse.jar. The command is supposed to start three TaskManagers, but only one starts. My configuration is as follows:
fs.hdfs.hadoopconf: /etc/hadoop/conf
akka.ask.timeout: 20s
# The heap size for the JobManager JVM
jobmanager.heap.size: 2048m
# The heap size for the TaskManager JVM
taskmanager.heap.size: 4096m
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 12
slaves: 172.16.0.18 172.16.0.19 172.16.0.20 172.16.0.17

小六码奴

Cannot run a Python job on an EMR Spark cluster

I am trying to submit a Python job to an AWS EMR Spark cluster. My settings in the spark-submit options section are: --master yarn --driver-memory 4g --executor-memory 2g. However, the job failed while running. Here is the error log file:
19/04/09 10:40:25 INFO RMProxy: Connecting to ResourceManager at ip-172-31-53-241.ec2.internal/172.31.53.241:8032
19/04/09 10:40:26 INFO Client: Requesting a new application from cluster with 3 NodeManagers
19/04/09 10:40:26 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
19/04/09 10:40:26 INFO Client: Will allocate AM container, with 4505 MB memory including 409 MB overhead
19/04/09 10:40:26 INFO Client: Setting up container launch context for our AM
19/04/09 10:40:26 INFO Client: Setting up the launch environment for our AM container
19/04/09 10:40:26 INFO Client: Preparing resources for our AM container
19/04/09 10:40:26 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/04/09 10:40:29 INFO Client: Uploading resource file:/mnt/tmp/spark-a8e941b7-f20f-46e5-8b2d-05c52785bd22/__spark_libs__3200812915608084660.zip -> hdfs://ip-172-31-53-241.ec2.internal:8020/user/hadoop/.sparkStaging/application_1554806206610_0001/__spark_libs__3200812915608084660.zip
19/04/09 10:40:32 INFO Client: Uploading resource s3://spark-yaowen/labelp.py -> hdfs://ip-172-31-53-241.ec2.internal:8020/user/hadoop/.sparkStaging/application_1554806206610_0001/labelp.py
19/04/09 10:40:32 INFO S3NativeFileSystem: Opening 's3://spark-yaowen/labelp.py' for reading
19/04/09 10:40:32 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-172-31-53-241.ec2.internal:8020/user/hadoop/.sparkStaging/application_1554806206610_0001/pyspark.zip
19/04/09 10:40:33 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.7-src.zip -> hdfs://ip-172-31-53-241.ec2.internal:8020/user/hadoop/.sparkStaging/application_1554806206610_0001/py4j-0.10.7-src.zip
19/04/09 10:40:34 INFO Client: Uploading resource file:/mnt/tmp/spark-a8e941b7-f20f-46e5-8b2d-05c52785bd22/__spark_conf__6746542371431989978.zip -> hdfs://ip-172-31-53-241.ec2.internal:8020/user/hadoop/.sparkStaging/application_1554806206610_0001/__spark_conf__.zip
19/04/09 10:40:34 INFO SecurityManager: Changing view acls to: hadoop
19/04/09 10:40:34 INFO SecurityManager: Changing modify acls to: hadoop
19/04/09 10:40:34 INFO SecurityManager: Changing view acls groups to: 
19/04/09 10:40:34 INFO SecurityManager: Changing modify acls groups to: 
19/04/09 10:40:34 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
19/04/09 10:40:36 INFO Client: Submitting application application_1554806206610_0001 to ResourceManager
19/04/09 10:40:37 INFO YarnClientImpl: Submitted application application_1554806206610_0001
19/04/09 10:40:38 INFO Client: Application report for application_1554806206610_0001 (state: ACCEPTED)
19/04/09 10:40:38 INFO Client: client token: N/A diagnostics: AM container is launched, waiting for AM container to Register with RM ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1554806436561 final status: UNDEFINED tracking URL: http://ip-172-31-53-241.ec2.internal:20888/proxy/application_1554806206610_0001/ user: hadoop
19/04/09 10:40:39 INFO Client: Application report for application_1554806206610_0001 (state: ACCEPTED)
19/04/09 10:40:40 INFO Client: Application report for application_1554806206610_0001 (state: ACCEPTED)
19/04/09 10:40:41 INFO Client: Application report for application_1554806206610_0001 (state: ACCEPTED)
19/04/09 10:40:42 INFO Client: Application report for application_1554806206610_0001 (state: ACCEPTED)
19/04/09 10:40:43 INFO Client: Application report for application_15548062066

小六码奴

Hadoop: copying results from HDFS to S3

I want to copy the results from HDFS to S3, but I am running into some problems. Here is the code (--steps): { "Name":"AAAAA", "Type":"CUSTOM_JAR", "Jar":"command-runner.jar", "ActionOnFailure":"CONTINUE", "Args": [ "s3-dist-cp", "--src", "hdfs:///seqaddid_output", "--dest", "s3://wuda-notebook/seqaddid" ] } And here is the log: 2019-04-12 03:01:23,571 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: -libjars /usr/share/aws/emr/s3-dist-cp/lib/commons-httpclient-3.1.jar,/usr/share/aws/emr/s3-dist-cp/lib/commons-logging-1.0.4.jar,/usr/share/aws/emr/s3-dist-cp/lib/guava-18.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp-2.10.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src hdfs:///seqaddid_output/ --dest s3://wuda-notebook/seqaddid 2019-04-12 03:01:24,196 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --src hdfs:///seqaddid_output/ --dest s3://wuda-notebook/seqaddid 2019-04-12 03:01:24,203 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Using output path 'hdfs:/tmp/4f93d497-fade-4c78-86b9-59fc3da35b4e/output' 2019-04-12 03:01:24,263 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-east-1f 2019-04-12 03:01:24,664 FATAL com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system java.io.FileNotFoundException: File does not exist: hdfs:/seqaddid_output at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1444) at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1452) at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:795) at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90) at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:234) at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

逸新

Can Hadoop use OSS and HDFS at the same time?

Within a single Hadoop cluster, is it possible to support both the OSS and HDFS file systems at the same time, and can a MapReduce job read data from both file systems simultaneously?
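
Hadoop's FileSystem abstraction resolves each path by its URI scheme, so a single job can address both stores as long as the OSS connector and its fs.oss.* credentials are configured. A minimal Scala sketch (the connector setup is assumed and the paths are hypothetical):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()   // picks up core-site.xml, including any fs.oss.* settings
    val hdfs = FileSystem.get(new URI("hdfs:///"), conf)
    val oss  = FileSystem.get(new URI("oss://my-bucket/"), conf)

    // The same process can list or read from both file systems.
    println(hdfs.exists(new Path("hdfs:///data/input")))
    println(oss.exists(new Path("oss://my-bucket/data/input")))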

游客saqcxjoyi2n6i

When does Spark partitioning actually happen?

Does Spark partition the data when going from the map tasks to the reduce tasks, or does partitioning already start at sc.textFile? I ran a test: with sc.textFile and the default partitioning, printing the contents of each partition shows they are not partitioned by a hash algorithm, but after applying a shuffle operator and printing the partitions again, they are hash-partitioned. So I am confused. If partitioning starts at sc.textFile, then suppose there are 3 blocks and I request 5 partitions in sc.textFile; the 3 blocks would have to be split into 5 partitions, which would consume a lot of memory and network (each map fetching one particular partition from the various blocks). That does not seem reasonable, and then repartitioning again at the shuffle operator would feel slow. My guess is that sc.textFile, when reading the HDFS data, simply gives each partition an even share of the data (for example, 3 blocks totalling 384 MB with 5 partitions gives 76.8 MB each, and each map reads its 76.8 MB), and only when a shuffle operator runs does hash partitioning happen and files get written, with each reducer then fetching its partition's values from every node. That would also make sense: in the end there are five part-0000 files, and the shuffle uses 5 × 5 = 25 buckets.
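
A small sketch that matches this reading: the partitions of textFile come from Hadoop input splits and carry no partitioner, and a HashPartitioner only appears once a shuffle operator runs (the path and numbers are hypothetical, and an existing SparkContext sc is assumed).

    val lines = sc.textFile("hdfs:///data/input", minPartitions = 5)
    println(lines.getNumPartitions)   // typically >= 5, derived from the input splits
    println(lines.partitioner)        // None: no hash partitioning at read time

    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 5)
    println(counts.partitioner)       // Some(HashPartitioner(5)): hashing happens at the shuffle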

hbase小能手

Problems connecting to file or HDFS

Has anyone run into this problem before? It happens whether I connect with file or hdfs.

游客4c3lpvjn33j5i

hbase小能手

HBase write fails with "There is a hole in the region chain"

"There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole."

社区小助手

Could anyone advise: when using Structured Streaming, checkpointing keeps writing small files to HDFS. How is this usually handled?

Could anyone advise: when using Structured Streaming, checkpointing keeps writing small files to HDFS. How is this usually handled?
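
One knob that directly affects how fast checkpoint files accumulate is the micro-batch rate: each batch writes offset and commit metadata under the checkpoint location, so a longer trigger interval means fewer files. A hedged sketch (the sink, paths and interval are hypothetical, and df is assumed to be an existing streaming DataFrame):

    import org.apache.spark.sql.streaming.Trigger

    val query = df.writeStream
      .format("parquet")
      .option("path", "hdfs:///output/stream")
      .option("checkpointLocation", "hdfs:///checkpoints/app1")
      .trigger(Trigger.ProcessingTime("5 minutes"))   // fewer micro-batches -> fewer checkpoint files
      .start()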

社区小助手

How to compress a Spark DataFrame when writing it to HDFS

How do I compress a Spark DataFrame when writing it to HDFS? The output should be written as txt files.
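
A hedged sketch for compressed txt output: the text source writes a single string column, so concatenate the columns first and then choose a codec through the compression option (the separator and output path are hypothetical, and df is assumed to be an existing DataFrame).

    import org.apache.spark.sql.functions.{col, concat_ws}

    df.select(concat_ws("\t", df.columns.map(col): _*).as("value"))
      .write
      .option("compression", "gzip")   // e.g. gzip or bzip2
      .text("hdfs:///output/compressed_txt")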

李博 bluemind

How does HDFS perform when reading and writing videos under 100 MB?

How does HDFS perform when reading and writing videos under 100 MB?

指转流年

Cannot access the HDFS web page after deploying Hadoop successfully on Alibaba Cloud

Hadoop is configured successfully and has been started, but the HDFS web page cannot be reached. The Hadoop version is 3.2.0, and the listening ports are as shown in the screenshot. This is urgent; could someone please help take a look?