2017-09-21

Spark worker hits a fatal error with more than 32 GB of memory

I have three slaves in a standalone Spark cluster; each slave has 48 GB of RAM. Whenever I allocate more than 31 GB of RAM (for example, 32 GB or more) to my executors:

.config("spark.executor.memory", "44g") 

the executors are killed, without much information, during a join of two large DataFrames.

17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED 
17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705 

On the driver side:

17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM 
17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. 
17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5) 
17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested 
17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0 
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster. 
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 ! 
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 ! 
17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING 
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 ! 
17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 ! 
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None) 
17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster. 
17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor 

The Spark master's log shows that this keeps recurring after the executor has EXITED, and the driver output reports "Missing an output location for shuffle". The Spark worker log shows the executor finishing with exit code 134:

17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134 

The only real clue is in the application's error log, which shows a fatal error detected by the JRE:

# 
# A fatal error has been detected by the Java Runtime Environment: 
# 
# SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700 
# 
# JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11) 
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64) 
# Problematic frame: 
# V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3 
# 
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again 
# 
# If you would like to submit a bug report, please visit: 
# http://bugreport.java.com/bugreport/crash.jsp 
# 

--------------- T H R E A D --------------- 

Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308] 

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008 

As long as I allocate 31 GB of RAM (or less) to each executor, the program works fine. Has anyone run into this problem?

Answer


A 44 GB heap can actually give you less usable capacity than a 31 GB heap because of how Java stores object references. For heap sizes above 32 GB, the JVM has to switch to 64-bit object references, so every object takes up more space. For details, see: http://java-performance.info/over-32g-heap-java/
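The effect can be made concrete with some back-of-the-envelope arithmetic (a sketch, not a JVM measurement; the 4-byte vs. 8-byte figures are the standard compressed-oops behavior):

```python
# Why a 44 GB heap can hold *fewer* object references than a 31 GB heap.
# Below ~32 GB, HotSpot uses 4-byte "compressed oops" references; above
# that threshold it falls back to 8-byte references, so reference-heavy
# data structures take roughly twice the space per pointer.

GB = 1024 ** 3
COMPRESSED_REF = 4   # bytes per reference with compressed oops (< 32 GB heap)
FULL_REF = 8         # bytes per reference without compressed oops (>= 32 GB heap)

def max_references(heap_gb: int, ref_bytes: int) -> int:
    """Upper bound on how many bare references would fit in the heap."""
    return heap_gb * GB // ref_bytes

refs_31g = max_references(31, COMPRESSED_REF)   # ~8.3 billion references
refs_44g = max_references(44, FULL_REF)         # ~5.9 billion references

print(refs_31g > refs_44g)  # the smaller heap wins for reference-heavy data
```

So for workloads dominated by object references (such as large shuffle/join structures), the 31 GB configuration can effectively address more objects than the 44 GB one.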

In my experience, you should either stay below 32 GB or go much higher (e.g., 50 GB). Usually it is more cost-efficient to run multiple JVMs, each with a heap under 32 GB. With 48 GB of RAM per node, I would stick with a 31 GB heap.
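One way to follow that advice on a standalone cluster is to run more than one worker daemon per node, each serving executors with a sub-32 GB heap. A sketch of what this might look like in `conf/spark-env.sh` (the variable names are the standard standalone-mode ones; the exact sizes are illustrative for a 48 GB node, leaving headroom for the OS):

```shell
# conf/spark-env.sh on each slave (illustrative sizing, not a recommendation)
export SPARK_WORKER_INSTANCES=2   # run two worker daemons per machine
export SPARK_WORKER_MEMORY=22g    # memory each worker may hand to executors
```

The application would then request executors with `.config("spark.executor.memory", "22g")` (or similar), keeping each JVM safely under the 32 GB compressed-oops threshold.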


Thanks for the explanation. – Jonathan