
I am running a Giraph application on EMR. The containers die while flushing after completing a superstep, and the whole application hangs - Giraph

I am using a cluster with 1 master and 10 slaves, all of them m3.2xlarge machines.

Basically, the application is a BFS over the Spanish edition of Wikipedia (the Wikipedia data was adapted so it fits Giraph's input format).

I run the application the following way:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.algoritmos.masivos.BusquedaDeCaminosNavegacionalesWikiquotesMasivo /tmp/vertices.txt 4 [email protected] 1 ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 10 -yh 11500 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4 
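For reference, my reading of the options that matter here (the -yh to JVM heap mapping is confirmed by the process dump further down):

-w 10        # one Giraph worker per slave node
-yh 11500    # task heap in MB; shows up as java -Xmx11500M -Xms11500M inside the container
-ca ...      # custom Giraph configuration: metrics on, out-of-core messages, static graph, 4 input and 4 output threads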

I can run the application successfully with 3 supersteps, but when I try to do 4 supersteps the application fails, the containers get killed, and it breaks with the following:

16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-147.sa-east-1.compute.internal:9103 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1471231949464_0001_01_000005 
16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-145.sa-east-1.compute.internal:9103 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000009 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000011 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000004 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000010 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000006 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000007 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000008 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000005 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000002 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000012 
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000003 
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=1 
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000008, state=COMPLETE, exitStatus=143, diagnostics=Container [pid=4455,containerID=container_1471231949464_0001_01_000008] is running beyond physical memory limits. Current usage: 11.4 GB of 11.3 GB physical memory used; 12.6 GB of 56.3 GB virtual memory used. Killing container. 
Dump of the process-tree for container_1471231949464_0001_01_000008 : 
     |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE 
     |- 4459 4455 4455 4455 (java) 13568 5567 13419675648 2982187 java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 
     |- 4455 2706 4455 4455 (bash) 0 0 115875840 807 /bin/bash -c java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 1>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stdout.log 2>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stderr.log 

Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143 

16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: After completion of one conatiner. current status is: completedCount :1 containersToLaunch :11 successfulCount :0 failedCount :1 
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=7 
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) 
     at org.apache.hadoop.util.Shell.run(Shell.java:418) 
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) 
     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) 
     at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 


Container exited with a non-zero exit code 1 

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000012, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) 
     at org.apache.hadoop.util.Shell.run(Shell.java:418) 
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) 
     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) 
     at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 


Container exited with a non-zero exit code 1 

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000006, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) 
     at org.apache.hadoop.util.Shell.run(Shell.java:418) 
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) 
     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) 
     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) 
     at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
     at java.lang.Thread.run(Thread.java:745) 


Container exited with a non-zero exit code 1 

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000007, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
     at org.apache.hadoop.util.Shell.runCommand(Shell.java:501) 
     at org.apache.hadoop.util.Shell.run(Shell.java:418) 
     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655) 
     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200) 

So it looks like container 8 ran into a memory problem, and these are the last log lines of container 8 (retrieved through the Giraph application master),

which say, if I understand this right:

16/08/15 03:46:52 INFO graph.ComputeCallable: call: Computation took 23.90834 secs for 10 partitions on superstep 3. Flushing started 
16/08/15 03:46:52 INFO worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 3 Memory (free/total/max) = 4516.47M/10619.50M/10619.50M 
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 1307 open requests, waiting for it to be <= 0, MBytes/sec received = 0.0029, MBytesReceived = 0.0678, ave received req MBytes = 0, secs waited = 23.332 
MBytes/sec sent = 143.2912, MBytesSent = 3343.4141, ave sent req MBytes = 0.4999, secs waited = 23.332 
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: 548 requests for taskId=10, 504 requests for taskId=0, 251 requests for taskId=5, 1 requests for taskId=4, 1 requests for taskId=7, 1 requests for taskId=8, 

So the container had 4516.47M free before the flush, consumed all of those 4516.47M while doing it, and got killed the moment it wanted more. Is that what the Giraph AM log is telling me?

I don't understand why the flush needs so much memory. Doesn't it basically store the results to disk for the next superstep? If so, in theory it should barely need any memory at all.
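(One reading of the logInfoAboutOpenRequests lines above: the "flush" is the worker waiting for its outstanding Netty send requests to drain, and those buffered requests sit on the heap until the receivers confirm them. If that is right, a knob that might bound them, which I have not verified helps here, would be the open-request cap, e.g. adding to the -ca list:

giraph.waitForRequestsConfirmation=true,giraph.maxNumberOfOpenRequests=5000

so each worker blocks once 5000 requests are in flight instead of buffering them all.)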

Answer


It looks like the flush process itself can consume memory. Adding more memory to each container was the only solution.
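For concreteness, a sketch of that fix against the command from the question. The number is illustrative, not the one actually used; the point, visible in the dump above, is that an -Xmx11500M heap inside an 11.3 GB physical limit leaves the JVM's off-heap memory (Netty buffers, thread stacks, general JVM overhead) nowhere to go:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ... -w 10 -yh 13312 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4

where ... stands for the unchanged class and path arguments, and the assumption (mine, not verified on EMR) is that raising -yh also raises the container allotment Giraph requests from YARN. The m3.2xlarge nodes have 30 GB of RAM, so one worker per node can afford the extra.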