2016-09-15 4 views
6

Running a Windows batch file through pipe in Apache Spark

I have a requirement to run a Windows batch file on multiple nodes of a Spark cluster using Apache Spark.

Can I do this using Apache Spark's piping concept?

I have already executed a shell file using piping in Spark on an Ubuntu machine. My code for that, shown below, runs fine:

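from pyspark import SparkFiles

data = ["hi", "hello", "how", "are", "you"]
distScript = "/home/aawasthi/echo.sh"
distScriptName = "echo.sh"
sc.addFile(distScript)  # ship the script to every node
RDDdata = sc.parallelize(data)
# pipe each element through the script and collect its output lines
print RDDdata.pipe(SparkFiles.get(distScriptName)).collect()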
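
For readers unfamiliar with `pipe`: it writes the elements of each partition as lines to the standard input of an external process and turns the lines of that process's standard output into the elements of the new RDD. The following is a minimal local sketch of that behaviour without Spark. The helper `pipe_partition` and the Python one-liner standing in for `echo.sh` are hypothetical, since the contents of `echo.sh` are not shown in the question.

```python
import subprocess
import sys


def pipe_partition(elements, command):
    """Send each element as one line to the command's stdin and
    return the lines the command writes to stdout."""
    proc = subprocess.run(
        command,
        input="\n".join(elements) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()


# Hypothetical stand-in for echo.sh: copies stdin to stdout unchanged.
cat_cmd = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

result = pipe_partition(["hi", "hello", "how", "are", "you"], cat_cmd)
print(result)
```

In Spark the same exchange happens once per partition on whichever node runs the task, which is why the script has to be shipped to the nodes with `sc.addFile` first.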

I tried to adapt the same code to run a Windows batch file on a Windows machine with Spark installed (1.6, pre-built for Hadoop 2.6). But it gives an error at the sc.addFile step. The code is below:

batchFile = "D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv" 
batchFileName = "runOpenCv" 
sc.addFile(batchFile) 

Spark throws the following error:

Py4JJavaError        Traceback (most recent call last) 
<ipython-input-11-9e13c265cbae> in <module>() 
----> 1 sc.addFile(batchFile)

Py4JJavaError: An error occurred while calling o160.addFile. 
: java.io.FileNotFoundException: Added file D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv does not exist. 
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1364) 
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Thread.java:745) 

The batch file does exist at the specified location, though.

UPDATE: I added .bat as the extension in batchFile and batchFileName, and file:/// at the start of the file path. The modified code is:

from pyspark import SparkFiles 
from pyspark import SparkContext  
sc  
batchFile = "file:///D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv.bat" 
batchFileName = "runOpenCv.bat" 
sc.addFile(batchFile) 
RDDdata = sc.parallelize(["hi","hello"]) 
print SparkFiles.get("runOpenCv.bat") 
print RDDdata.pipe(SparkFiles.get(batchFileName)).collect() 

Now the addFile step no longer gives an error, and print SparkFiles.get("runOpenCv.bat") prints the path
C:\Users\abhilash.awasthi\AppData\Local\Temp\spark-c0f383b1-8365-4840-bd0f-e7eb46cc6794\userFiles-69051066-f18c-45dc-9610-59cbde0d77fe\runOpenCv.bat
so the file has been added. However, the last step of the code throws the error below:

Py4JJavaError        Traceback (most recent call last) 
<ipython-input-6-bf2b8aea3ef0> in <module>() 
----> 1 print RDDdata.pipe(SparkFiles.get(batchFileName)).collect() 

D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.pyc in collect(self) 
    769   """ 
    770   with SCCallSiteSync(self.context) as css: 
--> 771    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 
    772   return list(_load_from_socket(port, self._jrdd_deserializer)) 
    773 

D:\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py in __call__(self, *args) 
    811   answer = self.gateway_client.send_command(command) 
    812   return_value = get_return_value(
--> 813    answer, self.gateway_client, self.target_id, self.name) 
    814 
    815   for temp_arg in temp_args: 

D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\sql\utils.pyc in deco(*a, **kw) 
    43  def deco(*a, **kw): 
    44   try: 
---> 45    return f(*a, **kw) 
    46   except py4j.protocol.Py4JJavaError as e: 
    47    s = e.java_exception.toString() 

D:\spark-1.6.2-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name) 
    306     raise Py4JJavaError(
    307      "An error occurred while calling {0}{1}{2}.\n". 
--> 308      format(target_id, ".", name), value) 
    309    else: 
    310     raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 317, in func 
    return f(iterator) 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 715, in func 
    shlex.split(command), env=env, stdin=PIPE, stdout=PIPE) 
    File "C:\Anaconda2\lib\subprocess.py", line 710, in __init__ 
    errread, errwrite) 
    File "C:\Anaconda2\lib\subprocess.py", line 958, in _execute_child 
    startupinfo) 
WindowsError: [Error 2] The system cannot find the file specified 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

Driver stacktrace: 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at scala.Option.foreach(Option.scala:236) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926) 
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405) 
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 111, in main 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\worker.py", line 106, in process 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 317, in func 
    return f(iterator) 
    File "D:\spark-1.6.2-bin-hadoop2.6\python\pyspark\rdd.py", line 715, in func 
    shlex.split(command), env=env, stdin=PIPE, stdout=PIPE) 
    File "C:\Anaconda2\lib\subprocess.py", line 710, in __init__ 
    errread, errwrite) 
    File "C:\Anaconda2\lib\subprocess.py", line 958, in _execute_child 
    startupinfo) 
WindowsError: [Error 2] The system cannot find the file specified 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    ... 1 more 
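
The worker traceback shows `pipe` passing the command through `shlex.split` before starting the process. One possible cause of `WindowsError: [Error 2]`, offered here as an assumption rather than a confirmed diagnosis, is that `shlex.split` in its default POSIX mode treats each backslash as an escape character, so a Windows path such as the one returned by `SparkFiles.get` loses its separators. The sketch below uses a short hypothetical path, `C:\Temp\run.bat`, to show the effect.

```python
import shlex

# Hypothetical Windows path, standing in for the value returned by SparkFiles.get.
windows_path = r"C:\Temp\run.bat"

# Default POSIX mode: each backslash escapes the next character and is dropped.
mangled = shlex.split(windows_path)

# Forward slashes are ordinary characters and pass through unchanged.
forward = shlex.split(windows_path.replace("\\", "/"))

print(mangled)
print(forward)
```

If this is indeed the cause, passing the path with forward slashes to `pipe` may be worth trying. That workaround has not been verified against Spark 1.6.2 on Windows.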
+2

Windows batch files have a `.cmd` or `.bat` extension. Did you try including it?

+0

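@MCND Oh, silly me... The name has to include the extension. After adding `.bat` to `batchFile` and `batchFileName`, I no longer get the error saying the file does not exist. But I now get a different error, as shown in the update to the question.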

+0

`No FileSystem for scheme: D`, so the `D:` is not being handled as needed. (This may be a silly remark; I know batch files, but Java is not my area.) It needs a URI, something like `file:///D:/...`.
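
As an illustration of the URI form suggested in this comment, a Windows path can be converted to a `file:///` URI with the standard library instead of by hand. This is a minimal sketch that assumes Python 3 (`pathlib` is not part of the Python 2 installation used in the question); the helper name `to_file_uri` is hypothetical.

```python
from pathlib import PureWindowsPath


def to_file_uri(windows_path):
    """Convert an absolute Windows path to a file:/// URI."""
    return PureWindowsPath(windows_path).as_uri()


uri = to_file_uri("D:/spark-1.6.2-bin-hadoop2.6/data/OpenCV/runOpenCv.bat")
print(uri)
```

`PureWindowsPath` only manipulates the string, so the conversion also works on a non-Windows machine and does not check that the file exists.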

Answers

0

Escape the / as well:

batchFile = "D://spark-1.6.2-bin-hadoop2.6//data//OpenCV//runOpenCv"

Also, as suggested above, the file may have a .cmd or .bat extension.

+0

The escape character is `\`, so there is no need to escape `/`.