Cannot read a Hadoop sequence file through stdin with streaming Python map-reduce on AWS

I am trying to run a simple word-count map-reduce job on Amazon's Elastic Map Reduce, but the output is gibberish. The input file is one of the Common Crawl files, which are Hadoop sequence files. The file is supposed to contain the text extracted (stripped of HTML) from the crawled web pages.

My AWS Elastic MapReduce step looks like this:

Mapper: s3://com.gpanterov.scripts/mapper.py 
Reducer: s3://com.gpanterov.scripts/reducer.py 
Input S3 location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/textData-00112 
Output S3 location: s3://com.gpanterov.output/job3/

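The job runs successfully, but the output is gibberish: only strange symbols and no words at all. I am guessing this is because Hadoop sequence files cannot be read through standard input? If so, how do you run an MR job on such a file? Do the sequence files have to be converted to text files first?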

Here is my mapper:

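#!/usr/bin/env python

import sys

# Emit "<word>\t1" for every whitespace-separated token read from stdin.
for line in sys.stdin:
    words = line.split()
    for word in words:
        # A parenthesized single-argument print runs on both Python 2 and 3.
        print(word + "\t" + str(1))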

And my reducer:

#!/usr/bin/env python

import sys

def output(previous_key, total):
    # Print the final count for a key; nothing is printed before the first key.
    if previous_key is not None:
        print(previous_key + " was found " + str(total) + " times")

previous_key = None
total = 0

# Streaming delivers mapper output sorted by key, so equal keys are adjacent.
for line in sys.stdin:
    key, value = line.split("\t", 1)
    if key != previous_key:
        output(previous_key, total)
        previous_key = key
        total = 0
    total += int(value)

output(previous_key, total)

The first couple of lines of part-00000 look like this:

'\x00\x00\x87\xa0 was found 1 times\t\n' 
'\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x05\xc1=K\x02\x01\x00\x80a\xf0\xbc\xf3N\xbd\x0f\xaf\x145\xcdJ!#T\x94\x88ZD\x89\x027i\x08\x8a\x86\x16\x97lp0\x02\x87 was found 1 times\t\n'

There is nothing wrong with the input file. On my local machine I ran hadoop fs -text textData-00112 | less, and it returns plain text from web pages. Any input on how to run a Python streaming mapreduce job on this type of input file (Common Crawl Hadoop sequence files) would be much appreciated.

2014-01-19 gpanterov
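
The byte strings in the sample output hint at what is going on. Hadoop sequence files start with the three bytes "SEQ", and the bytes \x1f\x8b are the gzip magic number, so the mapper appears to be receiving the raw, compressed container rather than decoded text. The following is only a rough diagnostic sketch: the helper name describe is made up for illustration, and the sample is a fragment copied from the output above.

```python
SEQ_MAGIC = b"SEQ"        # first three bytes of a Hadoop sequence file
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def describe(raw):
    """Return hints about binary markers found in a byte string.

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    hints = []
    if raw.startswith(SEQ_MAGIC):
        hints.append("sequence-file header")
    if GZIP_MAGIC in raw:
        hints.append("gzip stream")
    return hints


# Fragment of the second line of part-00000 shown above.
sample = b"\x00\x00\x8e\x01:\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03"
print(describe(sample))  # ['gzip stream']
```

If a check like this fires on what the mapper receives, the job is most likely being fed the file as plain lines of bytes, which is what the default text input format of Hadoop streaming does.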

You need to pass SequenceFileAsTextInputFormat as the inputformat to the Hadoop streaming jar.

I have not used Amazon AWS MapReduce, but on a normal Hadoop installation it would be done like this:

HADOOP=$HADOOP_HOME/bin/hadoop
$HADOOP jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -input <input_directory> \
    -output <output_directory> \
    -mapper "mapper.py" \
    -reducer "reducer.py" \
    -inputformat SequenceFileAsTextInputFormat

2014-01-19 11:06:39 Sunny Nanda

Your suggestion solved the problem. After adding -inputformat SequenceFileAsTextInputFormat to the extra arguments box in the AWS Elastic MapReduce API, the job output is as expected.

2014-01-19 22:21:14 gpanterov
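
One follow-up detail that may matter once the input format is switched: with SequenceFileAsTextInputFormat, streaming normally hands each record to the mapper as the key, a tab, and the value, so the word count can end up including the keys. The sketch below assumes the keys in these Common Crawl files are page URLs and that they start with http:// or https://; neither assumption is confirmed in this thread, and words_from_record is a made-up helper name.

```python
def words_from_record(line):
    """Split one streaming record into words, dropping a leading URL key.

    Assumption (not confirmed in the thread): records arrive as
    "<url>\t<text>". Lines without such a key are counted in full.
    """
    key, sep, value = line.partition("\t")
    if sep and key.startswith(("http://", "https://")):
        line = value
    return line.split()


# Illustrative records; in a real mapper, iterate over sys.stdin instead.
sample_lines = [
    "http://example.com/\tHello hadoop world\n",
    "a line without a key\n",
]
for record in sample_lines:
    for word in words_from_record(record):
        print(word + "\t1")
```

Lines that do not start with a URL key are counted in full, which keeps the mapper usable if a page's text spans several lines.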
