Reading many files from the Hadoop MapReduce distributed cache

I have 10 files and one big file that is the combination of those 10 files.

I add them to the distributed cache through the job conf. When I read them in reduce, I observe the following:

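In the reduce method I read only selected files that were added to the distributed cache. Because each reduce reads a smaller file than it would if every reduce call read the big file, I expected this to be faster. However, it was slower.
Also, when I split the data into even smaller files and added those to the distributed cache, the problem got worse. The job itself only started running after a long delay.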

I am unable to find the reason. Please help.

2012-11-02 Mahalakshmi Lakshminarayanan

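I think your problem lies in reading the files in reduce(). You should read the files in configure() (using the old API) or setup() (using the new API). That way each reducer reads them only once, rather than once for every input group passed to the reducer (basically, once per call to the reduce method).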
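To make the difference concrete, here is a minimal plain-Java sketch, not Hadoop code. `loadLookup()` is a hypothetical stand-in for parsing a cached file, and the two loops stand in for the framework calling reduce() once per key group. All class and method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for a reducer task; no Hadoop classes are used. */
class CacheLoadSketch {

    /** Counts how often the simulated cached file is parsed. */
    static int loads = 0;

    /** Hypothetical stand-in for parsing a cached file into memory. */
    static Map<String, String> loadLookup() {
        loads++;
        Map<String, String> lookup = new HashMap<>();
        lookup.put("k1", "v1");
        lookup.put("k2", "v2");
        return lookup;
    }

    /** Pattern from the question: the file is parsed inside every reduce() call. */
    static int loadsWhenReadInReduce(List<String> keyGroups) {
        loads = 0;
        for (String key : keyGroups) {
            Map<String, String> lookup = loadLookup(); // repeated for each key group
            lookup.get(key);
        }
        return loads;
    }

    /** Pattern from the answer: parse once (as in setup()), then reuse the result. */
    static int loadsWhenReadInSetup(List<String> keyGroups) {
        loads = 0;
        Map<String, String> lookup = loadLookup(); // once per reducer task
        for (String key : keyGroups) {
            lookup.get(key);
        }
        return loads;
    }

    public static void main(String[] args) {
        List<String> keyGroups = Arrays.asList("k1", "k2", "k1", "k2", "k1");
        System.out.println("read in reduce(): " + loadsWhenReadInReduce(keyGroups) + " loads");
        System.out.println("read in setup():  " + loadsWhenReadInSetup(keyGroups) + " loads");
    }
}
```

With five key groups the first pattern parses the file five times and the second parses it once, so the per-call cost grows with the number of distinct keys a reducer receives.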

You can write something like the following. Using the NEW mapreduce API (org.apache.hadoop.mapreduce.*):

// Imports needed at the top of the enclosing file:
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

    ...
    Path file1;
    Path file2;
    ...

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {

        // Get the files from the distributed cache (looked up once per reducer task)
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        file1 = cacheFiles[0];
        file2 = cacheFiles[1];

        // Parse the files and keep their data in memory for use in the reduce method,
        // probably in some ArrayList or HashMap.
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        ...
    }
}

Using the OLD mapred API (org.apache.hadoop.mapred.*):

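// Imports needed at the top of the enclosing file:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    ...
    Path file1;
    Path file2;
    ...

    @Override
    public void configure(JobConf job) {
        try {
            // Get the files from the distributed cache (looked up once per reducer task)
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
            file1 = cacheFiles[0];
            file2 = cacheFiles[1];
            ...

            // Parse the files and keep their data in memory for use in the reduce method,
            // probably in some ArrayList or HashMap.
        } catch (IOException e) {
            // configure() cannot declare checked exceptions, so rethrow unchecked
            throw new RuntimeException("Could not read distributed cache files", e);
        }
    }

    @Override
    public synchronized void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        ...
    }
}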
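The "parse the files and keep their data in memory" step is left as a comment in both versions. One possible way to fill it in is sketched below, using only the Java standard library. The `LookupLoader` class and the tab-separated "key, value" line layout are assumptions for illustration; the original question does not say what the cached files contain.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: loads a local file into an in-memory lookup table. */
class LookupLoader {

    /**
     * Reads lines of the form "key<TAB>value" into a HashMap.
     * The tab-separated layout is an assumption, not something the question states.
     */
    static Map<String, String> load(String localPath) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that do not match the assumed layout
                }
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file, then load it once, as setup()/configure() would.
        Path sample = Files.createTempFile("lookup", ".tsv");
        Files.write(sample, Arrays.asList("k1\tv1", "k2\tv2"), StandardCharsets.UTF_8);
        Map<String, String> lookup = load(sample.toString());
        System.out.println(lookup.get("k1"));
        Files.delete(sample);
    }
}
```

Inside setup() or configure(), the call would presumably look like `lookup = LookupLoader.load(file1.toString());`, with `lookup` stored in a field so that reduce() only performs in-memory `get` calls.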

2012-11-02 21:01:17 Amar
