2012-08-25

Using Hive on Amazon Elastic MapReduce, I created a table, imported data, and partitioned it. Now I run a query that counts the most frequent words in one of the table's fields. Why doesn't increasing the number of instances increase the Hive query speed?

When I ran the query with 1 master instance and 2 core instances, it took 180 seconds to compute. I then reconfigured the cluster to 1 master and 10 cores, and it still took 180 seconds. Why isn't it faster?

I get almost identical output whether I run on 2 cores or on 10 cores:

Total MapReduce jobs = 2 
Launching Job 1 out of 2 

Number of reduce tasks not specified. Estimated from input data size: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
    set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
    set mapred.reduce.tasks=<number> 
Starting Job = job_201208251929_0003, Tracking URL = http://ip-10-120-250-34.ec2.internal:9100/jobdetails.jsp?jobid=job_201208251929_0003 
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.120.250.34:9001 -kill  job_201208251929_0003 
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 
2012-08-25 19:38:47,399 Stage-1 map = 0%, reduce = 0% 
2012-08-25 19:39:00,482 Stage-1 map = 3%, reduce = 0% 
2012-08-25 19:39:03,503 Stage-1 map = 5%, reduce = 0% 
2012-08-25 19:39:06,523 Stage-1 map = 10%, reduce = 0% 
2012-08-25 19:39:09,544 Stage-1 map = 18%, reduce = 0% 
2012-08-25 19:39:12,563 Stage-1 map = 24%, reduce = 0% 
2012-08-25 19:39:15,583 Stage-1 map = 35%, reduce = 0% 
2012-08-25 19:39:18,610 Stage-1 map = 45%, reduce = 0% 
2012-08-25 19:39:21,631 Stage-1 map = 53%, reduce = 0% 
2012-08-25 19:39:24,652 Stage-1 map = 67%, reduce = 0% 
2012-08-25 19:39:27,672 Stage-1 map = 75%, reduce = 0% 
2012-08-25 19:39:30,692 Stage-1 map = 89%, reduce = 0% 
2012-08-25 19:39:33,715 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 23.11 sec 
2012-08-25 19:39:34,723 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 23.11 sec 
2012-08-25 19:39:35,730 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 23.11 sec 
2012-08-25 19:39:36,802 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:37,810 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:38,819 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:39,827 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:40,835 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:41,845 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:42,856 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:43,865 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:44,873 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:45,882 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:46,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:47,900 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:48,908 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:49,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:50,924 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:51,934 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:52,942 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:53,950 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:54,958 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:55,967 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:56,976 Stage-1 map = 100%, reduce = 72%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:57,990 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 62.57 sec 
2012-08-25 19:39:59,001 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 62.57 sec 
2012-08-25 19:40:00,011 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 62.57 sec 
2012-08-25 19:40:01,022 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:02,031 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:03,041 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:04,051 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:05,060 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:06,070 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
2012-08-25 19:40:07,079 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 72.86 sec 
MapReduce Total cumulative CPU time: 1 minutes 12 seconds 860 msec 
Ended Job = job_201208251929_0003 
Counters: 
Launching Job 2 out of 2 
Number of reduce tasks determined at compile time: 1 
In order to change the average load for a reducer (in bytes): 
    set hive.exec.reducers.bytes.per.reducer=<number> 
In order to limit the maximum number of reducers: 
    set hive.exec.reducers.max=<number> 
In order to set a constant number of reducers: 
    set mapred.reduce.tasks=<number> 
Starting Job = job_201208251929_0004, Tracking URL = http://ip-10-120-250-34.ec2.internal:9100/jobdetails.jsp?jobid=job_201208251929_0004 
Kill Command = /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.120.250.34:9001 -kill  job_201208251929_0004 
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1 
2012-08-25 19:40:30,147 Stage-2 map = 0%, reduce = 0% 
2012-08-25 19:40:43,241 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:44,254 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:45,262 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:46,272 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:47,282 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:48,290 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:49,298 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:50,306 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:51,315 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:52,323 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:53,331 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:54,339 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:55,347 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:56,357 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:57,365 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 7.48 sec 
2012-08-25 19:40:58,374 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:40:59,384 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:00,393 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:01,407 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:02,420 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:03,431 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
2012-08-25 19:41:04,443 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 10.85 sec 
MapReduce Total cumulative CPU time: 10 seconds 850 msec 
Ended Job = job_201208251929_0004 
Counters: 
MapReduce Jobs Launched: 
Job 0: Map: 2 Reduce: 1 Accumulative CPU: 72.86 sec HDFS Read: 4920 HDFS Write: 8371374 SUCCESS 
Job 1: Map: 1 Reduce: 1 Accumulative CPU: 10.85 sec HDFS Read: 8371850 HDFS Write: 456 SUCCESS 
Total MapReduce CPU Time Spent: 1 minutes 23 seconds 710 msec 

Answers


You have a single reducer, and it is doing most of the work. I think that's the reason.


When I tried again, I configured 1 **large** master instance and 2 **large** core instances, and the job took 120 seconds, which is 60 seconds less than on the small instances. – keepkimi


Don't compare 120 to 180; you should compare roughly 120−60 against 180−60, where 60 is the job startup time. So (180−60)/(120−60) = 2: did you get a 2x speedup? –


Can you post the query? Certain things in Hive, such as ORDER BY, always pass through a single reducer, so they should be avoided when the result set is large. –
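For context on this comment: the two-job plan in the log (a GROUP BY job followed by a single-reducer job) is consistent with a frequent-words query along these lines. The table name `docs` and column name `txt` are assumptions, not from the original post; the final ORDER BY is exactly the clause that Hive funnels through one reducer:

```sql
-- Hypothetical frequent-words query; `docs` and `txt` are assumed names.
SELECT word, COUNT(*) AS freq
FROM (
  SELECT explode(split(txt, ' ')) AS word
  FROM docs
) w
GROUP BY word
ORDER BY freq DESC   -- global sort: Hive runs this through a single reducer
LIMIT 100;
```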


In my opinion, you should increase the number of reducers the query runs with. That is done with the following command, where n is the desired number of reducers:

set mapred.reduce.tasks=n; 


Then use a DISTRIBUTE BY or CLUSTER BY clause (not to be confused with CLUSTERED BY) to spread the dataset as evenly as possible across the reducers. If you don't need sorting, prefer DISTRIBUTE BY, since CLUSTER BY is just a shortcut for DISTRIBUTE BY plus SORT BY.

Here is a link to hive manual.
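Putting the two suggestions together, a minimal sketch might look like the following; the table name `word_counts` and the reducer count of 10 are assumptions for illustration:

```sql
-- Ask Hive for more reducers; 10 is an assumed value matching the 10-core cluster.
set mapred.reduce.tasks=10;

-- DISTRIBUTE BY hash-partitions rows across the reducers, and SORT BY
-- orders rows only within each reducer, so no single global reducer is needed.
SELECT word, freq
FROM word_counts
DISTRIBUTE BY word
SORT BY freq DESC;
```

Unlike ORDER BY, this produces output sorted per reducer rather than globally; if a single fully sorted result is required, a final ORDER BY over the already-aggregated (much smaller) data is usually cheap.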