How can I generate large data using hive/spark-sql?

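Create a partitioned seed table:

create table seed (i int)
partitioned by (p int);

Populate the seed table with 1K records holding sequential numbers between 0 and 999. Each record is inserted into a different partition, so it is located in a different HDFS directory and, more importantly, in a different file.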

P.S. The following settings are needed:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.hadoop.supports.splittable.combineinputformat=false;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

insert into table seed partition (p)
select i,i
from (select 1) x lateral view posexplode (split (space (999),' ')) e as i,x;

Generate a table with 1G records.
Each of the 1K records in the seed table is in a different file and is read by a different container.
Each container generates 1M records.

create table t1g 
as 
select s.i*1000000 + e.i + 1 as n 
from seed s lateral view posexplode (split (space (1000000-1),' ')) e as i,x
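
A few sanity checks can confirm that the row-generation trick and the two tables behave as described. This is only a sketch: it assumes the seed and t1g tables were built exactly as above, and the aliases v, cnt, partitions, min_n and max_n are illustrative names that do not appear in the original answer.

```sql
-- Small-scale version of the row generator: space(4) is 4 spaces,
-- splitting on ' ' yields 5 empty strings, and posexplode numbers them 0..4.
select e.i
from (select 1) x lateral view posexplode (split (space (4),' ')) e as i,v;

-- The seed table should hold 1000 rows spread over 1000 partitions.
select count(*) as cnt, count(distinct p) as partitions
from seed;

-- t1g should hold 1,000,000,000 rows with n running from 1 to 1,000,000,000,
-- since n = s.i*1000000 + e.i + 1 with s.i in 0..999 and e.i in 0..999999.
select count(*) as cnt, min(n) as min_n, max(n) as max_n
from t1g;
```

The last query scans the full table, so it is worth running only once after the table is generated.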

Source

2017-03-05 13:39:28

Brilliant approach

@PraveenKumarKrishnaiyer - Thanks :-)
