2017-12-20

Here is my data:

Segment.organizationId|^|Segment.segmentId|^|SegmentType|^|SegmentName|^|SegmentName.languageId|^|SegmentLocalLanguageLabel|^|SegmentLocalLanguageLabel.languageId|^|ValidFromPeriodEndDate|^|ValidToPeriodEndDate|^|SegmentInactivationReasonCode|^|SegmentOrganizationId|^|IsShariaCompliant|^|IsCorporate|^|IsElimination|^|IsOther|^|InactiveReasonOtherDescription|^|InactiveReasonOtherDescription.languageId|^|IsOperatingSegment|^|SegmentFundbDescription|^|SegmentFundbDescription.languageId|^|SegmentTypeId|^|SegmentInactiveReasonId|^|FFAction|!| 
4295876080|^|7|^|B|^|Test ||^|505074|^|jtrsu|^|505126|^|2010-03-31T00:00:00Z|^||^||^||^|False|^|False|^|False|^|False|^||^|505074|^|False|^||^|505074|^|3013618|^||^|I|!| 

Here is my code:

val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialSegment/TEST") 

However, this does not give me the correct output. Here is my output:

+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+ 
|Segment_organizationId|Segment_segmentId|SegmentType|SegmentName|SegmentName_languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel_languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription_languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription_languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|DataPartition| 
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+ 
|   4295876080|    7|   B|  Test |      ^|      ^|         ^|      ^|     ^|       ^|     ^|    ^|   ^|   ^|  ^|        ^|          ^|     ^|      ^|         ^|   ^|      ^|  ^|  Japan| 
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+ 

I am getting this because the | character is also used inside the records.
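To illustrate the problem, here is a minimal sketch in plain Scala (no Spark required). It assumes a shortened version of the sample record, and the object name `SingleCharDelimiterDemo` is made up for this example. Splitting on the single character `|` yields stray `^` values similar to those in the output above, while splitting on the full `|^|` delimiter keeps the pipe inside `Test |` intact.

```scala
object SingleCharDelimiterDemo {
  // Shortened sample record using the same "|^|" field delimiter as the data above;
  // the fourth field ("Test |") itself ends with a pipe character.
  val record: String = "4295876080|^|7|^|B|^|Test ||^|505074"

  // Split on the single character "|" (the pipe must be escaped because split takes a regex).
  val bySingleChar: Array[String] = record.split("\\|")

  // Split on the full three-character delimiter "|^|".
  val byFullDelimiter: Array[String] = record.split("\\|\\^\\|")

  def main(args: Array[String]): Unit = {
    println(bySingleChar.mkString("[", "] [", "]"))
    println(byFullDelimiter.mkString("[", "] [", "]"))
  }
}
```

With the single-character split, the `^` characters become fields of their own and the pipe belonging to `Test |` produces an extra empty field.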

How can I handle this situation?

My expected output is:

+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+ 
|Segment.organizationId|Segment.segmentId|SegmentType|SegmentName|SegmentName.languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel.languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription.languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription.languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction| 
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+ 
|4295876080   |7    |B   |Test |  |505074    |jtrsu     |505126        |2010-03-31T00:00:00Z |     |        |      |False   |False  |False  |False |        |505074         |False    |      |505074       |3013618  |      |I  | 
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+ 

A multi-character delimiter is not supported in the option parameter in Spark SQL. –


@RameshMaharjan I have updated my question, please take a look. –


You need to handle the multi-character delimiter with sparkContext and then convert the rdd to a dataset or dataframe. –

Answers

3

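You cannot use a multi-character delimiter when reading this file into a dataframe with the CSV reader; it supports a delimiter of one character at most. So I suggest reading the file with sparkContext and splitting each line yourself, since split supports multi-character delimiters.

The first step is to read the file using sparkContext:

val rdd = sc.textFile("s3://trfsmallfffile/FinancialSegment/TEST") 

Then you need to create the schema by splitting the header line on the delimiter (the dots in the column names are replaced with underscores):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val header = rdd.filter(_.contains("Segment.organizationId")).map(line => line.split("\\|\\^\\|")).first() 
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq) 

The final step is to create the dataframe using the schema. Call show on a separate line, because show returns Unit and chaining it would leave data without the dataframe:

val data = sqlContext.createDataFrame(rdd.filter(!_.contains("Segment.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema) 
data.show(false) 

You should get the following output. Note that the record terminator |!| stays attached to the last column, because the lines are split only on |^|: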

+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+ 
|Segment_organizationId|Segment_segmentId|SegmentType|SegmentName|SegmentName_languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel_languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription_languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription_languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|!|| 
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+ 
|4295876080   |7    |B   |Test |  |505074    |jtrsu     |505126        |2010-03-31T00:00:00Z |     |        |      |False   |False  |False  |False |        |505074         |False    |      |505074       |3013618  |      |I|!|  | 
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
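In the output above, the record terminator `|!|` is still attached to the last column name and value. One possible way to remove it is sketched below in plain Scala. The object `RecordParser` and the helper `parseLine` are hypothetical names, not part of Spark, and the sketch assumes every line ends with `|!|`, optionally followed by whitespace. It also passes a limit of `-1` to `split`, so that trailing empty fields are kept rather than discarded.

```scala
object RecordParser {
  // Assumption: every record ends with the terminator "|!|", optionally followed by whitespace.
  private val RecordTerminator = "|!|"
  // Regex for the literal field delimiter "|^|".
  private val FieldDelimiter = "\\|\\^\\|"

  /** Hypothetical helper: splits one raw line into its fields. */
  def parseLine(line: String): Array[String] = {
    val trimmed = line.trim
    // Drop the trailing "|!|" so it does not end up in the last column.
    val body =
      if (trimmed.endsWith(RecordTerminator)) trimmed.dropRight(RecordTerminator.length)
      else trimmed
    // A limit of -1 keeps trailing empty fields, which split would otherwise discard.
    body.split(FieldDelimiter, -1)
  }

  def main(args: Array[String]): Unit = {
    // Shortened sample record in the same format as the data in the question.
    val sample = "4295876080|^|7|^|B|^|Test ||^|505074|^||^|I|!| "
    println(parseLine(sample).mkString("[", "] [", "]"))
  }
}
```

In the steps above, `RecordParser.parseLine(line)` could stand in for both `line.split("\\|\\^\\|")` calls, for example `Row.fromSeq(RecordParser.parseLine(line).toSeq)`. This has not been run against the original file, so treat it as a starting point.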