2013-08-08
0

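R regex: splitting a POS-tagged text vector/factor into sentences

Please help me with my small project.

I have a text vector (or factor) whose elements contain sentences (actually a list of many such text elements). The text is POS-tagged. I need to split each element into separate elements, one per sentence.

I need to match every "./$. ", "!/$. ", "?/$. " and so on with some R function and store the result as a list of factors, where each element is a sentence.

Sample text & code: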

library(stringr) 

# Input vector/factor with "/$. " separated sentences 

r <- c("Ich/PPER habe/VAFIN meinen/PPOSAT Berkeley/NN jetzt/ADV seit/APPR 11/CARD Jahren/NN im/APPRART fast/ADV täglichen/ADJA Einsatz/NN ./$. In/APPR der/ART Schule/NN und/KON im/APPRART Studium/NN war/VAFIN der/ART Rucksack/NN meistens/ADV bis/APPR zum/APPRART bersten/ADJA mit/APPR Büchern/NN gefüllt/VVPP ,/$, jetzt/ADV benutze/VVFIN ich/PPER das/ART gute/ADJA Stück/NN auf/APPR dem/ART Weg/NN zur/APPRART Arbeit/NN !/$. Das/ART Volumen/NN -LRB-/TRUNC 30/CARD Liter/NN -RRB-/TRUNC ist/VAFIN enorm/ADJD und/KON lässt/VVFIN sich/PRF ,/$, dank/APPR der/ART Form/NN ,/$, besonders/ADV für/APPR Bücher/NN und/KON Schreibutensilien/NN ideal/ADJD nutzen/VVINF ./$.") 

# output: list of vectors/factors with the split sentences as list elements 
(r.listOfSent <- as.list(strsplit(as.character(r), "//$."))) 
> r.listOfSent 
[[1]] 
[1] "Ich/PPER habe/VAFIN meinen/PPOSAT Berkeley/NN jetzt/ADV seit/APPR 11/CARD Jahren/NN im/APPRART fast/ADV täglichen/ADJA Einsatz/NN ."                                                      
[2] " In/APPR der/ART Schule/NN und/KON im/APPRART Studium/NN war/VAFIN der/ART Rucksack/NN meistens/ADV bis/APPR zum/APPRART bersten/ADJA mit/APPR Büchern/NN gefüllt/VVPP ,/$, jetzt/ADV benutze/VVFIN ich/PPER das/ART gute/ADJA Stück/NN auf/APPR dem/ART Weg/NN zur/APPRART Arbeit/NN ."                 
[3] " Das/ART Volumen/NN -LRB-/TRUNC 30/CARD Liter/NN -RRB-/TRUNC ist/VAFIN enorm/ADJD und/KON lässt/VVFIN sich/PRF ,/$, dank/APPR der/ART Form/NN ,/$, besonders/ADV für/APPR Bücher/NN und/KON Schreibutensilien/NN ideal/ADJD nutzen/VVINF ." 

Answers

1

Is this what you need?

# input 
r <- "Ich/PPER habe/VAFIN meinen/PPOSAT Berkeley/NN jetzt/ADV seit/APPR 11/CARD Jahren/NN im/APPRART fast/ADV täglichen/ADJA Einsatz/NN ./$. In/APPR der/ART Schule/NN und/KON im/APPRART Studium/NN war/VAFIN der/ART Rucksack/NN meistens/ADV bis/APPR zum/APPRART bersten/ADJA mit/APPR Büchern/NN gefüllt/VVPP ,/$, jetzt/ADV benutze/VVFIN ich/PPER das/ART gute/ADJA Stück/NN auf/APPR dem/ART Weg/NN zur/APPRART Arbeit/NN !/$. Das/ART Volumen/NN -LRB-/TRUNC 30/CARD Liter/NN -RRB-/TRUNC ist/VAFIN enorm/ADJD und/KON lässt/VVFIN sich/PRF ,/$, dank/APPR der/ART Form/NN ,/$, besonders/ADV für/APPR Bücher/NN und/KON Schreibutensilien/NN ideal/ADJD nutzen/VVINF ./$." 

# split the string at the tags for commas, periods and other punctuation marks 
# really we're just splitting at every "$" plus the one character that follows it 
r.listOfSent <- unlist(strsplit(r, "\\$.")) 

# output 
[1] "Ich/PPER habe/VAFIN meinen/PPOSAT Berkeley/NN jetzt/ADV seit/APPR 11/CARD Jahren/NN im/APPRART fast/ADV täglichen/ADJA Einsatz/NN ./"          
[2] " In/APPR der/ART Schule/NN und/KON im/APPRART Studium/NN war/VAFIN der/ART Rucksack/NN meistens/ADV bis/APPR zum/APPRART bersten/ADJA mit/APPR Büchern/NN gefüllt/VVPP ,/" 
[3] " jetzt/ADV benutze/VVFIN ich/PPER das/ART gute/ADJA Stück/NN auf/APPR dem/ART Weg/NN zur/APPRART Arbeit/NN !/"                
[4] " Das/ART Volumen/NN -LRB-/TRUNC 30/CARD Liter/NN -RRB-/TRUNC ist/VAFIN enorm/ADJD und/KON lässt/VVFIN sich/PRF ,/"               
[5] " dank/APPR der/ART Form/NN ,/"                                    
[6] " besonders/ADV für/APPR Bücher/NN und/KON Schreibutensilien/NN ideal/ADJD nutzen/VVINF ./" 
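
The extra splits after the commas come from the unescaped `.` in the pattern, which matches any character, so `",/$,"` is hit as well as `"./$."`. A minimal sketch of what each pattern actually matches, using a shortened, made-up sample string `x` in the same token/TAG format (the sample is an assumption, not data from the question):

```r
# Hypothetical short sample in the same token/TAG format as the question
x <- "Ja/ADV ,/$, klar/ADJD !/$. Gut/ADJD ./$."

# Unescaped ".": a literal "$" followed by ANY character,
# so the comma tag "$," is matched along with "$."
loose <- regmatches(x, gregexpr("\\$.", x))[[1]]

# Escaped ".": only the sentence-final tag "/$." is matched
strict <- regmatches(x, gregexpr("/\\$\\.", x))[[1]]

print(loose)
print(strict)
```

Inspecting the matches with `regmatches()` and `gregexpr()` before calling `strsplit()` is a cheap way to check that a pattern hits only the intended delimiters.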

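If that's not right, please edit your question to show a sample of the desired output.

UPDATE: Thanks to some clarifying comments from alex and Blue Magister, here is how to produce the desired output: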

# split the string on the literal "/$." 
r.listOfSent <- strsplit(r, "/$.", fixed=TRUE) 

# which gives 
[[1]] 
[1] "Ich/PPER habe/VAFIN meinen/PPOSAT Berkeley/NN jetzt/ADV seit/APPR 11/CARD Jahren/NN im/APPRART fast/ADV täglichen/ADJA Einsatz/NN ."                                      
[2] " In/APPR der/ART Schule/NN und/KON im/APPRART Studium/NN war/VAFIN der/ART Rucksack/NN meistens/ADV bis/APPR zum/APPRART bersten/ADJA mit/APPR Büchern/NN gefüllt/VVPP ,/$, jetzt/ADV benutze/VVFIN ich/PPER das/ART gute/ADJA Stück/NN auf/APPR dem/ART Weg/NN zur/APPRART Arbeit/NN !" 
[3] " Das/ART Volumen/NN -LRB-/TRUNC 30/CARD Liter/NN -RRB-/TRUNC ist/VAFIN enorm/ADJD und/KON lässt/VVFIN sich/PRF ,/$, dank/APPR der/ART Form/NN ,/$, besonders/ADV für/APPR Bücher/NN und/KON Schreibutensilien/NN ideal/ADJD nutzen/VVINF ." 
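
The question mentions a whole list of text elements, possibly stored as factors, and the pieces above still carry a leading space. One possible way to wrap the split for that case is sketched below; the helper name `split_sentences` and the short sample input are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: split every element of a character vector or factor
# at the literal sentence tag "/$." and trim surrounding whitespace.
split_sentences <- function(x) {
  pieces <- strsplit(as.character(x), "/$.", fixed = TRUE)
  lapply(pieces, trimws)
}

# Made-up sample input with two text elements, stored as a factor
texts <- factor(c("Ja/ADV ,/$, klar/ADJD !/$. Gut/ADJD ./$.",
                  "Wirklich/ADV ?/$. Ja/ADV ./$."))

res <- split_sentences(texts)
print(res)
```

The result is a list with one character vector of sentences per input element. If factors are really needed, something like `lapply(res, factor)` could be applied afterwards.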
+0

The '.' in your pattern is being treated as "any character" rather than the literal period you want. I suggest 'strsplit(r, "/\\$\\.")' or 'strsplit(r, "/$.", fixed = TRUE)'. –
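
As a quick check of the two calls suggested in this comment, the sketch below compares the escaped regex with the `fixed = TRUE` variant on a shortened, made-up sample (the string `x` is an assumption, not the question's data):

```r
# Hypothetical short sample in the same token/TAG format
x <- "Ja/ADV ,/$, klar/ADJD !/$. Wirklich/ADV ?/$. Gut/ADJD ./$."

# Escaped regex: "$" and "." are matched literally
by_regex <- strsplit(x, "/\\$\\.")

# fixed = TRUE: the whole pattern is taken as a literal string
by_fixed <- strsplit(x, "/$.", fixed = TRUE)

print(identical(by_regex, by_fixed))
print(by_fixed[[1]])
```

Both variants split after `!` and `?` as well as after `.`, because all three carry the same `$.` tag, while the comma tag `$,` is left alone.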

+0

I'm not sure about that, because of how splitting on the literal '.' would handle ',', '!' and '?'; I think that is what the question is asking for ... – Ben

+0

@BlueMagister 'strsplit(r, "/$.", fixed = TRUE)' works :) Thank you! – alex