How do I write modular, extensible code for arbitrary feature extraction for machine learning?

I have been working on a Python module that performs feature extraction, with the resulting features eventually consumed by a machine learning algorithm.

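My approach is to augment an initial gold-standard dataset with (hand-written) features and write out a new dataset, so that potentially expensive feature generation does not have to happen at training time. I believe this is standard for most datasets, which always ship with core features included (e.g. part-of-speech tags, named-entity tags, semantic labels).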

The dataset I am working with contains only tokenized sentences, formatted with XML tags. For example:

<s> 
    <lex begin='351' end='354'>The</lex> 
    <lex begin='355' end='361'>people</lex> 
    <lex begin='362' end='366'>here</lex> 
    <lex begin='367' end='370'>are</lex> 
    <lex begin='371' end='374'>far</lex> 
    <lex begin='375' end='384'>wealthier</lex> 
    <lex begin='384' end='385'>.</lex> 
</s>

I want to add information to each token: part of speech, NER, semantic labels, and so on.

I am using the Stanford NLP POS tagger and the Stanford NLP NER tagger. They are incredibly slow, but (hopefully) the lost speed buys more accurate POS and NER labels. I also throw in another parser to get semantic labels. Below is the same sentence, now augmented with features.

<s> 
    <lex ner='O' begin='351' end='354' pos='DT' label='None'>The</lex> 
    <lex CATEGORY='#ref-category PERSON' begin='355' end='361' 
     FORM='#ref-category COMMON-NOUN/PLURAL' ENDS-AT='#edges ending at 3' 
     CONSTITUENTS='NIL' USED-IN='NIL' Type='SPARSER::EDGE' LEFT-DAUGHTER='#word "people"' 
     pos='NNS' RULE='#PSR577 person - "people"' label='SPATIAL_ENTITY' 
     REFERENT='#people 1' POSITION-IN-RESOURCE-ARRAY='1' SPANNED-WORDS='NIL' 
     RIGHT-DAUGHTER=':SINGLE-TERM' ner='O' Class='#STRUCTURE-CLASS SPARSER::EDGE' 
     STARTS-AT='#edges starting at 2'>people</lex> 
    <lex CATEGORY='#ref-category DEICTIC-LOCATION' begin='362' end='366' 
     FORM='#ref-category PROPER-NOUN' ENDS-AT='#edges ending at 4' 
     CONSTITUENTS='NIL' USED-IN='NIL' Type='SPARSER::EDGE' LEFT-DAUGHTER='#word "here"' 
     pos='RB' RULE='#PSR271 deictic-location - "here"' label='PLACE' 
     REFERENT='#deictic-location "here" 3' POSITION-IN-RESOURCE-ARRAY='3' 
     SPANNED-WORDS='NIL' RIGHT-DAUGHTER=':SINGLE-TERM' ner='O' 
     Class='#STRUCTURE-CLASS SPARSER::EDGE' STARTS-AT='#edges starting at 3'>here</lex> 
    <lex CATEGORY='#ref-category BE' begin='367' end='370' 
     FORM='#ref-category VERB' ENDS-AT='#edges ending at 5' CONSTITUENTS='NIL' 
     USED-IN='NIL' Type='SPARSER::EDGE' LEFT-DAUGHTER='#word "are"' pos='VBP' 
     RULE='#PSR145 be - "are"' label='None' REFERENT='#be 1' 
     POSITION-IN-RESOURCE-ARRAY='4' SPANNED-WORDS='NIL' RIGHT-DAUGHTER=':SINGLE-TERM' 
     ner='O' Class='#STRUCTURE-CLASS SPARSER::EDGE' STARTS-AT='#edges starting at 4'>are</lex> 
    <lex CATEGORY='#word "far"' begin='371' end='374' 
     FORM='#ref-category SPATIAL-PREPOSITION' ENDS-AT='#edges ending at 6' 
     CONSTITUENTS='NIL' USED-IN='NIL' Type='SPARSER::EDGE' LEFT-DAUGHTER='#word "far"' 
     pos='RB' RULE='(5)' label='None' REFERENT='#word "far"' 
     POSITION-IN-RESOURCE-ARRAY='5' SPANNED-WORDS='NIL' RIGHT-DAUGHTER=':LITERAL-IN-A-RULE' 
     ner='O' Class='#STRUCTURE-CLASS SPARSER::EDGE' STARTS-AT='#edges starting at 5'>far</lex> 
    <lex ner='O' begin='375' end='384' pos='JJR' label='None'>wealthier</lex> 
    <lex begin='384' end='385'>.</lex> 
</s>

Obviously this is no longer human-readable, but that does not matter, since it only serves as input to a machine learning algorithm.

For my purposes, this step only needs to be done once. Less expensive features can be added right before training, for example whether a word is capitalized.
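
As an illustration of such a cheap, train-time feature, the sketch below computes capitalization on the fly. The function names and the dict layout of a token (a `'word'` key next to the precomputed attributes) are assumptions made for this example, not part of the module described here.

```python
def is_capitalized(word):
    """Cheap feature: True when the token starts with an uppercase letter."""
    return word[:1].isupper()


def add_cheap_features(token):
    """Return a copy of a token's attribute dict with on-the-fly features added.

    `token` is a hypothetical dict holding the precomputed attributes
    (pos, ner, ...) plus the surface form under the key 'word'.
    """
    enriched = dict(token)  # copy, so the stored dataset stays untouched
    enriched['capitalized'] = is_capitalized(token['word'])
    return enriched


tokens = [
    {'word': 'The', 'pos': 'DT', 'ner': 'O'},
    {'word': 'people', 'pos': 'NNS', 'ner': 'O'},
]
enriched_tokens = [add_cheap_features(t) for t in tokens]
```

Because the expensive attributes are read from the saved file and only the cheap ones are recomputed, a feature like this can be changed without rerunning the taggers.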

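However, my current solution is really awful, and I do not know how to refactor it so that someone else can later plug in their own hooks/features easily (for example, if a user wants to quickly add new features from some other parser). Here is my working solution: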

import re

# NOTE: this is Python 2 code. The helpers setup_newdir, mkparentdirs,
# Space_Document, Lex, binary_search, ledge and u2ascii, and the tagger
# objects pos, ner and p, are not shown in this excerpt.

xml_tokens_pattern = re.compile(r'<TOKENS>.+</TOKENS>', re.DOTALL)
sentence_pattern = re.compile(r'<s>.+?</s>', re.DOTALL)
lex_attrs_pattern = re.compile(r'(?<=<lex)[^>]+')


class Feature_Process(object):
    """Wrapper for adding features to xmls."""

    def __init__(self, xmls, golddir, newdir='', suffix='++',
                 feature_functions=None, renew=False, debug=False):
        self.xmls = xmls
        self.golddir = golddir
        self.newdir = newdir
        self.suffix = suffix
        # None instead of [] avoids sharing one mutable default list.
        self.feature_functions = feature_functions or []
        self.renew = renew
        self.debug = debug
        self.heavy = False  # set to True to run the slow taggers and parser

    def process(self):
        for xml in self.xmls:
            path = setup_newdir(xml, self.golddir, self.newdir,
                                self.suffix, self.renew)
            if not path:
                continue
            mkparentdirs(path)
            with open(xml, 'r') as oldfile:
                text = oldfile.read()
            doc = Space_Document(xml)
            tags = [tag for tag in doc.tags if 'start' in tag.attrib]
            new_text = text
            for (i, m) in enumerate(re.finditer(sentence_pattern, text)):
                sentence = doc.sentences[i]
                doc_lexes = sentence.getchildren()
                xml_sentence = m.group()
                # Map non-ASCII characters to ASCII before tagging.
                tokens = [''.join([c if ord(c) < 128 else u2ascii[c]
                                   for c in x.text]).encode('utf-8')
                          for x in doc_lexes]
                (pos_tags, ner_tags, edges) = ([], [], [])
                if self.heavy:
                    pos_tags = pos.tag(tokens)
                    ner_tags = ner.tag(tokens)
                    try:
                        if self.debug:
                            print(' '.join(tokens))
                        edges = p(' '.join(tokens), split=True)
                    except Exception:
                        # The parser failed on this sentence; go on without edges.
                        pass
                c = 0  # always 0: matched tags are consumed from the front
                for (j, n) in enumerate(re.finditer(lex_attrs_pattern,
                                                    xml_sentence)):
                    doc_lex = doc_lexes[j]
                    new_lex = Lex(doc_lex.text, doc_lex.attrib)
                    attributes = n.group()
                    # Find the gold annotation (if any) covering this token.
                    tag = binary_search((int(doc_lex.attrib['begin']),
                                         int(doc_lex.attrib['end']),
                                         doc_lex.text), tags)
                    label = 'None'
                    if tag is not None:
                        label = tag.tag
                    new_lex.add(('label', label))
                    new_lex.add(('word', new_lex.text.encode('utf-8')))
                    if tag is not None:
                        new_lex.addAll([(key, tag.attrib[key]) for key in tag.attrib])
                    if pos_tags:
                        if tokens[j] == pos_tags[c][0]:
                            new_lex.add(('pos', pos_tags[c][1]))
                            pos_tags.remove(pos_tags[c])
                    if ner_tags:  # this error case comes up for RFC/Durango.xml
                        if tokens[j] == ner_tags[c][0]:
                            new_lex.add(('ner', ner_tags[c][1]))
                            ner_tags.remove(ner_tags[c])
                    if edges:
                        sparser_edge = ledge(edges, tokens[j])
                        if sparser_edge:
                            if sparser_edge.keyvalues:
                                keyvalues = sparser_edge.keyvalues[sparser_edge.keyvalues.keys()[0]]
                                new_lex.addAll([(key, keyvalues[key]) for key in keyvalues])
                    # User-supplied hooks: each returns a (key, value) pair.
                    new_lex.addAll([function(new_lex) for function in self.feature_functions])
                    new_text = new_text.replace(attributes, str(new_lex))
            with open(path, 'w') as w:
                w.write(new_text + '\n')

2014-12-23 user3898238

First, if you are going to use XML as both source and target, trying to parse XML by hand is almost always a mistake. Use one of the several Python XML parsing libraries to produce a structure or stream that you can manipulate.
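
As a minimal sketch of that advice, the example below uses the standard library's `xml.etree.ElementTree` to parse one sentence, set a `pos` attribute on each `<lex>` element, and serialize the result. `annotate_sentence` and `toy_tagger` are hypothetical names; the toy tagger only stands in for a real POS tagger that returns (word, tag) pairs.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """<s>
    <lex begin='351' end='354'>The</lex>
    <lex begin='355' end='361'>people</lex>
    <lex begin='384' end='385'>.</lex>
</s>"""


def annotate_sentence(xml_text, tagger):
    """Parse one <s> element, set 'pos' on each <lex>, and re-serialize."""
    sentence = ET.fromstring(xml_text)
    lexes = sentence.findall('lex')
    words = [lex.text for lex in lexes]
    # The tagger sees the whole sentence and returns one pair per token.
    for lex, (_, tag) in zip(lexes, tagger(words)):
        lex.set('pos', tag)
    return ET.tostring(sentence, encoding='unicode')


def toy_tagger(words):
    """Hypothetical stand-in for a real POS tagger."""
    lookup = {'The': 'DT', 'people': 'NNS', '.': '.'}
    return [(w, lookup.get(w, 'UNK')) for w in words]


annotated = annotate_sentence(SENTENCE_XML, toy_tagger)
print(annotated)
```

No regular expressions or string replacement are involved: attributes are set on parsed elements, and the library handles quoting and escaping on output.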

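If your primary goal is to provide an API for adding lexical features that a classifier can interpret, I would recommend a clear separation between data-structure manipulation and serialization/deserialization. In a simple case like this one, though, that is not very hard.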
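
One possible shape for that separation is sketched below, under the assumption that a feature extractor is any callable that takes the sentence's token list and returns one dict of features per token. `Token`, `FeaturePipeline`, `load_sentence`, and `dump_sentence` are illustrative names, not an existing API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Token:
    """In-memory token: surface text plus a free-form feature dict."""
    text: str
    features: dict = field(default_factory=dict)


def load_sentence(xml_text):
    """Deserialization only: <s> XML string -> list of Token."""
    return [Token(lex.text, dict(lex.attrib))
            for lex in ET.fromstring(xml_text).findall('lex')]


def dump_sentence(tokens):
    """Serialization only: list of Token -> <s> XML string."""
    sentence = ET.Element('s')
    for token in tokens:
        lex = ET.SubElement(sentence, 'lex',
                            {k: str(v) for k, v in token.features.items()})
        lex.text = token.text
    return ET.tostring(sentence, encoding='unicode')


class FeaturePipeline:
    """Runs registered extractors; each maps a token list to one dict per token."""

    def __init__(self):
        self.extractors = []

    def register(self, extractor):
        self.extractors.append(extractor)
        return extractor  # lets register() double as a decorator

    def run(self, tokens):
        for extractor in self.extractors:
            for token, new_features in zip(tokens, extractor(tokens)):
                token.features.update(new_features)
        return tokens


pipeline = FeaturePipeline()


@pipeline.register
def capitalization(tokens):
    return [{'capitalized': t.text[:1].isupper()} for t in tokens]


@pipeline.register
def token_length(tokens):
    return [{'length': len(t.text)} for t in tokens]


tokens = pipeline.run(load_sentence(
    "<s><lex begin='351' end='354'>The</lex>"
    "<lex begin='355' end='361'>people</lex></s>"))
print(dump_sentence(tokens))
```

Extractors receive the whole sentence rather than a single token because taggers and parsers generally need sentence context. Adding features from another parser then means registering one more function, without touching the XML reading or writing code.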

2015-01-05 20:08:11
