이 Damerau-Levenshtein 구현에서 버그를 수정하는 방법은 무엇입니까?

또 다른 오랜 질문으로 돌아 왔습니다. Python에 기반을 둔 Damerau-Levenshtein 편집 거리 구현을 실험 해본 결과, I finally found the one listed below은 editdistance_reference()입니다. 그것 올바른 결과를 제공하는 것 같습니다 효율적인 구현을 가지고 나타납니다.이 Damerau-Levenshtein 구현에서 버그를 수정하는 방법은 무엇입니까?

그래서 코드를 Cython으로 변환하도록 설정했습니다. 내 테스트 데이터에서 참조 메서드는 결과를 12,000자를 사용하여 11,000 건의 비교를 위해 을 전달하는 반면, Cythonized 메서드는 초 당 200,000 건의 비교를 수행합니다. 안타깝게도 결과가 잘못되었습니다. 디버그 용으로 인쇄 할 변수 thisrow 을 보면, 참조 출력이 다른 그림을 표시하는 동안에 어떤 데이터를 던져도 내 버전이 채워져 있습니다. 예를 들어, 'world' 대해 'helo' 테스트를 생성하는 다음과 같은 출력 (ED가 EDR가 올바르게 작동 기준 인 제 기능을 표시한다) :

editdistance() 가입일 :

#ED A [0, 0, 0, 0, 0, 1] 
#ED B [1, 0, 0, 0, 0, 1] 
#ED B [1, 1, 0, 0, 0, 1] 
#ED B [1, 1, 1, 0, 0, 1] 
#ED B [1, 1, 1, 1, 0, 1] 
#ED B [1, 1, 1, 1, 1, 1] 

#ED A [0, 0, 0, 0, 0, 2] 
#ED B [1, 0, 0, 0, 0, 2] 
#ED B [1, 1, 0, 0, 0, 2] 
#ED B [1, 1, 1, 0, 0, 2] 
#ED B [1, 1, 1, 1, 0, 2] 
#ED B [1, 1, 1, 1, 1, 2] 

#ED A [0, 0, 0, 0, 0, 3] 
#ED B [1, 0, 0, 0, 0, 3] 
#ED B [1, 1, 0, 0, 0, 3] 
#ED B [1, 1, 1, 0, 0, 3] 
#ED B [1, 1, 1, 1, 0, 3] 
#ED B [1, 1, 1, 1, 1, 3] 

#ED A [0, 0, 0, 0, 0, 4] 
#ED B [1, 0, 0, 0, 0, 4] 
#ED B [1, 1, 0, 0, 0, 4] 
#ED B [1, 1, 1, 0, 0, 4] 
#ED B [1, 1, 1, 1, 0, 4] 
#ED B [1, 1, 1, 1, 1, 4]

editdistance_reference()에서 :

#EDR A [0, 0, 0, 0, 0, 1] 
#EDR B [1, 0, 0, 0, 0, 1] 
#EDR B [1, 2, 0, 0, 0, 1] 
#EDR B [1, 2, 3, 0, 0, 1] 
#EDR B [1, 2, 3, 4, 0, 1] 
#EDR B [1, 2, 3, 4, 5, 1] 

#EDR A [0, 0, 0, 0, 0, 2] 
#EDR B [2, 0, 0, 0, 0, 2] 
#EDR B [2, 2, 0, 0, 0, 2] 
#EDR B [2, 2, 3, 0, 0, 2] 
#EDR B [2, 2, 3, 4, 0, 2] 
#EDR B [2, 2, 3, 4, 5, 2] 

#EDR A [0, 0, 0, 0, 0, 3] 
#EDR B [3, 0, 0, 0, 0, 3] 
#EDR B [3, 3, 0, 0, 0, 3] 
#EDR B [3, 3, 3, 0, 0, 3] 
#EDR B [3, 3, 3, 3, 0, 3] 
#EDR B [3, 3, 3, 3, 4, 3] 

#EDR A [0, 0, 0, 0, 0, 4] 
#EDR B [4, 0, 0, 0, 0, 4] 
#EDR B [4, 4, 0, 0, 0, 4] 
#EDR B [4, 4, 4, 0, 0, 4] 
#EDR B [4, 4, 4, 4, 0, 4] 
#EDR B [4, 4, 4, 4, 4, 4]

오류가 아마도 매우 명백한 것들 중 하나이기 때문에 나는 매우 바보가되어야합니다. 하지만 나는 그것을 찾을 수없는 것 같습니다.

두 번째 문제가 : 나는 malloc 세 배열 twoago, oneago를위한 공간, thisrow, 다음 그들이 순환 방식으로 주변에 교환받을. free(twoago) 등을 시도 할 때 줄이 생기고 glibc는 double free or corruption에 대해 불평합니다. 내가 그걸 봤 거든. 포인터 스와핑 비즈니스가 glibc를 약간 어지럽게 만들 수 있으므로 메모리를 올바르게 확보 할 수 없게 될 수 있습니까? 나는 아래

먼저 컴파일 (/path/to/python3.1 ./setup.py build_ext --inplace), 다음 편집 거리 코드는 적절한, 그래서 관심 사람들이 쉽게 복제 할 찾기를 할 필요가있는 setup.py 나열합니다.

한 가지 더 : 이것은 Python3.1에서 실행됩니다. 하나의 재미있는 점은 *.pyx 파일 안에 베어 코드가 있지만, print은 여전히 함수가 아니라 명세서라는 것입니다.

네, 여기에 붙여 넣을 코드가 많이 있다는 것을 압니다. 그러나 코드는 단순히 실행 가능하지 않습니다. 너무 많이 잘라냅니다. 나는 editdistance()을 제외한 모든 방법이 올바르게 작동한다고 믿지만, 당신이 느끼는 모든 문제를 지적 해 주시기 바랍니다.

setup.py :

from distutils.core import setup 
from distutils.extension import Extension 
from Cython.Distutils import build_ext 

setup(
    name   = 'cython_dameraulevenshtein', 
    ext_modules  = [ 
    Extension('cython_dameraulevenshtein', [ 'cython_dameraulevenshtein.pyx', ]), ], 
    cmdclass  = { 
    'build_ext': build_ext },)

cython_dameraulevenshtein.pyx

가 (재미있는 물건을보기 위해 끝까지 모든 방법을 스크롤) :

############################################################################################################ 
cdef extern from "stdlib.h": 
    ctypedef unsigned int size_t 
    void  *malloc(size_t size) 
    void  *realloc(void *ptr, size_t size) 
    void  free(void *ptr) 

#----------------------------------------------------------------------------------------------------------- 
cdef inline unsigned int _minimum_of_two_uints(unsigned int a, unsigned int b): 
    if a < b: return a 
    return b 

#----------------------------------------------------------------------------------------------------------- 
cdef inline unsigned int _minimum_of_three_uints(unsigned int a, unsigned int b, unsigned int c): 
    if a < b: 
    if c < a: 
     return c 
    return a 
    if c < b: 
    return c 
    return b 

#----------------------------------------------------------------------------------------------------------- 
cdef inline int _warp(unsigned int limit, int value): 
    return value if value >= 0 else limit + value 

############################################################################################################ 
# ARRAYS THAT SAY SIZE ;-) 
#----------------------------------------------------------------------------------------------------------- 
cdef class Array_of_unsigned_int: 
    cdef unsigned int *data 
    cdef unsigned int length 

    #--------------------------------------------------------------------------------------------------------- 
    def __cinit__(self, unsigned int length, fill_value = None): 
    self.length = length 
    self.data = <unsigned int *>malloc(length * sizeof(unsigned int)) ###OBS### must check malloc doesn't return NULL pointer 
    if fill_value is not None: 
     self.fill(fill_value) 

    #--------------------------------------------------------------------------------------------------------- 
    cdef fill(self, unsigned int value): 
    cdef unsigned int idx 
    cdef unsigned int *d = self.data 
    for idx from 0 <= idx < self.length: 
     d[ idx ] = value 

    #--------------------------------------------------------------------------------------------------------- 
    cdef resize(self, unsigned int length): 
    self.data = <unsigned int *>realloc(self.data, length * sizeof(unsigned int)) ###OBS### must check realloc doesn't return NULL pointer 
    self.length = length 

    #--------------------------------------------------------------------------------------------------------- 
    def free(self): 
    """Always remember the milk: Free up memory.""" 
    free(self.data) ###OBS### should free memory here 

    #--------------------------------------------------------------------------------------------------------- 
    def as_list(self): 
    """Return the array as a Python list.""" 
    R      = [] 
    cdef unsigned int idx 
    cdef unsigned int *d = self.data 
    for idx from 0 <= idx < self.length: 
     R.append(d[ idx ]) 
    return R 


############################################################################################################ 
# CONVERTING UNICODE TO CHARACTER IDs (CIDs) 
#--------------------------------------------------------------------------------------------------------- 
cdef unsigned int _UMX_surrogate_lower_bound = 0x10000 
cdef unsigned int _UMX_surrogate_upper_bound = 0x10ffff 
cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800 
cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff 
cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00 
cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff 
cdef unsigned int _UMX_surrogate_foobar_factor = 0x400 

#--------------------------------------------------------------------------------------------------------- 
cdef Array_of_unsigned_int _cids_from_text(text): 
    """Givn a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of 
    ``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of 
    each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of 
    the array returned may be less than ``len(text)``.""" 
    #......................................................................................................... 
    # Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing: 
    is_bytes = isinstance(text, (bytes, bytearray,)) 
    assert is_bytes or isinstance(text, str), '#121' 
    #......................................................................................................... 
    # Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as 
    # there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode 
    # text, there may be fewer CIDs if there happen to be surrogate characters): 
    cdef unsigned int   length = <unsigned int>len(text) 
    cdef Array_of_unsigned_int R  = Array_of_unsigned_int(length) 
    #......................................................................................................... 
    # If ``text`` is empty, we can return an empty array right away: 
    if length == 0: return R 
    #......................................................................................................... 
    # Otherwise, prepare to copy data: 
    cdef unsigned int idx    = 0 
    #......................................................................................................... 
    # If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte 
    # values and are done: 
    if is_bytes: 
    for idx from 0 <= idx < length: 
     R.data[ idx ] = <unsigned int>text[ idx ] 
    return R 
    #......................................................................................................... 
    cdef unsigned int cid    = 0 
    cdef bool   is_surrogate  = False 
    cdef unsigned int hi    = 0 
    cdef unsigned int lo    = 0 
    cdef unsigned int chr_count   = 0 
    #......................................................................................................... 
    # Iterate over all indexes in text: 
    for idx from 0 <= idx < length: 
    #....................................................................................................... 
    # If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the 
    # corresponding low CID is on the current position. Having both, we can compute the intended CID 
    # and reset the flag: 
    if is_surrogate: 
     lo = <unsigned int>ord(text[ idx ]) 
     # IIRC, this formula was documented in Unicode 3: 
     cid = ((hi - _UMX_surrogate_hi_lower_bound) * _UMX_surrogate_foobar_factor 
      + (lo - _UMX_surrogate_lo_lower_bound) + _UMX_surrogate_lower_bound) 
     is_surrogate = False 
    #....................................................................................................... 
    else: 
     # Otherwise, we retrieve the CID from the current position: 
     cid = <unsigned int>ord(text[ idx ]) 
     #..................................................................................................... 
     if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound: 
     # If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back 
     # in the next cycle: 
     hi    = cid 
     is_surrogate  = True 
     continue 
    #....................................................................................................... 
    R.data[ chr_count ] = cid 
    chr_count  += 1 
    #......................................................................................................... 
    # Surrogate CIDs take up two characters but end up as a single resultant CID, so the return value may 
    # have fewer elements than the naive string length indicated; in this case, we want to free some memory 
    # and correct array length data: 
    if chr_count != length: 
    R.resize(chr_count) 
    #......................................................................................................... 
    return R 

#--------------------------------------------------------------------------------------------------------- 
def cids_from_text(text): 
    cdef Array_of_unsigned_int c_R =_cids_from_text(text) 
    R        = c_R.as_list() 
    c_R.free() ###OBS### should free memory here 
    return R 


############################################################################################################ 
# SECOND-ORDER SIMILARITY 
#----------------------------------------------------------------------------------------------------------- 
cpdef float similarity(char *a, char *b): 
    """Given two byte strings ``a`` and ``b``, return their Damerau-Levenshtein similarity as a float between 
    0.0 and 1.1. Similarity is computed as ``1 - relative_editdistance(a, b)``, so a result of ``1.0`` 
    indicates identity, while ``0.0`` indicates complete dissimilarity.""" 
    return 1.0 - relative_editdistance(a, b) 

#----------------------------------------------------------------------------------------------------------- 
cpdef float relative_editdistance(char *a, char *b): 
    """Given two byte strings ``a`` and ``b``, return their relative Damerau-Levenshtein distance. The return 
    value is a float between 0.0 and 1.0; it is calculated as the absolute edit distance, divided by the 
    length of the longer string. Therefore, ``0.0`` indicates identity, while ``1.0`` indicates complete 
    dissimilarity.""" 
    cdef int length = max(len(a), len(b)) 
    if length == 0: return 0.0 
    return editdistance(a, b)/<float>length 

############################################################################################################ 
# EDIT DISTANCE 
#----------------------------------------------------------------------------------------------------------- 
cpdef unsigned int editdistance(text_a, text_b): 
    """Given texts as Unicode strings or ``bytes``/``bytearray`` objects, return their absolute 
    Damerau-Levenshtein distance. Each deletion, insertion, substitution, and transposition is counted as one 
    difference, so the edit distance between ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is 
    ``1``.""" 
    #......................................................................................................... 
    # This should be fast in Python, as it can (and probably is) implemented by doing an identity check in 
    # the case of ``bytes`` and ``str`` objects: 
    if text_a == text_b: return 0 
    #......................................................................................................... 
    # Convert Unicode text to C array of unsigned integers: 
    cdef Array_of_unsigned_int a = _cids_from_text(text_a) 
    cdef Array_of_unsigned_int b = _cids_from_text(text_b) 
    R        = c_editdistance(a, b) 
    #......................................................................................................... 
    # Always remember the milk: 
    a.free() 
    b.free() 
    #......................................................................................................... 
    return R 

#----------------------------------------------------------------------------------------------------------- 
cdef unsigned int c_editdistance(Array_of_unsigned_int cids_a, Array_of_unsigned_int cids_b): 
    # Conceptually, this is based on a len(a) + 1 * len(b) + 1 matrix. 
    # However, only the current and two previous rows are needed at once, 
    # so we only store those. 
    #......................................................................................................... 
    # This shortcut is pretty useless if comparison is not very fast; therefore, it is done in the function 
    # that deals with the Python objects, q.v. 
    # if cids_a.equals(cids_b): return 0 
    #......................................................................................................... 
    cdef unsigned int a_length   = cids_a.length 
    cdef unsigned int b_length   = cids_b.length 
    #......................................................................................................... 
    # Another shortcut: if one of the texts is empty, then the edit distance is trivially the length of the 
    # other text. This also works for two empty texts, but those have already been taken care of by the 
    # previous shortcut: 
    #......................................................................................................... 
    if a_length == 0: return b_length 
    if b_length == 0: return a_length 
    #......................................................................................................... 
    cdef unsigned int row_length   = b_length + 1 
    cdef unsigned int row_length_1  = row_length - 1 
    cdef unsigned int row_bytecount  = sizeof(unsigned int) * row_length 
    cdef unsigned int *oneago    = <unsigned int *>malloc(row_bytecount) ###OBS### must check malloc doesn't return NULL pointer 
    cdef unsigned int *twoago    = <unsigned int *>malloc(row_bytecount) ###OBS### must check malloc doesn't return NULL pointer 
    cdef unsigned int *thisrow   = <unsigned int *>malloc(row_bytecount) ###OBS### must check malloc doesn't return NULL pointer 
    cdef unsigned int idx     = 0 
    cdef unsigned int idx_a    = 0 
    cdef unsigned int idx_b    = 0 
    cdef   int idx_a_1_text  = 0 
    cdef   int idx_b_1_row   = 0 
    cdef   int idx_b_2_row   = 0 
    cdef   int idx_b_1_text  = 0 
    cdef unsigned int deletion_cost  = 0 
    cdef unsigned int addition_cost  = 0 
    cdef unsigned int substitution_cost = 0 
    #......................................................................................................... 
    # Equivalent of ``thisrow = list(range(1, b_length + 1)) + [ 0 ]``: 
    #print('#305', cids_a.as_list(), cids_b.as_list(), a_length, b_length, row_length, row_length_1) 
    for idx from 1 <= idx < row_length: 
    thisrow[ idx - 1 ] = idx 
    thisrow[ row_length - 1 ] = 0 
    #......................................................................................................... 
    for idx_a from 0 <= idx_a < a_length: 
    idx_a_1_text  = _warp( a_length, idx_a - 1) 
    twoago, oneago = oneago, thisrow 
    #....................................................................................................... 
    # Equivalent of ``thisrow = [ 0 ] * b_length + [ idx_a + 1 ]``: 
    for idx from 0 <= idx < row_length_1: 
     thisrow[ idx ] = 0 
    thisrow[ row_length - 1 ] = idx_a + 1 
    #....................................................................................................... 
    # some diagnostic output: 
    x = [] 
    for idx from 0 <= idx < row_length: x.append(thisrow[ idx ]) 
    print 
    print '#ED A', x 
    #....................................................................................................... 
    for idx_b from 0 <= idx_b < b_length: 
     #..................................................................................................... 
     idx_b_1_row  = _warp(row_length, idx_b - 1) 
     idx_b_1_text  = _warp( b_length, idx_b - 1) 
     #..................................................................................................... 
     assert 0 <= idx_b_1_row < row_length, ('#323', idx_b_1_row,) 
     assert 0 <= idx_a_1_text < a_length, ('#324', idx_a_1_text,) 
     assert 0 <= idx_b_1_text < b_length, ('#325', idx_b_1_text,) 
     #..................................................................................................... 
     deletion_cost  = oneago[ idx_b  ] + 1 
     addition_cost  = thisrow[ idx_b_1_row ] + 1 
     substitution_cost = oneago[ idx_b_1_row ] + (1 if cids_a.data[ idx_a ] 
                  != cids_b.data[ idx_b ] else 0) 
     thisrow[ idx_b ] = _minimum_of_three_uints(deletion_cost, addition_cost, substitution_cost) 
     #..................................................................................................... 
     # Transpositions: 
     if ( idx_a > 0 
     and idx_b > 0 
     and cids_a.data[ idx_a  ] == cids_b.data[ idx_b_1_text ] 
     and cids_a.data[ idx_a_1_text ] == cids_b.data[ idx_b  ] 
     and cids_a.data[ idx_a  ] != cids_b.data[ idx_b  ]): 
     #................................................................................................... 
     idx_b_2_row  = _warp(row_length, idx_b - 2) 
     assert 0 <= idx_b_2_row < row_length, ('#340', idx_b_2_row,) 
     thisrow[ idx_b ] = _minimum_of_two_uints(thisrow[ idx_b ], twoago[ idx_b_2_row ] + 1) 
     #..................................................................................................... 
     # some diagnostic output: 
     x = [] 
     for idx from 0 <= idx < row_length: x.append(thisrow[ idx ]) 
     print '#ED B', x 
    #......................................................................................................... 
    # Here, ``b_length - 1`` can't become negative, since we already tested for ``b_length == 0`` in the 
    # shortcut above: 
    cdef unsigned int R = thisrow[ b_length - 1 ] 
    #......................................................................................................... 
    # Always remember the milk: 
    # BUG: Activating below lines leads to glibc failing with ``double free or corruption`` 
    #free(twoago) 
    #free(oneago) 
    #free(thisrow)e 
    #......................................................................................................... 
    return R 

#----------------------------------------------------------------------------------------------------------- 
def editdistance_reference(text_a, text_b): 
    """This method is believed to compute a correct Damerau-Levenshtein edit distance, with deletions, 
    insertions, substitutions, and transpositions. Do not touch it; it is here to validate results returned 
    from the above method. Code adapted from 
    http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance""" 
    # Conceptually, the implementation is based on a ``(len(seq1) + 1) * (len(seq2) + 1)`` matrix. 
    # However, only the current and two previous rows are needed at once, so we only store those. Python 
    # lists wrap around for negative indices, so we put the leftmost column at the *end* of the list. This 
    # matches with the zero-indexed strings and saves extra calculation. 
    b_length = len(text_b) 
    oneago = None 
    thisrow = list(range(1, b_length + 1)) + [ 0 ] 
    for idx_a in range(len(text_a)): 
    twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ] 
    #....................................................................................................... 
    # some diagnostic output: 
    print 
    print '#EDR A', thisrow 
    #....................................................................................................... 
    for idx_b in range(b_length): 
     deletion_cost  = oneago[ idx_b  ] + 1 
     addition_cost  = thisrow[ idx_b - 1 ] + 1 
     substitution_cost = oneago[ idx_b - 1 ] + (text_a[ idx_a ] != text_b[ idx_b ]) 
     thisrow[ idx_b ] = min(deletion_cost, addition_cost, substitution_cost) 
     if ( idx_a > 0 
     and idx_b > 0 
     and text_a[ idx_a  ] == text_b[ idx_b - 1 ] 
     and text_a[ idx_a - 1 ] == text_b[ idx_b  ] 
     and text_a[ idx_a  ] != text_b[ idx_b  ]): 
     thisrow[ idx_b ] = min(thisrow[ idx_b ], twoago[ idx_b - 2 ] + 1) 
     #..................................................................................................... 
     # some diagnostic output: 
     print '#EDR B', thisrow 
     #..................................................................................................... 
    return thisrow[ len(text_b) - 1 ]

편집는 또한 pastebin이 텍스트와 사이 썬 목록을 게시 .

출처

2010-08-07 flow

일부 기본 디버깅을 수행합니다. #ED B이라고 표시된 두 번째 출력 줄에서 잘못되었다는 것을 알고 계십니다. 잘못된 값은 조기에 하나의 편집을 발견하고 더 이상 찾지 못함을 나타냅니다. 이것은 min() args 중 하나가 어떻게 든 1에 고정되어 있기 때문일 수 있습니다. deletion_cost, substitution_cost, addition_cost ... 잘못된 것인가? 왜 틀렸어? 입력 텍스트 값을 인쇄하십시오. 트랜스 포지션 섹션을 일시적으로 비활성화하여 문제가 없어지는지 확인하십시오. _warp caper (내가 본 적이 있다면 속임수가있는 호빗 요술)와 그 사용법을 확인하고 다시 확인하십시오. "aaaaa"와 "aaaaa"를 비교하면 어떻게됩니까? "쿼티"와 "쿼티"? "xxxxx"와 "yyyyy"? 문제가 bytes, bytearray 및 str 입력으로 모두 발생합니까?

무료 문제 : 나는 현기증이 아닌 부패를 의심합니다. 세 개의 배열을 인쇄하십시오. 그 내용이 예상대로입니까? 한 번에 하나의 배열 free()을 활성화 해보십시오 - 모두 고장 났습니까? 단 하나? 어느 것?

메모리 관리에 대한 조언 : this을 읽고 malloc/free 대신 Python 관련 루틴을 사용하는 것이 좋습니다. 대리모가 있다면 배열을 소형화하는 것이 맨 위에 보인다.

업데이트 : 내 자신의 제안을 따르십시오. 삭제 비용이 채워졌습니다. "oneago"는 "thisrow"와 동일했습니다. 잘못된 대답과 두 배의 (-! 손상되지 않은! -) 자유로운 원인 : 포인터의 원형 셔플이 원형이 아니 었습니다.

# twoago, oneago = oneago, thisrow ### BUG ### 
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###

업데이트 2 : [코멘트 용량이 너무 작] 내가 제안 없음 모조, 그냥 일반 보통 디버깅 삽질. "내 수정 프로그램에 집중하는 것"은 "읽을 수없는"것이 아닙니다. 참조 코드는 각 패스에 대해 새 목록을 만듭니다. thisrow은 이전 패스에서 이월 된 항목이 없기 때문에 수행 할 수 있습니다. 실제로 이것을 수행 할 필요는 없으며 첫 번째 요소와 마지막 요소를 제외한 초기화는 임의의 숫자로 구성 될 수 있으며 목록을 채우기 위해서만 존재하므로 일부 항목이 아닌 인덱스로 추가 할 수 있습니다. - 구현이 어렵습니다. 그래서 당신은 여분의 (낭비 된) malloc/free를 수행하는 대신에 "레퍼런스 구현"을 slavishly 에뮬레이트 할 수 있습니다. 그렇지 않으면 아마도 파이썬 특정 구현 세부 사항을 무시하고 레퍼런스 구현만을 아마도 아마도 정확한 대답의 소스로 사용할 수 있습니다. 그런 다음 내 수정 사항을 적용하고 나중에 thisrow 배열의 초기화 대부분을 잘라서 시간을 절약 할 수 있습니다.

업데이트 3 : 다음은 대체 참조 구현입니다. 외부 루프 내부에서 목록 생성의 오버 헤드를 피하기 위해 처음에는 3 개의 행을 할당합니다. 또한 thisrow의 마지막 요소를 제외한 모든 요소의 불필요한 초기화를 피할 수 있습니다. 이렇게하면 C/Cython으로 쉽게 변환 할 수 있습니다.

def damlevref2(seq1, seq2): 
    # For Python 2.x as was the original. 
    # Appears to work on Python 1.5.2 as well :-) 
    seq2len = len(seq2) 
    twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do 
    oneago = [-666] * (seq2len + 1) # ditto 
    thisrow = range(1, seq2len + 1) + [0] 
    for x in xrange(len(seq1)): 
     twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle 
     thisrow[-1] = x + 1 
     for y in xrange(seq2len): 
      delcost = oneago[y] + 1 
      addcost = thisrow[y - 1] + 1 
      subcost = oneago[y - 1] + (seq1[x] != seq2[y]) 
      thisrow[y] = min(delcost, addcost, subcost) 
      if (x > 0 and y > 0 and seq1[x] == seq2[y - 1] 
       and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]): 
       thisrow[y] = min(thisrow[y], twoago[y - 2] + 1) 
    return thisrow[seq2len - 1]

출처

2010-08-07 23:55:21

"tricksy hobbit gimmick"--- _warp() 함수는 인덱스 -1 및 -2가있는 마지막 요소와 마지막 요소 옆에 액세스하는 코드의 원래 아이디어를 유지하기 쉽게 만들었습니다. 그런 식으로 최소한 코드 저글링을 유지할 수 있습니다. 나는 또한 각각 2와 3 개의 부호없는 정수의 최소값을 찾기 위해 두 가지 함수를 쓰는 것을 보면서 정말 바보 같았다. OTH 내가 C/Cython에 익숙하지 않다. 그냥 작동시키고 싶다. LOC/지적 오버 헤드는 무시할 만하다. 테스트는 내일 진행되며 많은 유용한 팁을 제공합니다. – flow

테스트를 기다릴 수 없었다 : 나는 malloc에서 PyMem_Malloc로 전환 한 다음 스와핑을 비활성화했다. 결국 glibc 문제를 일으키는 스와핑이다. 나는 겉으로보기 엔 별칭을 교환하여이 문제를 해결했고, 원래의 튕겨지지 않은 포인터에 PyMem_Free를 호출했다. – flow

"겉으로보기에"등 : 이것은 "화산에서 죽은 닭고기를 흔들며 분출을 멈췄다"와 동일합니다. 순환 이동하는 3 개의 포인터 대신 2 개의 포인터를 교환하는 것이 문제였습니다. 어느 malloc/free 팀이 사용되는지는 문제와 무관합니다. N.B. 나는 위의 설명을 읽기 전에 문제를 발견했습니다. 어떤 경우 든 아직은 단지 반쯤 이해할 수 있습니다 :-) –

이 Damerau-Levenshtein 구현에서 버그를 수정하는 방법은 무엇입니까?

답변

관련 문제