I need a function that returns the indices at which a list of strings best fits into a larger string. For example, given the string:
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
and the list of strings:
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
can a function be written that yields:
indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
Here is a script I made to illustrate the point:
import numpy as np
# I need a function which takes a string and the tokenized list
# and returns the indices at which the tokens were split
def index_of_split(text_str, list_of_strings):
    # ?????
    return indices
# The text string, string token list, and character binary annotations
# are all given
text = 'Kir4.3 is a inwardly-rectifying potassium channel. Dextran-sulfate is useful in glucose-mediated channels.'
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel','.', 'Dextran-sulfate', 'is', 'useful' ,'in', 'glucose','-', 'mediated', 'channels','.']
# (This binary array labels the following terms ['Kir4.3', 'Dextran-sulfate', 'glucose'])
bin_ann = [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
# Here we would apply our function
indices = index_of_split(text, tok)
# This list is the desired output
#indices = [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
# We could now split the binary array based on these indices
bin_ann_toked = np.split(bin_ann, indices)
# and combine with the tokenized list
tokenized_strings = np.vstack((tok, bin_ann_toked)).T
# Then we can remove the trailing zeros,
# which are likely caused by spaces
# or other non-tokenized text
for i, el in enumerate(tokenized_strings):
    tokenized_strings[i][1] = el[1][:len(el[0])]
print(tokenized_strings)
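One possible way to fill in the placeholder, offered as a minimal sketch rather than a definitive answer: walk through the tokens in order and look each one up with `str.find`, starting the search where the previous token ended. This assumes every token appears verbatim in the text, in order and without overlap; if the tokenizer rewrites or drops characters, the lookup raises an error instead of guessing.

```python
def index_of_split(text_str, list_of_strings):
    """Return the start offset of every token after the first.

    Assumption: tokens occur verbatim in text_str, in order, without overlap.
    """
    starts = []
    cursor = 0  # position in text_str where the next search begins
    for token in list_of_strings:
        start = text_str.find(token, cursor)
        if start == -1:
            raise ValueError(
                f"token {token!r} not found at or after offset {cursor}"
            )
        starts.append(start)
        cursor = start + len(token)  # continue after the matched token
    # np.split only needs the interior cut points, so drop the first start
    return starts[1:]


text = ('Kir4.3 is a inwardly-rectifying potassium channel. '
        'Dextran-sulfate is useful in glucose-mediated channels.')
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel',
       '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-',
       'mediated', 'channels', '.']

print(index_of_split(text, tok))
# [7, 10, 12, 32, 42, 49, 51, 67, 70, 77, 80, 87, 88, 97, 105]
```

Because the search only relies on the tokens themselves, it does not need to know why the tokenizer kept one hyphen and split another. Note that dropping the first offset is only safe when the first token starts at position 0, as it does here.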
Assuming the function works as described, the script would produce the following output:
[['Kir4.3' array([1, 1, 1, 1, 1, 1])]
['is' array([0, 0])]
['a' array([0])]
['inwardly-rectifying'
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])]
['potassium' array([0, 0, 0, 0, 0, 0, 0, 0, 0])]
['channel' array([0, 0, 0, 0, 0, 0, 0])]
['.' array([0])]
['Dextran-sulfate' array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]
['is' array([0, 0])]
['useful' array([0, 0, 0, 0, 0, 0])]
['in' array([0, 0])]
['glucose' array([1, 1, 1, 1, 1, 1, 1])]
['-' array([0])]
['mediated' array([0, 0, 0, 0, 0, 0, 0, 0])]
['channels' array([0, 0, 0, 0, 0, 0, 0, 0])]
['.' array([0])]]
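A variation on the same idea, again only a sketch: if each token's start and end offsets are recorded, the annotation can be sliced per token directly, so there are no trailing zeros to trim afterwards. Plain Python lists are used here because the per-token pieces have different lengths, and recent NumPy releases may reject such ragged input unless it is handled as an object array. The helper name `token_spans` is hypothetical, and `bin_ann` is rebuilt from the three labelled terms to keep the block short.

```python
def token_spans(text_str, tokens):
    """Yield (token, start, end) for tokens that occur in order in text_str."""
    cursor = 0
    for token in tokens:
        start = text_str.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        cursor = start + len(token)
        yield token, start, cursor


text = ('Kir4.3 is a inwardly-rectifying potassium channel. '
        'Dextran-sulfate is useful in glucose-mediated channels.')
tok = ['Kir4.3', 'is', 'a', 'inwardly-rectifying', 'potassium', 'channel',
       '.', 'Dextran-sulfate', 'is', 'useful', 'in', 'glucose', '-',
       'mediated', 'channels', '.']

# Character-level annotation: 1 inside a labelled term, 0 elsewhere
bin_ann = [0] * len(text)
for term in ('Kir4.3', 'Dextran-sulfate', 'glucose'):
    begin = text.index(term)
    bin_ann[begin:begin + len(term)] = [1] * len(term)

# Pair every token with exactly the labels covering its characters
labelled = [(token, bin_ann[start:end])
            for token, start, end in token_spans(text, tok)]

for token, labels in labelled:
    print(token, labels)
```

Each label list has the same length as its token, so whitespace between tokens never ends up in the result.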
Why does 'Dextran-sulfate' keep its hyphen when 'glucose', '-', 'mediated' does not? Unless you know when each case occurs, you would have to hard-code the function.