First, define what polysemy is.
Polysemy: The coexistence of many possible meanings for a word or phrase.
(Source: https://www.google.com/search?q=polysemy)
From WordNet:
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
There are a few WordNet terms we should be familiar with:
Synset: a distinct concept/meaning
Lemma: a root form of a word
Part-Of-Speech (POS): the linguistic category of a word
Word: a surface form of a word (surface words are not in WordNet)
(Note: @alexis has a good answer on lemma vs synset: https://stackoverflow.com/a/42050466/610569; see also https://stackoverflow.com/a/23715743/610569 and https://stackoverflow.com/a/29478711/610569)
In code:
from nltk.corpus import wordnet as wn

# Given a word "run"
word = 'run'

# We get multiple meanings (i.e. synsets) for
# the word in WordNet.
for synset in wn.synsets(word):
    # Each synset comes with an ID.
    offset = str(synset.offset()).zfill(8)
    # Each meaning comes with its
    # linguistic category (i.e. POS).
    pos = synset.pos()
    # Usually, offset + POS is the way
    # to index a synset.
    idx = offset + '-' + pos
    # Each meaning also comes with its
    # distinct definition.
    definition = synset.definition()
    # For each meaning, there are multiple
    # root words (i.e. lemmas).
    lemmas = ','.join(synset.lemma_names())
    print('\t'.join([idx, definition, lemmas]))
[OUT] :
00189565-n a score in baseball made by a runner touching all four bases safely run,tally
00791078-n the act of testing something test,trial,run
07460104-n a race run on foot footrace,foot_race,run
00309011-n a short trip run
01926311-v move fast by using one's feet, with one foot off the ground at any given time run
02075049-v flee; take to one's heels; cut and run scat,run,scarper,turn_tail,lam,run_away,hightail_it,bunk,head_for_the_hills,take_to_the_woods,escape,fly_the_coop,break_away
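The offset-POS identifier printed in the first column can be taken apart again with plain string handling. This is a minimal sketch, assuming the ID always has the zero-padded `NNNNNNNN-p` shape shown in the output above; `format_synset_id` and `parse_synset_id` are hypothetical helper names, not part of NLTK.

```python
def format_synset_id(offset, pos):
    """Build the zero-padded 'offset-POS' string used in the output above."""
    return str(offset).zfill(8) + '-' + pos


def parse_synset_id(idx):
    """Split an 'offset-POS' string back into (integer offset, POS tag)."""
    offset, pos = idx.split('-')
    return int(offset), pos


# Round-trip one of the IDs from the output above.
offset, pos = parse_synset_id('01926311-v')
print(offset, pos)
print(format_synset_id(offset, pos))
```

Recent NLTK releases also provide `wn.synset_from_pos_and_offset(pos, offset)`, which should map such a pair back to its synset; check this against the NLTK version you have installed.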
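Going back to the question: how do we "compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet"?
Since we are working with WordNet, surface words are out of the picture and we are left with lemmas only.
First, we need to define which lemmas belong to the nouns, verbs, adjectives and adverbs.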
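from nltk.corpus import wordnet as wn
from collections import defaultdict

# Map each POS tag to the set of lemma names found under it.
words_by_pos = defaultdict(set)
for synset in wn.all_synsets():
    pos = synset.pos()
    for lemma in synset.lemmas():
        # Store the lemma's name (a string), not the Lemma object, so the
        # sets can be intersected across POS and passed to wn.synsets().
        words_by_pos[pos].add(lemma.name())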
But this is a simplistic view of the relation between POS and lemmas:
# There are 5 POS.
>>> words_by_pos.keys()
dict_keys(['a', 's', 'r', 'n', 'v'])
# Some words have multiple POS tags =(
>>> len(words_by_pos['n'])
119034
>>> len(words_by_pos['v'])
11531
>>> len(words_by_pos['n'].intersection(words_by_pos['v']))
4062
Let's see whether we can ignore that and move on:
# Let's look at the verb 'v' category.
num_meanings_per_verb = []
for word in words_by_pos['v']:
    # No. of meanings for a word given a POS.
    num_meaning = len(wn.synsets(word, pos='v'))
    num_meanings_per_verb.append(num_meaning)
print(sum(num_meanings_per_verb) / len(num_meanings_per_verb))
[OUT] :
2.1866273523545225
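What does the number mean? (If it means anything at all.)
It means that, on average, every verb in WordNet has about 2 meanings, ignoring the fact that some words have many more meanings than that.
It probably means something, but look at the counts of the values in num_meanings_per_verb. (Note that the counts shown here total 119,034, the size of words_by_pos['n'] above, so this particular distribution appears to come from the same loop run with pos='n'.)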
Counter({1: 101168,
2: 11136,
3: 3384,
4: 1398,
5: 747,
6: 393,
7: 265,
8: 139,
9: 122,
10: 85,
11: 74,
12: 39,
13: 29,
14: 10,
15: 19,
16: 10,
17: 6,
18: 2,
20: 5,
26: 1,
30: 1,
33: 1})
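To see how strongly the single-sense lemmas pull the mean down, the distribution above can be summarised directly. This is a minimal sketch that uses only the counts printed above; `summarize_polysemy` is a hypothetical helper, and the "polysemous only" mean is an extra statistic that the answer itself does not compute.

```python
from collections import Counter

# Sense-count distribution copied from the Counter printed above:
# {number of meanings: number of lemmas with that many meanings}
distribution = Counter({
    1: 101168, 2: 11136, 3: 3384, 4: 1398, 5: 747, 6: 393, 7: 265,
    8: 139, 9: 122, 10: 85, 11: 74, 12: 39, 13: 29, 14: 10, 15: 19,
    16: 10, 17: 6, 18: 2, 20: 5, 26: 1, 30: 1, 33: 1,
})


def summarize_polysemy(dist):
    """Summarise a {sense count: lemma count} distribution."""
    total_lemmas = sum(dist.values())
    total_senses = sum(n * count for n, count in dist.items())
    monosemous = dist.get(1, 0)            # lemmas with exactly one meaning
    polysemous = total_lemmas - monosemous  # lemmas with two or more
    return {
        'lemmas': total_lemmas,
        'mean': total_senses / total_lemmas,
        'monosemous_share': monosemous / total_lemmas,
        # Mean over lemmas that have more than one meaning.
        'mean_polysemous_only': (
            (total_senses - monosemous) / polysemous if polysemous else 0.0
        ),
    }


summary = summarize_polysemy(distribution)
for key, value in summary.items():
    print(key, value)
```

The same function can be fed `Counter(num_meanings_per_verb)`, or the equivalent list built for any other POS, to compare the categories on equal terms.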
Please show the full traceback. – BrenBarn
Perhaps `synset.lemma_names` should be `synset.lemma_names()`. – BrenBarn
I adjusted it, but I still get the same error. – Anna