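Basic text similarity via WordNet synsets for taxonomy mapping/merging

I'd like to implement a basic text similarity routine with semantic distance in Python using WordNet and NLTK. The idea is to expand two ideas/phrases/categories A and B into synsets, hyponyms, hypernyms, meronyms, and metonyms, and then calculate the distance between the two resulting vectors a and b. I'm not sure how to calculate that, though; perhaps cosine distance.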
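For the cosine option, here is a minimal sketch of what I have in mind, assuming each phrase has already been expanded into a bag of synset names. The function names `cosine_similarity` and `cosine_distance` are my own placeholders, not NLTK functions, and the vectors are plain occurrence counts.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two bags of hashable items (e.g. synset names)."""
    counts_a, counts_b = Counter(bag_a), Counter(bag_b)
    # Only items present in both bags contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = sqrt(sum(value * value for value in counts_a.values()))
    norm_b = sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def cosine_distance(bag_a, bag_b):
    """Distance form: 0 for bags pointing the same way, 1 for disjoint bags."""
    return 1.0 - cosine_similarity(bag_a, bag_b)
```

Two bags that share no synsets at all get distance 1 here, which is one reason to expand each phrase with related synsets before comparing.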
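In most cases my input data is not written as sentences; it consists of proper nouns or nouns (product names with a brand or product category). For example, I'd like to know that a "resort" is a "luxury hotel", or that "black caviar" is "gourmet": A - "black caviar", B - "gourmet".
This could work to some extent, and it could be made a bit more elaborate by moving one step up and down WordNet through the hypernyms and hyponyms.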
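A rough sketch of that one-step expansion, assuming the objects passed in behave like NLTK `Synset` objects (they expose `hypernyms()` and `hyponyms()`). `expand_one_step` and `demo_with_nltk` are hypothetical names, and dropping duplicates is my own choice.

```python
def expand_one_step(synsets):
    """Return the synsets plus their direct hypernyms and hyponyms.

    Order of first appearance is kept and duplicates are dropped.
    """
    expanded = []
    seen = set()
    for synset in synsets:
        # The synset itself, then one step up, then one step down.
        for candidate in [synset, *synset.hypernyms(), *synset.hyponyms()]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


def demo_with_nltk():
    # Not called automatically: needs nltk plus the WordNet corpus
    # (nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    return expand_one_step(wn.synsets("school", pos=wn.NOUN))
```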
I'm looking for a simple, basic solution that works well enough without using sophisticated things like Who or something.
Should I use something better than WordNet?
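UPDATE:
For each noun phrase I do the following (using NLTK and WordNet):
1. For each word in the phrase I collect its synsets (nouns only), then complement them with the hypernym and hyponym synsets of each one. For now I put all the synsets into one flat list, ignoring the hierarchy.
2. I repeat this process for the keywords describing each category.
3. Now I have a list of synsets for each category and for the target, and I can calculate a distance between them (cosine, or Wu and Palmer distance). I collect the pairwise distances between the two vectors, sum them, and normalize by the number of keywords describing the category or the target. Then I choose the minimum distance.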
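The scoring step can be sketched roughly as follows. This is only an illustration under assumptions: the helper names (`mean_pairwise_score`, `best_category`, `wup`) are made up, the sum is divided by the number of synset pairs (as in the worked example further down) rather than by the number of keywords, and since NLTK's `wup_similarity` is a similarity score (higher means closer), the sketch picks the highest score instead of a minimum.

```python
from itertools import product


def mean_pairwise_score(target_synsets, category_synsets, similarity):
    """Average similarity over every (target, category) synset pair."""
    pairs = list(product(target_synsets, category_synsets))
    if not pairs:
        return 0.0
    # A similarity function may return None when no score is available;
    # those pairs are counted as 0.
    total = sum(similarity(t, c) or 0.0 for t, c in pairs)
    return total / len(pairs)


def best_category(target_synsets, categories, similarity):
    """Return (name, score) for the category with the highest mean score.

    `categories` maps a category name to its expanded synset list.
    """
    scores = {
        name: mean_pairwise_score(target_synsets, synsets, similarity)
        for name, synsets in categories.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


def wup(synset_a, synset_b):
    # Thin wrapper around NLTK's Synset.wup_similarity (Wu and Palmer).
    return synset_a.wup_similarity(synset_b)
```

With real data the call would look like `best_category(target_synsets, {"business": business_synsets, "education": education_synsets}, wup)`, where the three synset lists are hypothetical variables holding the expanded lists.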
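It sounds very basic and inefficient. What would be the next step to make it better?
I'm interested in doing it from scratch, since that is the best way to understand how things work and how they should be done.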
Example: word_list - target: ['school', 'kids', 'teacher']
categories: [['business', 'organization', 'company'], ['education', 'school', 'university']]
Extended list for target concept 'education', 3 keywords: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), ..., Synset('child.n.02'), Synset('teacher.n.01'), Synset('teacher.n.01'), ..., Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), ..., Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]
Extended list for category concept 'business', 3 keywords, 223 synsets in the extended list: [Synset('business.n.01'), Synset('commercial_enterprise.n.02'), ..., Synset('business.n.04'), Synset('business.n.05'), Synset('business.n.06'), Synset('business.n.07'), Synset('client.n.01'), Synset('business.n.09'), Synset('organization.n.01'), Synset('arrangement.n.03'), ..., Synset('organization.n.04'), Synset('organization.n.05'), Synset('organization.n.06'), ..., Synset('company.n.01'), Synset('company.n.02'), Synset('company.n.03'), Synset('company.n.04'), Synset('caller.n.01'), Synset('company.n.06'), Synset('party.n.03'), Synset("ship's_company.n.01"), Synset('company.n.09'), Synset('enterprise.n.02'), Synset('commerce.n.01'), Synset('activity.n.01'), Synset('concern.n.04'), Synset('aim.n.02'), Synset('business_activity.n.01'), Synset('sector.n.02'), Synset('people.n.01'), Synset('acting.n.01'), ..., Synset('structure.n.03'), Synset('body.n.02'), Synset('administration.n.01'), Synset('orderliness.n.01'), ..., Synset('activity.n.01'), Synset('beginning.n.05'), ..., Synset('organization.n.01'), ..., Synset('organization.n.01'), ..., Synset('social_gathering.n.01'), Synset('set.n.05'), Synset('complement.n.03'), Synset('unit.n.03'), ..., Synset('brokerage.n.02'), Synset('carrier.n.05'), Synset('chain.n.04'), Synset('firm.n.01'), Synset('franchise.n.02'), Synset('manufacturer.n.01'), Synset('partnership.n.01'), Synset('processor.n.01'), ..., Synset('underperformer.n.02'), Synset('advertising.n.02'), Synset('agribusiness.n.01'), Synset('butchery.n.02'), Synset('construction.n.07'), Synset('discount_business.n.01'), Synset('employee-owned_enterprise.n.01'), Synset('field.n.06'), ..., Synset('fishing.n.02'), Synset('industry.n.02'), Synset('packaging.n.01'), Synset('print.n.02'), ..., Synset('publication.n.04'), Synset('real-estate_business.n.01'), Synset('storage.n.03'), Synset('tourism.n.01'), ..., Synset('transportation.n.05'), ..., Synset('accounting.n.01'), Synset('appointment.n.05'), Synset('career.n.01'), Synset('catering.n.03'), ..., Synset('confectionery.n.03'), Synset('employ.n.02'), Synset('farming.n.02'), Synset('game.n.10'), ..., Synset('photograph.n.02'), Synset('position.n.06'), Synset('professional.n.02'), Synset('sport.n.02'), ..., Synset('trade.n.02'), Synset('treadmill.n.03'), Synset('occasions.n.01'), Synset('land-office_business.n.01'), ..., Synset('big_business.n.01'), Synset('shtik.n.02'), Synset('adhocracy.n.01'), Synset('affiliate.n.02'), Synset('alliance.n.03'), Synset('association.n.01'), Synset('blue.n.03'), ..., Synset('defense.n.09'), Synset('deputation.n.01'), Synset('enterprise.n.02'), Synset('establishment.n.05'), ..., Synset('federation.n.01'), Synset('fiefdom.n.02'), Synset('fire_brigade.n.01'), Synset('force.n.04'), ..., Synset('grey.n.04'), Synset('hierarchy.n.02'), ..., Synset('institution.n.01'), Synset('line_of_defense.n.01'), Synset('line_organization.n.01'), Synset('machine.n.06'), Synset('machine.n.05'), Synset('musical_organization.n.01'), Synset('nongovernmental_organization.n.01'), Synset('party.n.01'), ..., Synset('pool.n.03'), Synset('professional_organization.n.01'), Synset('quango.n.01'), Synset('polity.n.02'), ..., Synset('tammany_hall.n.01'), Synset('union.n.01'), Synset('unit.n.03'), Synset('calendar.n.01'), ..., Synset('contrivance.n.04'), Synset('coordinate_system.n.01'), Synset('data_structure.n.01'), Synset('design.n.02'), Synset('distribution.n.01'), Synset('genetic_map.n.01'), Synset('kinship_system.n.01'), Synset('lattice.n.01'), ..., Synset('ontology.n.01'), Synset('county_council.n.01'), Synset('curia.n.01'), Synset('executive.n.02'), Synset('government_officials.n.01'), Synset('judiciary.n.01'), ..., Synset('top_brass.n.01'), Synset('nonprofit_organization.n.01'), Synset('rationalization.n.04'), Synset('reorganization.n.01'), ..., Synset('syndication.n.01'), Synset('listing.n.02'), Synset('order.n.15'), ..., Synset('collectivization.n.01'), ..., Synset('federation.n.03'), Synset('unionization.n.01'), Synset('broadcasting_company.n.01'), Synset('bureau_de_change.n.01'), Synset('car_company.n.01'), Synset('closed_shop.n.01'), Synset('corporate_investor.n.01'), Synset('distributor.n.03'), ..., Synset('drug_company.n.01'), Synset('east_india_company.n.01'), Synset('electronics_company.n.01'), Synset('film_company.n.01'), Synset('food_company.n.01'), Synset('furniture_company.n.01'), Synset('holding_company.n.01'), Synset('joint-stock_company.n.01'), Synset('limited_company.n.01'), ..., Synset('oil_company.n.01'), Synset('open_shop.n.01'), Synset('mining_company.n.01'), Synset('mover.n.04'), Synset('printing_company.n.01'), Synset('pipeline_company.n.01'), Synset('printing_concern.n.01'), Synset('record_company.n.01'), Synset('shipper.n.02'), Synset('shipping_company.n.01'), Synset('steel_company.n.01'), Synset('stock_company.n.01'), ..., Synset('subsidiary_company.n.01'), Synset('target_company.n.01'), Synset('think_tank.n.01'), Synset('transportation_company.n.01'), ..., Synset('freemasonry.n.01'), Synset('ballet_company.n.01'), Synset('white_knight.n.01'), Synset('trainband.n.01'), Synset('chorus.n.05'), Synset('circus.n.01'), Synset('minstrel_show.n.01'), Synset('minstrelsy.n.01'), ..., Synset('theater_company.n.01'), Synset('attendance.n.03'), Synset('cohort.n.01'), Synset('number.n.07'), Synset('fatigue_party.n.01'), Synset('landing_party.n.01'), ..., Synset('rescue_party.n.01'), Synset('search_party.n.01'), Synset('stretcher_party.n.01'), Synset('war_party.n.01')]
Extended list for category concept 'education' - 97 synsets: [Synset('education.n.01'), Synset('education.n.02'), Synset('education.n.03'), Synset('education.n.04'), Synset('education.n.05'), Synset('department_of_education.n.01'), Synset('school.n.01'), ..., Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('university.n.01'), Synset('university.n.02'), Synset('university.n.03'), ..., Synset('content.n.05'), Synset('learning.n.01'), Synset('profession.n.02'), Synset('upbringing.n.01'), Synset('executive_department.n.01'), Synset('educational_institution.n.01'), Synset('building.n.01'), ..., Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('body.n.02'), Synset('time_period.n.01'), ..., Synset('educational_institution.n.01'), Synset('coeducation.n.01'), ..., Synset('body.n.02'), ..., Synset('course.n.01'), Synset('elementary_education.n.01'), Synset('extension.n.04'), Synset('extracurricular_activity.n.01'), Synset('work-study_program.n.01'), Synset('higher_education.n.01'), Synset('secondary_education.n.01'), Synset('team_teaching.n.01'), Synset('eruditeness.n.01'), Synset('experience.n.01'), Synset('foundation.n.04'), Synset('physical_education.n.01'), Synset('acculturation.n.03'), Synset('mastering.n.01'), Synset('school.n.03'), Synset('self-education.n.01'), ..., Synset('vocational_training.n.01'), Synset('teaching.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), ..., Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), ..., Synset('technical_school.n.01'), Synset('training_school.n.01'), ..., Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), ..., Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), ..., Synset('secession.n.01'), Synset('gown.n.02'), Synset('varsity.n.01'), Synset('city_university.n.01'), Synset('oxbridge.n.01'), Synset('redbrick_university.n.01'), Synset('multiversity.n.01'), Synset('open_university.n.01')]
Extended list for my target, 57 synsets: [..., Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), ..., Synset('teacher.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), ..., Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('dancing_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), ..., Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), ..., Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), ..., Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), ..., Synset('secession.n.01')]
Now I calculate the pairwise Wu and Palmer distances for target and business (57 x 223) and for target and education (57 x 97).
With the 3 vectors (target, business, education), I sum the pairwise values and divide by the number of pairs: 57 x 223 = 12711 for target and business, 57 x 97 = 5529 for target and education.
Target : education distance: 2305.709117171037/5529 = 0.9125370052417936
Target : business distance: 5045.417101981877/12711 = 0.39693313680921066
The closest match is education, which is the correct answer.