2016-07-08 15 views

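Basic text similarity via WordNet synsets for taxonomy mapping/merging

I want to implement a basic text similarity routine with semantic distance in Python, using WordNet and NLTK. The idea is to expand two ideas/phrases/categories A and B into synsets, hyponyms, hypernyms, meronyms, and metonyms, and then calculate the distance between the two resulting vectors a and b. I am not sure how to calculate that distance; maybe cosine distance.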
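One way to read "distance between the two formed vectors" is to treat each expansion as a bag of synset names and compare the bags with cosine distance. The following is a minimal sketch under that assumption; the two expansions are hand-written for illustration and do not come from a real WordNet lookup.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two bags of items (e.g. synset names)."""
    va, vb = Counter(a), Counter(b)
    # Only items present in both bags contribute to the dot product.
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance: 0 for identical bags, 1 for bags with no overlap."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical expansions, written by hand for illustration only.
a = ["school.n.01", "educational_institution.n.01", "academy.n.03"]
b = ["university.n.01", "educational_institution.n.01", "academy.n.03"]
print(round(cosine_distance(a, b), 3))
```

Note that this measure only rewards exact overlap between the two expansions. It ignores how close two different synsets are in the hierarchy, which is what a path-based measure such as Wu and Palmer adds.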

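In most cases my input data is not written as sentences; it consists of proper nouns or nouns (product names with a brand, or product categories). For example, I want to find out that a "resort" is a "luxury hotel", or that "black caviar" is "gourmet" (A = "black caviar", B = "gourmet").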

This may work to some extent, and it could be made a little more sophisticated by moving up and down WordNet one level at a time through hyponyms and hypernyms.
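Moving up and down the hierarchy one level at a time amounts to a depth-limited breadth-first expansion. Here is a small sketch of that idea; the toy taxonomy and the `toy_neighbors` helper are made up for illustration. With NLTK, the `neighbors` argument could presumably be something like `lambda s: s.hypernyms() + s.hyponyms()`.

```python
from collections import deque


def expand(start, neighbors, max_depth=1):
    """Breadth-first expansion: map each reached node to its number of steps from start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not walk past the requested depth
        for nxt in neighbors(node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


# Toy taxonomy (invented for this example): child -> parent links.
PARENT = {
    "luxury_hotel": "hotel",
    "resort": "hotel",
    "hotel": "building",
    "motel": "building",
}


def toy_neighbors(node):
    """One step up (the parent) and one step down (the children)."""
    up = [PARENT[node]] if node in PARENT else []
    down = [child for child, parent in PARENT.items() if parent == node]
    return up + down


print(expand("resort", toy_neighbors, max_depth=1))
print(expand("resort", toy_neighbors, max_depth=2))
```

Keeping the step count for every reached node makes it possible to weight distant relatives less than direct ones later, instead of flattening everything into one list.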

I am looking for a simple, basic solution that works well enough, without using sophisticated things like Who or something similar.

Should I use something better than WordNet?


UPDATE:

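I process each noun phrase (using NLTK and WordNet) as follows:

1. For each word in the phrase I collect its synsets (nouns only) and complement them with the hypernym and hyponym synsets of each element. For now I put all synsets into one flat list, ignoring the hierarchy.
2. I repeat this process for the keywords that describe each category.
3. Now I have a list of synsets for each category and for the target, so I only need to calculate the distance between them (cosine, or Wu and Palmer distance). I collect the pairwise distances between the two vectors, sum them, and normalize by the number of keywords describing the category or the target. Then I pick the minimum distance.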
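The three steps above could be sketched roughly as follows. This is an illustration, not the code actually used: the function names are made up, it assumes NLTK with the `wordnet` corpus installed, and it normalizes by the number of synset pairs (as in the worked numbers further down) rather than by the number of keywords.

```python
from itertools import product


def expand_keywords(keywords):
    """Steps 1-2: noun synsets of each keyword plus their direct hypernyms and hyponyms."""
    # Imported lazily; requires nltk and its 'wordnet' corpus to be installed.
    from nltk.corpus import wordnet as wn

    expanded = []
    for word in keywords:
        for synset in wn.synsets(word, pos=wn.NOUN):
            expanded.append(synset)
            expanded.extend(synset.hypernyms())
            expanded.extend(synset.hyponyms())
    return expanded  # flat list, hierarchy ignored


def wup(s1, s2):
    """Wu-Palmer similarity; treated as 0.0 when NLTK returns None."""
    return s1.wup_similarity(s2) or 0.0


def mean_pairwise(target, category, similarity=wup):
    """Step 3: sum the similarity over all target x category pairs, divided by the pair count."""
    if not target or not category:
        return 0.0
    total = sum(similarity(t, c) for t, c in product(target, category))
    return total / (len(target) * len(category))


def score_categories(target_keywords, categories):
    """Return {category_name: normalized score} for a dict of keyword lists."""
    target = expand_keywords(target_keywords)
    return {
        name: mean_pairwise(target, expand_keywords(keywords))
        for name, keywords in categories.items()
    }


# Example call (needs the WordNet corpus, so it is left commented out):
# score_categories(["school", "kids", "teacher"],
#                  {"business": ["business", "organization", "company"],
#                   "education": ["education", "school", "university"]})
```

Because `wup_similarity` is a similarity rather than a distance, a larger score here means a closer match. Passing a different `similarity` callable makes it easy to try other WordNet measures without touching the aggregation.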

This sounds very basic and inefficient. What would be the next step to make it better?

I am interested in building it from scratch, because that is the best way to understand how things work and how they should be done.


Example: word_list - target: ['school', 'kids', 'teacher']

Categories: [['business', 'organization', 'company'], ['education', 'school', 'university']]

Expanded list for target concept 'education', 3 keywords: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]

Expanded list for category concept 'business', 3 keywords, 223 synsets in the expanded list: [Synset('business.n.01'), Synset('commercial_enterprise.n.02'), Synset('occupation.n.01'), Synset('business.n.04'), Synset('business.n.05'), Synset('business.n.06'), Synset('business.n.07'), Synset('client.n.01'), Synset('business.n.09'), Synset('organization.n.01'), Synset('arrangement.n.03'), Synset('administration.n.02'), Synset('organization.n.04'), Synset('organization.n.05'), Synset('organization.n.06'), Synset('company.n.01'), Synset('company.n.02'), Synset('company.n.03'), Synset('company.n.04'), Synset('caller.n.01'), Synset('company.n.06'), Synset('party.n.03'), Synset("ship's_company.n.01"), Synset('company.n.09'), Synset('enterprise.n.02'), Synset('commerce.n.01'), Synset('activity.n.01'), Synset('concern.n.04'), Synset('aim.n.02'), Synset('business_activity.n.01'), Synset('sector.n.02'), Synset('people.n.01'), Synset('acting.n.01'), Synset('structure.n.03'), Synset('body.n.02'), Synset('administration.n.01'), Synset('orderliness.n.01'), Synset('activity.n.01'), Synset('beginning.n.05'), Synset('organization.n.01'), Synset('friendship.n.01'), Synset('organization.n.01'), Synset('social_gathering.n.01'), Synset('set.n.05'), Synset('complement.n.03'), Synset('unit.n.03'), Synset('brokerage.n.02'), Synset('carrier.n.05'), Synset('chain.n.04'), Synset('firm.n.01'), Synset('franchise.n.02'), Synset('manufacturer.n.01'), Synset('partnership.n.01'), Synset('processor.n.01'), Synset('shipbuilder…'), Synset('underperformer.n.02'), Synset('advertising.n.02'), Synset('agribusiness.n.01'), Synset('butchery.n.02'), Synset('construction.n.07'), Synset('discount_business.n.01'), Synset('employee-owned_enterprise.n.01'), Synset('field.n.06'), Synset('fishing.n.02'), Synset('industry.n.02'), Synset('packaging.n.01'), Synset('print.n.02'), Synset('publication.n.04'), Synset('real-estate_business.n.01'), Synset('storage.n.03'), Synset('tourism.n.01'), Synset('transportation.n.05'), Synset('vent…'), Synset('accounting.n.01'), Synset('appointment.n.05'), Synset('career.n.01'), Synset('catering.n.03'), Synset('confectionery.n.03'), Synset('employ.n.02'), Synset('farming.n.02'), Synset('game.n.10'), Synset('photograph.n.02'), Synset('position.n.06'), Synset('professional.n.02'), Synset('sport.n.02'), Synset('trade.n.02'), Synset('treadmill.n.03'), Synset('occasions.n.01'), Synset('land-office_business.n.01'), Synset('big_business.n.01'), Synset('shtik.n.02'), Synset('adhocracy.n.01'), Synset('affiliate.n.02'), Synset('alliance.n.03'), Synset('association.n.01'), Synset('blue.n.03'), Synset('defense.n.09'), Synset('deputation.n.01'), Synset('enterprise.n.02'), Synset('establishment.n.05'), Synset('federation.n.01'), Synset('fiefdom.n.02'), Synset('fire_brigade.n.01'), Synset('force.n.04'), Synset('girl_scouts.n.01'), Synset('grey.n.04'), Synset('hierarchy.n.02'), Synset('ho…'), Synset('institution.n.01'), Synset('line_of_defense.n.01'), Synset('line_organization.n.01'), Synset('machine.n.06'), Synset('machine.n.05'), Synset('musical_organization.n.01'), Synset('nongovernmental_organization.n.01'), Synset('party.n.01'), Synset('peace_corps.n.01'), Synset('pool.n.03'), Synset('professional_organization.n.01'), Synset('quango.n.01'), Synset('polity.n.02'), Synset('tammany_hall.n.01'), Synset('union.n.01'), Synset('unit.n.03'), Synset('calendar.n.01'), Synset('classification_system.n.01'), Synset('contrivance.n.04'), Synset('coordinate_system.n.01'), Synset('data_structure.n.01'), Synset('design.n.02'), Synset('distribution.n.01'), Synset('genetic_map.n.01'), Synset('kinship_system.n.01'), Synset('lattice.n.01'), Synset('living_arrangement.n.01'), Synset('ontology.n.01'), Synset('county_council.n.01'), Synset('curia.n.01'), Synset('executive.n.02'), Synset('government_officials.n.01'), Synset('judiciary.n.01'), Synset('top_brass.n.01'), Synset('nonprofit_organization.n.01'), Synset('rationalization.n.04'), Synset('reorganization.n.01'), Synset('self-organization.n.01'), Synset('syndication.n.01'), Synset('listing.n.02'), Synset('order.n.15'), Synset('randomization.n.01'), Synset('collectivization.n.01'), Synset('federation.n.03'), Synset('unionization.n.01'), Synset('broadcasting_company.n.01'), Synset('bureau_de_change.n.01'), Synset('car_company.n.01'), Synset('closed_shop.n.01'), Synset('corporate_investor.n.01'), Synset('distributor.n.03'), Synset('drug_company.n.01'), Synset('east_india_company.n.01'), Synset('electronics_company.n.01'), Synset('film_company.n.01'), Synset('food_company.n.01'), Synset('furniture_company.n.01'), Synset('holding_company.n.01'), Synset('joint-stock_company.n.01'), Synset('limited_company.n.01'), Synset('oil_company.n.01'), Synset('open_shop.n.01'), Synset('mining_company.n.01'), Synset('mover.n.04'), Synset('printing_company.n.01'), Synset('pipeline_company.n.01'), Synset('printing_concern.n.01'), Synset('record_company.n.01'), Synset('shipper.n.02'), Synset('shipping_company.n.01'), Synset('steel_company.n.01'), Synset('stock_company.n.01'), Synset('subsidiary_company.n.01'), Synset('target_company.n.01'), Synset('think_tank.n.01'), Synset('transportation_company.n.01'), Synset('union_shop.n.01'), Synset('freemasonry.n.01'), Synset('ballet_company.n.01'), Synset('white_knight.n.01'), Synset('trainband.n.01'), Synset('chorus.n.05'), Synset('circus.n.01'), Synset('minstrel_show.n.01'), Synset('minstrelsy.n.01'), Synset('opera_company.n.01'), Synset('theater_company.n.01'), Synset('attendance.n.03'), Synset('cohort.n.01'), Synset('number.n.07'), Synset('fatigue_party.n.01'), Synset('landing_party.n.01'), Synset('party_to_the_action.n.01'), Synset('rescue_party.n.01'), Synset('search_party.n.01'), Synset('stretcher_party.n.01'), Synset('war_party.n.01')]

Expanded list for category concept 'education' - 97 synsets: [Synset('education.n.01'), Synset('education.n.02'), Synset('education.n.03'), Synset('education.n.04'), Synset('education.n.05'), Synset('department_of_education.n.01'), Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('university.n.01'), Synset('university.n.02'), Synset('university.n.03'), Synset('content.n.05'), Synset('learning.n.01'), Synset('profession.n.02'), Synset('upbringing.n.01'), Synset('executive_department.n.01'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('coeducation.n.01'), Synset('continuing_education.n.01'), Synset('body.n.02'), Synset('course.n.01'), Synset('elementary_education.n.01'), Synset('extension.n.04'), Synset('extracurricular_activity.n.01'), Synset('work-study_program.n.01'), Synset('higher_education.n.01'), Synset('secondary_education.n.01'), Synset('team_teaching.n.01'), Synset('eruditeness.n.01'), Synset('experience.n.01'), Synset('foundation.n.04'), Synset('physical_education.n.01'), Synset('acculturation.n.03'), Synset('mastering.n.01'), Synset('school.n.03'), Synset('self-education.n.01'), Synset('vocational_training.n.01'), Synset('teaching.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01'), Synset('gown.n.02'), Synset('varsity.n.01'), Synset('city_university.n.01'), Synset('oxbridge.n.01'), Synset('redbrick_university.n.01'), Synset('multiversity.n.01'), Synset('open_university.n.01')]

Expanded list for my target, 57 synsets: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]

Now calculate the pairwise Wu and Palmer distances between the target (57 synsets) and each category: business (223) and education (97).


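I have 3 vectors. The summed pairwise distances are divided by 57 x 223 = 12711 for target vs. business and by 57 x 97 = 5529 for target vs. education.

Target : education distance: 2305.709117171037 / 5529 = 0.9125370052417936
Target : business distance: 5045.417101981877 / 12711 = 0.39693313680921066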
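The normalization used in these figures can be written out as follows. This is a reading of the arithmetic above, assuming the sum runs over every target/category synset pair; the Wu and Palmer formula shown is the usual textbook definition, where lcs is the deepest common ancestor of the two synsets, and NLTK's exact depth counting may differ slightly.

```latex
\[
\operatorname{score}(T, C) \;=\; \frac{1}{|T|\,|C|} \sum_{t \in T} \sum_{c \in C} \operatorname{wup}(t, c),
\qquad
\operatorname{wup}(t, c) \;=\; \frac{2 \cdot \operatorname{depth}\bigl(\operatorname{lcs}(t, c)\bigr)}{\operatorname{depth}(t) + \operatorname{depth}(c)}
\]
```

Here |T| = 57 for the target, and |C| is 223 for business or 97 for education, which gives the 12711 and 5529 pair counts.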

The minimum distance is education, which is the correct answer.
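One detail worth making explicit when picking the winner: NLTK's `wup_similarity` returns a similarity, so a larger value means a closer match, and a distance would be something like `1 - similarity`. The sketch below uses the two normalized scores quoted above and shows both conventions; `closest_category` is a made-up helper name.

```python
def closest_category(scores, higher_is_closer=True):
    """Pick the best category from a {name: score} mapping."""
    if not scores:
        raise ValueError("no categories to choose from")
    pick = max if higher_is_closer else min
    return pick(scores, key=scores.get)


# Normalized scores as quoted in the question.
scores = {"education": 0.9125370052417936, "business": 0.39693313680921066}

# Treating the scores as similarities: the largest one wins.
print(closest_category(scores))  # education

# Converting to distances first: the smallest one wins.
distances = {name: 1.0 - value for name, value in scores.items()}
print(closest_category(distances, higher_is_closer=False))  # education
```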

Answers


WordNet plus some similarity measure can be a solution. You could also use Word2Vec to check the semantic distance of the words you get from the WordNet synset/*nym lookup.
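A rough sketch of that second suggestion: take the words found through WordNet and rank them by embedding similarity to the target word. The helper name `rank_by_embedding` is hypothetical. It only assumes an object that supports `word in vectors` and `vectors.similarity(w1, w2)`, which is what gensim's `KeyedVectors` provides; the model path in the comment is a placeholder.

```python
def rank_by_embedding(target, candidates, vectors):
    """Rank candidate words by embedding similarity to target, most similar first.

    Words missing from the embedding vocabulary are skipped.
    """
    if target not in vectors:
        return []
    scored = [
        (word, float(vectors.similarity(target, word)))
        for word in candidates
        if word in vectors
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Possible usage with gensim (assumes a pretrained model file exists locally):
# from gensim.models import KeyedVectors
# vectors = KeyedVectors.load_word2vec_format("path/to/vectors.bin", binary=True)
# rank_by_embedding("school", ["university", "company"], vectors)
```

Candidates with a low score could then be dropped from the WordNet expansion before the pairwise distances are computed, which should reduce noise from unrelated senses.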

Maybe someone else can help with a specific library (nothing I could use directly comes to mind at the moment).