2016-08-25 3 views

Python URL with unexpected "amp" and ";"

I have a problem when web scraping with Python.

Code:

from urllib.request import urlopen 
from urllib.request import urlretrieve 
from bs4 import BeautifulSoup 
import urllib.error 
import http.cookiejar,requests,pymysql,json ,re 
session = requests.Session() 
monthurl = 'http://search.proquest.com/publication.publicationissuebrowse:drilldown/month/%E5%85%AB%E6%9C%88/08/year/2016/parentmonth082016' 
payload = {"site": "news","t:ac" : "publications_105983"} 
headers = {'user-agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0','Accept':'text/javascript, text/html,application/xml, text/xml, */*',\ 
     'Accept-Encoding':'gzip, deflate','Accept-Language':'zh-CN,zh;q=0.8','Host':'search.proquest.com', 'Content-type':'application/x-www-form-urlencoded; charset=UTF-8', 'Connection':'keep-alive','Content-Length':'0','Origin':'http://search.proquest.com','Referer':'http://search.proquest.com/news/publication/105983/citation/99D2C84D41804033PQ/2?accountid=13818','X-Prototype-Version':'1.7','X-Requested-With':'XMLHttpRequest',\ 
     'Cookie':'availability-zone=us-east-1a; mwtbid=830706AE-9389-4BB4-812D-B597683B812E; _ga=GA1.2.1201070524.1446763952; fsr.r=%7B%22d%22%3A90%2C%22i%22%3A%22de07553-78769885-bcc1-4823-67c96%22%2C%22e%22%3A1467984529571%7D; fulltextShowAll=YES; oneSearchTZ=480; authenticatedBy=IP; availability-zone=us-east-1a; _gat_UA-61126923-3=1; JSESSIONID=69A1CC852FF1123A9A78CFC18E2B6AFF.i-b86ebbb9; OS_VWO_COUNTRY=CN; OS_VWO_INSTITUTION=13818; OS_VWO_LANGUAGE=zho; OS_VWO_MY_RESEARCH=false; OS_VWO_REFERRING_URL=""; OS_VWO_REQUESTED_URL="http://search.proquest.com/news/publication/105983/citation/8558F5818C234BCFPQ/2?accountid=13818"; OS_PERSISTENT="wrPZtfJDrH0WIWT5cZZs+CwLAAUhJMHD++Vls3rVx5E="; OS_VWO_VISITOR_TYPE=returning; AWSELB=C393A78D02CA3EE2799CF8894B23627240E8CACE66D1C0BB8AD720DF21EC8ACE1D897A32BEBC089642A0472335D0E12E2E117186F0CCDBF88A5E8AB2CD9F31FA13EA9CDBB3A68FF4DB78B55F4406384017E95C9573; AppVersion=r20161.6.0.834.574; _vwo_uuid_v2=0308785C38305F47209E7EC8811AC0A2|3ec2dd2ac5e7bfcc195a554e24406f22; osTimestamp=1472090234.391; WT_FPC=id=202.120.14.195-2899434048.30480412:lv=1472043437504:ss=1472043437504; 
fsr.s=%7B%22cp%22%3A%7B%22Usage_Session%22%3A%2220160825015947140%3A312846%22%2C%22cxreplayaws%22%3A%22true%22%2C%22Error_Page%22%3A%22no%22%2C%22No_Results%22%3A%22no%22%2C%22My_Research%22%3A%22no%22%2C%22Advanced%22%3A%22no%22%2C%22Professional%22%3A%22no%22%2C%22User_IP%22%3A%22202.120.19.186%22%2C%22Session_ID%22%3A%2269A1CC852FF1123A9A78CFC18E2B6AFF.i-b86ebbb9%22%2C%22Account_ID%22%3A%2213818%22%7D%2C%22v1%22%3A-2%2C%22v2%22%3A-2%2C%22rid%22%3A%22de07553-78562942-af91-5f91-ed200%22%2C%22ru%22%3A%22http%3A%2F%2Fourex.lib.sjtu.edu.cn%2Fprimo_library%2Flibweb%2Faction%2Fdisplay.do%3Bjsessionid%3D73028D8B75DB2FF259A0E736836BAA07%3Ftabs%3DdetailsTab%26ct%3Ddisplay%26fn%3Dsearch%26doc%3Dsjtulibxw000061822%26indx%3D1%26recIds%3Dsjtulibxw000061822%26recIdxs%3D0%26elementId%3D0%26renderMode%3DpoppedOut%26displayMode%3Dfull%26frbrVersion%3D%26dscnt%3D0%26scp.scps%3Dscope%253A%2528SJT%2529%252Cscope%253A%2528sjtu_metadata%2529%252Cscope%253A%2528sjtu_sfx%2529%252Cscope%253A%2528sjtulibzw%2529%252Cscope%253A%2528sjtulibxw%2529%252CDuxiuBook%26tab%3Ddefault_tab%26dstmp%3D1472033627266%26vl(freeText0)%3Dproquest%26vid%3Dchinese%22%2C%22r%22%3A%22ourex.lib.sjtu.edu.cn%22%2C%22st%22%3A%22%22%2C%22to%22%3A5%2C%22pv%22%3A34%2C%22lc%22%3A%7B%22d0%22%3A%7B%22v%22%3A34%2C%22s%22%3Atrue%7D%7D%2C%22cd%22%3A0%2C%22f%22%3A1472090225890%2C%22pn%22%3A0%2C%22sd%22%3A0%7D; _ga=GA1.3.1201070524.1446763952'} 
req = session.post(monthurl,data = payload,headers = headers) 
main = BeautifulSoup(req.text,"html.parser").decode('utf-8') 
print(main) 

Result sample: ['/publication.publicationissuebrowse:openissue/issueName/02016Y08Y25$23Aug+25,+ 2016?site=news&amp;t;:ac=publications_105983'] (this is a list; I show only one element for convenience).

This is not the real URL. The real one is: /publication.publicationissuebrowse:openissue/issueName/02016Y08Y25$23Aug+25,+ 2016?site=news&t:ac=publications_105983. The scraped version has an extra "amp;" after the "&" and an extra ";" after the "t".

So there are really two questions here. Why does this happen, and how do I fix it? Can I replace specific characters in the elements of a list?

Answers

What you get back is evidently meant to be embedded in a web page. In HTML, a literal & is written in escaped form as &amp;. The two are equivalent, but you have to unescape the string first. https://wiki.python.org/moin/EscapingHtml has a function that does this:

def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s

As for the ";": either the website deals with it in its JavaScript, or both URLs work just fine. It is not a mistake in your code. Check the website's scripts carefully.
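For the last part of the question (replacing characters in every element of a list), a minimal sketch follows. It uses `html.unescape` from the Python 3 standard library (3.4 and later) in place of the hand-written function, then removes the stray ";" with `str.replace` inside a list comprehension. The helper name `clean_url` and the variable `scraped` are made up for illustration, and the sketch assumes the only unwanted ";" is the one in "&t;:ac=", as in the sample from the question.

```python
from html import unescape


def clean_url(raw: str) -> str:
    """Undo HTML escaping, then drop the stray ';' after the 't' parameter."""
    url = unescape(raw)  # turns "&amp;" back into "&"
    # Assumption: the only unwanted ';' is the one inside "&t;:ac=".
    return url.replace("&t;:ac=", "&t:ac=")


# Hypothetical input: the one-element list shown in the question.
scraped = [
    "/publication.publicationissuebrowse:openissue/issueName/"
    "02016Y08Y25$23Aug+25,+ 2016?site=news&amp;t;:ac=publications_105983"
]

# Apply the same clean-up to every element of the list.
cleaned = [clean_url(u) for u in scraped]
print(cleaned[0])
```

Strings are immutable in Python, so the list comprehension builds a new list of cleaned URLs and leaves `scraped` unchanged.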

Thank you, @viraptor, and sorry for the late reply. Your solution works well.