2017-12-11

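Python web scraping does not resolve some hyperlinks

When I scrape some web pages, I don't get the same source that I see when I inspect the page in a browser. Hyperlinks that are real hyperlinks in the browser's source appear as {url} instead. Here is example code for an example page.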

import requests
from bs4 import BeautifulSoup as bs

page = requests.get("https://www.mckinsey.com/search?q=iot")
soup = bs(page.content, 'html.parser')
# Select the search-result blocks whose links should be inspected
soup.find_all('div', {'class': 'item title-link'})

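If I inspect the element selected by the last line in a browser, it contains a full URL. If I inspect it in the version fetched with requests, it just says {url}, and the resulting soup object ends up empty.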
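
One way to confirm that the fetched HTML is an unrendered template is to look for links whose `href` still holds a placeholder. The following is only a minimal sketch using the standard library: the `{name}` placeholder pattern, the helper names, and the sample markup are assumptions for illustration, not the site's actual source.

```python
import re
from html.parser import HTMLParser

# Assumption: unfilled template fields look like {url} or {title}.
PLACEHOLDER = re.compile(r'\{\w+\}')


class PlaceholderLinkFinder(HTMLParser):
    """Collect href values that still contain an unfilled {name} placeholder."""

    def __init__(self):
        super().__init__()
        self.placeholders = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and PLACEHOLDER.search(value):
                self.placeholders.append(value)


def find_placeholder_links(html_text):
    """Return every <a href> value in html_text that contains a placeholder."""
    finder = PlaceholderLinkFinder()
    finder.feed(html_text)
    return finder.placeholders


# Hypothetical markup imitating what the question describes.
sample = (
    '<div class="item title-link"><a href="{url}">{title}</a></div>'
    '<a href="https://www.mckinsey.com/">Home</a>'
)
print(find_placeholder_links(sample))
```

If this returns any links for `page.text`, the real URLs are most likely filled in by JavaScript after the page loads, so they are not present in the response that requests receives.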

You may need to use 'ghost.py' for this, because the URL is being generated by JavaScript. – Dark

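Answers

This site uses JavaScript to fetch data from the server and insert it into the page.

Using DevTools in Chrome/Firefox, you can see that the JavaScript sends a request with its parameters as JSON and receives all of the data back as JSON too. Once you have that response, you have everything as a Python dictionary.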

import requests

# JSON payload that the page's JavaScript sends to the search API (seen in DevTools)
params = {
    'q': 'iot',
    'page': '1',
    'app': '',
    'sort': 'default',
    'ignoreSpellSuggestion': 'false',
}

url = 'https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search'

# Fetch the first two pages of results
for page in range(1, 3):
    params['page'] = str(page)

    r = requests.post(url, json=params)
    data = r.json()

    print()
    print("data['data'].keys():\n", data['data'].keys())
    print()
    print(' currentPage:', data['data']['currentPage'])
    print(' totalPages:', data['data']['totalPages'])
    print('totalResults:', data['data']['totalResults'])
    print()

    print("data['data']['results'][0].keys():\n", data['data']['results'][0].keys())
    print()

    # Each result is a dictionary; the resolved link is in item['url']
    for item in data['data']['results']:
        print(item['title'])
        print(item['url'])
        print('---')

Result:

data['data'].keys(): 
dict_keys(['totalResults', 'totalPages', 'currentPage', 'recommendations', 'results']) 

currentPage: 1 
    totalPages: 17 
totalResults: 166 

data['data']['results'][0].keys(): 
dict_keys(['title', 'subtitle', 'imageurl', 'dek', 'tag', 'mimetype', 'url']) 

Taking the pulse of enterprise <b>IoT</b> 
https://www.mckinsey.com/global-themes/internet-of-things/our-insights/taking-the-pulse-of-enterprise-iot 
--- 
An executive&#39;s guide to the Internet of Things 
https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/an-executives-guide-to-the-internet-of-things 
--- 
Internet of Things | Internet of Things 
https://www.mckinsey.com/global-themes/internet-of-things/how-we-help-clients 
--- 
Unlocking the potential of the Internet of Things 
https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/the-internet-of-things-the-value-of-digitizing-the-physical-world 
--- 
Internet of Things 
https://www.mckinsey.com/global-themes/internet-of-things/our-insights 
--- 
Six ways CEOs can promote cybersecurity in the <b>IoT</b> age 
https://www.mckinsey.com/global-themes/internet-of-things/our-insights/six-ways-ceos-can-promote-cybersecurity-in-the-iot-age 
--- 
What&#39;s new with the Internet of Things? 
https://www.mckinsey.com/industries/semiconductors/our-insights/whats-new-with-the-internet-of-things 
--- 
How can we recognize the real power of the Internet of Things? 
https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/how-can-we-recognize-the-real-power-of-the-internet-of-things 
--- 
Making sense of Internet of Things platforms 
https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/making-sense-of-internet-of-things-platforms 
--- 
Partnerships, scale, and speed: The hallmarks of a successful <b>IoT</b> strategy 
https://www.mckinsey.com/industries/financial-services/our-insights/partnerships-scale-and-speed 
--- 

data['data'].keys(): 
dict_keys(['totalResults', 'totalPages', 'currentPage', 'recommendations', 'results']) 

currentPage: 2 
    totalPages: 17 
totalResults: 166 

data['data']['results'][0].keys(): 
dict_keys(['title', 'subtitle', 'imageurl', 'dek', 'tag', 'mimetype', 'url']) 

THE INTERNET OF THINGS: MAPPING THE VALUE BEYOND THE HYPE 
https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/the%20internet%20of%20things%20the%20value%20of%20digitizing%20the%20physical%20world/unlocking_the_potential_of_the_internet_of_things_executive_summary.ashx 
--- 
The future of connectivity: Enabling the Internet of Things 
https://www.mckinsey.com/global-themes/internet-of-things/our-insights/the-future-of-connectivity-enabling-the-internet-of-things 
--- 
THE INTERNET OF THINGS: MAPPING THE VALUE BEYOND THE HYPE 
https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/the%20internet%20of%20things%20the%20value%20of%20digitizing%20the%20physical%20world/the-internet-of-things-mapping-the-value-beyond-the-hype.ashx 
--- 
Insurers need to plug into the Internet of Things – or risk falling behind 
https://www.mckinsey.com/~/media/mckinsey/industries/financial%20services/our%20insights/european%20insurance%20practice%20report%20on%20internet%20of%20things/mckinsey%20-%20insurers%20need%20to%20plug%20into%20the%20internet%20of%20things%20or%20risk%20falling%20behind.ashx 
--- 
Security in the Internet of Things 
https://www.mckinsey.com/industries/semiconductors/our-insights/security-in-the-internet-of-things 
--- 
Semiconductors 
https://www.mckinsey.com/~/media/mckinsey/industries/semiconductors/our%20insights/mckinsey%20on%20semiconductors%20issue%206%20-%20spring%202017/mck%20on%20semiconductors_issue%206_2017.ashx 
--- 
Internet of Things: Opportunities and challenges for semiconductor companies 
https://www.mckinsey.com/industries/semiconductors/our-insights/internet-of-things-opportunities-and-challenges-for-semiconductor-companies 
--- 
THE INTERNET OF THINGS: MAPPING THE VALUE BEYOND THE HYPE 
https://www.mckinsey.com/~/media/mckinsey/business%20functions/mckinsey%20digital/our%20insights/the%20internet%20of%20things%20the%20value%20of%20digitizing%20the%20physical%20world/unlocking_the_potential_of_the_internet_of_things_full_report.ashx 
--- 
A new Internet of Things platform and business | Digital McKinsey 
https://www.mckinsey.com/business-functions/digital-mckinsey/how-we-help-clients/a-new-internet-of-things-platform-and-business 
--- 
Video meets the Internet of Things 
https://www.mckinsey.com/industries/high-tech/our-insights/video-meets-the-internet-of-things 
--- 
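
Some titles in this output still carry markup from the API, such as `<b>IoT</b>` and the entity `&#39;`. If plain text is wanted, a small helper along these lines could tidy them up. This is a sketch only: `clean_title` is a hypothetical name, and the regular expression assumes the titles contain nothing more complex than simple inline tags.

```python
import html
import re

# Assumption: titles only contain simple inline tags such as <b>...</b>.
TAG = re.compile(r'<[^>]+>')


def clean_title(raw_title):
    """Strip inline tags, then decode HTML entities such as &#39;."""
    without_tags = TAG.sub('', raw_title)
    return html.unescape(without_tags)


print(clean_title('Taking the pulse of enterprise <b>IoT</b>'))
print(clean_title('An executive&#39;s guide to the Internet of Things'))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in a title is not mistaken for the start of a tag.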

This is great! I'm having trouble finding the SearchAPI URL that you found in DevTools, though. – textnet

Load the original page, go to DevTools -> Network -> XHR, and reload the page. – furas

Thanks. The example above gets the first 10 of the 166 results. How do I get the rest? – textnet
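
One possible way to get the remaining results is to keep requesting pages until the `totalPages` value from the response is reached. The sketch below assumes the API still accepts the same payload and returns the same keys as in the answer. The names `post_search`, `fetch_all_results`, `fetch_page`, and `max_pages` are hypothetical and introduced only for illustration.

```python
URL = 'https://www.mckinsey.com/services/ContentAPI/SearchAPI.svc/search'


def post_search(payload):
    """Send one search request and return the decoded JSON response."""
    # Imported here so the paging logic can be exercised without network access.
    import requests

    r = requests.post(URL, json=payload)
    r.raise_for_status()
    return r.json()


def fetch_all_results(query, fetch_page=post_search, max_pages=100):
    """Collect the results from every page the API reports for a query."""
    payload = {
        'q': query,
        'page': '1',
        'app': '',
        'sort': 'default',
        'ignoreSpellSuggestion': 'false',
    }
    results = []
    page = 1
    # max_pages is a safety limit in case totalPages is missing or wrong.
    while page <= max_pages:
        payload['page'] = str(page)
        data = fetch_page(dict(payload))['data']
        results.extend(data['results'])
        # Stop after the last reported page, or if a page comes back empty.
        if page >= int(data['totalPages']) or not data['results']:
            break
        page += 1
    return results
```

Calling `fetch_all_results('iot')` would then send one request per page (17 for the query above, if the counts are unchanged) and return a single list of result dictionaries, each with the `title` and `url` keys shown in the output.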