How to scrape anonymously using Scrapy, Tor, Privoxy & UserAgent? (Windows 10)

The answer to this question was very hard to find because the information is scattered and the titles of related questions are misleading. The answer below gathers all the necessary information in one place.

Source

2017-12-21 J. Does

This is the spider I came up with.

# based on https://doc.scrapy.org/en/latest/intro/tutorial.html

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            print('\n\nurl:', url)
            ## use one of the yields below

            # the middleware will process the request
            yield scrapy.Request(url=url, callback=self.parse)

            # check whether Tor has changed the IP
            #yield scrapy.Request('http://icanhazip.com/', callback=self.is_tor_and_privoxy_used)

    def parse(self, response):
        # save the page under a name derived from the page number in the URL
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('\n\nSpider: Start')
        print('Is proxy in response.meta?: ', response.meta)
        print("user_agent is: ", response.request.headers['User-Agent'])
        print('\n\nSpider: End')
        self.log('Saved file --- %s' % filename)

    def is_tor_and_privoxy_used(self, response):
        # icanhazip.com returns the caller's public IP as plain text
        print('\n\nSpider: Start')
        print("My IP is : " + response.text.strip())
        print("Is proxy in response.meta?: ", response.meta)  # no header available
        print('\n\nSpider: End')

You also need to add a few things to middleware.py and settings.py. If you don't know how to do that, this will help you
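As a rough illustration of what those additions might look like, here is a minimal sketch of a downloader middleware, not the exact code from the linked resource. It assumes Privoxy is listening on its default address 127.0.0.1:8118 and is already configured to forward traffic to Tor. The class name, the `USER_AGENTS` list, and the project name `myproject` are hypothetical, and a default Scrapy project names the file middlewares.py.

```python
import random

# Hypothetical values for illustration; adjust them to your own setup.
PRIVOXY_URL = "http://127.0.0.1:8118"  # assumes Privoxy's default listen address
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
]


class TorPrivoxyUserAgentMiddleware:
    """Route every request through Privoxy and attach a random User-Agent."""

    def process_request(self, request, spider):
        # Scrapy's downloader sends the request through the proxy in meta['proxy'].
        request.meta["proxy"] = PRIVOXY_URL
        # Pick a different User-Agent per request from the list above.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy keep processing the request


# In settings.py the middleware would then be enabled with something like
# (module path is an assumption about your project layout):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.TorPrivoxyUserAgentMiddleware": 543,
# }
```

With a middleware like this enabled, the spider's parse callback should print a `proxy` entry in `response.meta` and one of the listed User-Agent strings, which is a quick way to confirm the settings were picked up.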

Source

2017-12-21 16:15:19
