Scrapy: filtering duplicates

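I'm using Scrapy to extract URLs from web pages. I'm currently trying to scrape "snipplr.com/all/page" and extract the URLs on each page. The next time I run the spider, I filter the extracted URLs by reading a CSV file of the URLs that were already extracted. That was the plan, but somehow I get an error that overwrites the results.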

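Process: crawl the web page for links > check the CSV file to see whether the link was already extracted in the past > if it is already there, IgnoreRequest/dropItem; otherwise add it to the CSV file.

Spider code: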

import scrapy
import csv

from scrapycrawler.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["snipplr.com"]

    def start_requests(self):
        #for i in xrange(1000):
        for i in range(2, 5):
            yield self.make_requests_from_url("http://www.snipplr.com/all/page/%d" % i)

    def parse(self, response):
        for sel in response.xpath('//ol/li/h3'):
            item = DmozItem()
            #item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a[last()]/@href').extract()
            #item['desc'] = sel.xpath('text()').extract()

            reader = csv.reader(open('items.csv', 'w+')) # think of it as a list
            for row in reader:
                if item['link'] == row:
                    raise IgnoreRequest()

                else:
                    f = open('items.csv', 'w')
                    f.write(item['link'])
            yield item

However, I get strange results like the ones below, with each run overwriting the previous one. The next time I crawl a different page, I want the results appended to the file, not overwritten:

 clock/ 
/view/81327/chatting-swing-gui-tcp/ 
/view/82731/automate-system-setup/ 
/view/81215/rmi-factorial/ 
/view/81214/tcp-addition/ 
/view/81213/hex-octal-binary-calculator/ 
/view/81188/abstract-class-book-novel-magazine/ 
/view/81187/data-appending-to-file/ 
/view/81186/bouncing-ball-multithreading/ 
/view/81185/stringtokenizer/ 
/view/81184/prime-and-divisible-by-3/ 
/view/81183/packaging/ 
/view/81182/font-controller/ 
/view/81181/multithreaded-server-and-client/ 
/view/81180/simple-calculator/ 
/view/81179/inner-class-program/ 
/view/81114/cvv-dumps-paypals-egift-cards-tracks-wu-transfer-banklogins-/ 
/view/81038/magento-social-login/ 
/view/81037/faq-page-magento-extension/ 
/view/81036/slider-revolution-responsive-magento-extension/ 
/view/81025/bugfix-globalization/

There may be errors in the code, so please feel free to correct it as needed. Thanks for the help.

Edit: typo

Source

2015-01-04 CharlieC

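You are actually doing this in the wrong place: outputting crawled data should be done in an Item Pipeline.

It would be better to use a regular database and filter duplicates with database constraints, but if you want to work with CSV files anyway, create a pipeline that first reads the existing contents and remembers them for later checks. For every item that comes from the spider, the pipeline checks whether it has been seen before and writes it only if it has not. To turn the pipeline on, add it to ITEM_PIPELINES: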

ITEM_PIPELINES = { 
    'myproject.pipelines.CsvWriterPipeline': 300 
}

The pipeline itself:

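import os

from scrapy.exceptions import DropItem


class CsvWriterPipeline(object):
    def __init__(self):
        # Load links saved by earlier runs; strip newlines so lookups match.
        self.seen = set()
        if os.path.exists('items.csv'):
            with open('items.csv', 'r') as f:
                self.seen = set(row.strip() for row in f)

        # Append mode keeps existing rows instead of truncating the file.
        self.file = open('items.csv', 'a')

    def process_item(self, item, spider):
        link = item['link']
        # .extract() returns a list of strings; join it into one hashable key.
        if isinstance(link, list):
            link = ''.join(link)

        if link in self.seen:
            raise DropItem('Duplicate link found %s' % link)

        # Write one link per line so the next run can read it back.
        self.file.write(link + '\n')
        self.seen.add(link)

        return item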

And your parse() callback would become:

def parse(self, response):
    for sel in response.xpath('//ol/li/h3'):
        item = DmozItem()
        item['link'] = sel.xpath('a[last()]/@href').extract()

        yield item
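The database approach mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, and the table name `links` and the helpers `open_store` and `save_link` are hypothetical names, not part of Scrapy or of the original answer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a database whose `links` table rejects duplicate URLs."""
    conn = sqlite3.connect(path)
    # PRIMARY KEY makes the database itself enforce uniqueness of url.
    conn.execute("CREATE TABLE IF NOT EXISTS links (url TEXT PRIMARY KEY)")
    return conn


def save_link(conn, url):
    """Insert url; return True if it was new, False if it was already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        # The constraint fired: this URL was saved by an earlier call or run.
        return False


conn = open_store()  # in-memory here; pass a file path to persist between runs
print(save_link(conn, "/view/81327/chatting-swing-gui-tcp/"))  # True
print(save_link(conn, "/view/81327/chatting-swing-gui-tcp/"))  # False
```

Inside a pipeline, process_item could raise DropItem whenever `save_link` returns False, which would take the place of the in-memory set and the CSV file.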

Source

2015-01-04 06:14:28 alecxe

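This works well. What is the difference between a pipeline and a middleware? It appears to me that both could perform the same function with the same code? – CharlieC

@CharlieC Pipelines post-process the crawled items returned from the spider; they are used, for example, to store items in a database. Middlewares work at the request/response level. Hope that clears things up a bit. – alecxe

Ah, I see. Thanks for the info. – CharlieC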
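To make the distinction in the comments above concrete, here is a rough sketch of the two hooks side by side. The class names `SkipSeenUrlsMiddleware` and `DropSeenLinksPipeline` and the `seen_urls` argument are hypothetical; the hook names process_request and process_item and the exceptions IgnoreRequest and DropItem are Scrapy's. The fallback exception classes exist only so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem, IgnoreRequest
except ImportError:
    # Stand-ins so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass

    class IgnoreRequest(Exception):
        pass


class SkipSeenUrlsMiddleware(object):
    """Downloader middleware: sees each request before it is downloaded."""

    def __init__(self, seen_urls=()):
        # seen_urls is an illustrative argument, e.g. URLs loaded from a file.
        self.seen_urls = set(seen_urls)

    def process_request(self, request, spider):
        if request.url in self.seen_urls:
            raise IgnoreRequest("already fetched: %s" % request.url)
        return None  # None lets Scrapy continue handling the request


class DropSeenLinksPipeline(object):
    """Item pipeline: sees each item after the spider has yielded it."""

    def __init__(self):
        self.seen_links = set()

    def process_item(self, item, spider):
        if item["link"] in self.seen_links:
            raise DropItem("duplicate link: %s" % item["link"])
        self.seen_links.add(item["link"])
        return item
```

The middleware decides whether a page is fetched at all, while the pipeline only decides what happens to data that has already been scraped, so filtering extracted links fits the pipeline.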

You are opening the file write-only, which starts it from the beginning each time. To append to the file, you need to use 'a' or 'a+'.

Replace

f = open('items.csv', 'w')

with

f = open('items.csv', 'a')

Based on the BSD Library Functions Manual for fopen:

The argument mode points to a string beginning with one of the following 
sequences (Additional characters may follow these sequences.): 

``r'' Open text file for reading. The stream is positioned at the 
     beginning of the file. 

``r+'' Open for reading and writing. The stream is positioned at the 
     beginning of the file. 

``w'' Truncate file to zero length or create text file for writing. 
     The stream is positioned at the beginning of the file. 

``w+'' Open for reading and writing. The file is created if it does not 
     exist, otherwise it is truncated. The stream is positioned at 
     the beginning of the file. 

``a'' Open for writing. The file is created if it does not exist. The 
     stream is positioned at the end of the file. Subsequent writes 
     to the file will always end up at the then current end of file, 
     irrespective of any intervening fseek(3) or similar. 

``a+'' Open for reading and writing. The file is created if it does not
     exist. The stream is positioned at the end of the file. Subsequent
     writes to the file will always end up at the then current end of
     file, irrespective of any intervening fseek(3) or similar.
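The difference between the modes quoted above can be checked with a small experiment. This is only a sketch: the temporary file and the sample links are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file standing in for items.csv.
path = os.path.join(tempfile.mkdtemp(), "items.csv")

# 'w' truncates on every open, so only the last write survives.
for link in ["/view/1/", "/view/2/"]:
    with open(path, "w") as f:
        f.write(link + "\n")
with open(path) as f:
    after_w = f.read().splitlines()

# 'a' positions every write at the end of the file, so earlier rows are kept.
for link in ["/view/3/", "/view/4/"]:
    with open(path, "a") as f:
        f.write(link + "\n")
with open(path) as f:
    after_a = f.read().splitlines()

print(after_w)  # ['/view/2/']
print(after_a)  # ['/view/2/', '/view/3/', '/view/4/']
```

This matches the symptom in the question: reopening the file with 'w' for each item leaves only the most recent link behind.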

Ah, I see what I can do now. Thanks to you and Alecxa, that helps a lot.

Source

2015-01-04 13:20:43

I see, thanks for the info. – CharlieC
