
Scrapy JSON export problem

I have been following some online tutorials on using Scrapy to scrape Craigslist for emails. I have the code below, but when I run the crawl command and export to a JSON file, the file gets created, yet the only thing in it is a single '['.

Any help would be appreciated. My code is below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy_demo.items import ScrapyDemoItem
import urlparse
from scrapy.http.request import Request

class ScrapyDemoSpider(BaseSpider):
    name = "scrapy_demo"
    allowed_domains = ["buffalo.craigslist.org"]
    start_urls = ['http://buffalo.craigslist.org/search/cps/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('//....')
        links = []

        #scrap listings page to get listing links
        for listing in listings:
            link = listing.select('..../@href').extract()[0]
            links.append(link)

        #parse listing url to get content of the listing page
        for link in links:
            item = ScrapyDemoItem()
            item['link'] = link
            yield Request(urlparse.urljoin(response.url, link), meta={'item': item}, callback=self.parse_listing_page)

            #get next button link
            next_page = hxs.select("//..../@href").extract()[0]
            if next_page:
                yield Request(urlparse.urljoin(response.url, next_page), self.parse)

    #scrap listing page to get content
    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        item['title'] = hxs.select('//..../text()').extract()[0]
        item['content'] = hxs.select('//..../text()').extract()[0]
        yield item

Answer


There are several problems here.

The main problem is the invalid expressions inside the select() calls: the '//....' placeholders never select anything, so no items are yielded, and the JSON feed exporter gets no further than writing the opening '['. Aside from that:

  • use response.xpath() or response.css(); instantiating HtmlXPathSelector is no longer necessary (a short contrast follows below)
  • there is no need to create the Item instance in the parse() callback and pass it through meta; parse_listing_page() can take the URL directly from response.url
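
A minimal sketch of that first point, using an illustrative //title XPath and a small fake response (neither is taken from the spider itself):

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# a tiny in-memory response, purely for demonstration
response = HtmlResponse(url='http://example.com',
                        body='<html><title>demo</title></html>',
                        encoding='utf-8')

# old, deprecated style: wrap the response in a selector first
hxs = HtmlXPathSelector(response)
print hxs.select('//title/text()').extract()      # [u'demo']

# current style: the response object exposes .xpath() and .css() directly
print response.xpath('//title/text()').extract()  # [u'demo']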

Here is an improved, working version of the code:

import urlparse

from scrapy.spider import BaseSpider
from scrapy.http.request import Request

from scrapy_demo.items import ScrapyDemoItem


class ScrapyDemoSpider(BaseSpider):
    name = "scrapy_demo"
    allowed_domains = ["buffalo.craigslist.org"]
    start_urls = ['http://buffalo.craigslist.org/search/cps/']

    def parse(self, response):
        # processing listings
        for listing in response.css('p.row > a[data-id]'):
            link = listing.xpath('@href').extract()[0]
            yield Request(urlparse.urljoin(response.url, link), callback=self.parse_listing_page)

        # following next page
        next_page = response.xpath('//a[contains(@class, "next")]/@href').extract()
        print next_page
        if next_page:
            yield Request(urlparse.urljoin(response.url, next_page[0]), callback=self.parse)

    def parse_listing_page(self, response):
        # collect the listing URL, title and posting body from the listing page
        item = ScrapyDemoItem()
        item['link'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0].strip()
        item['content'] = response.xpath('//section[@id="postingbody"]/text()').extract()[0].strip()
        yield item

If you run the spider, you will see something like this in the output JSON file:

[ 
    {"content": "Using a web cam with your computer to video communicate with your loved ones has never been made easier and it's free (providing you have an Internet connection). With the click of a few buttons, you are sharing your live video and audio with the person you are communicating with. It's that simple. When you are seeing and hearing your grand kids live across the country or halfway around the world, web camming is the next best thing to being there!", "link": "http://buffalo.craigslist.org/cps/4784390462.html", "title": "Web Cam With Your Computer With Family And Friends"}, 
    {"content": "Looking to supplement or increase your earnings?", "link": "http://buffalo.craigslist.org/cps/4782757517.html", "title": "1k in 30 Day's"}, 
    {"content": "Like us on Facebook: https://www.facebook.com/pages/NFB-Systems/514380315268768", "link": "http://buffalo.craigslist.org/cps/4813039886.html", "title": "NFB SYSTEMS COMPUTER SERVICES + WEB DESIGNING"}, 
    {"content": "Like us on Facebook: https://www.facebook.com/pages/NFB-Systems/514380315268768", "link": "http://buffalo.craigslist.org/cps/4810219714.html", "title": "NFB Systems Computer Repair + Web Designing"}, 
    {"content": "I can work with you personally and we design your site together (no outsourcing or anything like that!) I'll even train you how to use your brand new site. (Wordpress is really easy to use once it is setup!)", "link": "http://buffalo.craigslist.org/cps/4792628034.html", "title": "I Make First-Class Wordpress Sites with Training"}, 
    ... 
] 
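
For completeness: the ScrapyDemoItem imported from scrapy_demo.items has to declare the three fields the spider fills in. The original items.py was not posted, but it would presumably look roughly like this:

# scrapy_demo/items.py -- assumed definition; only the three fields used by the spider
from scrapy.item import Item, Field

class ScrapyDemoItem(Item):
    link = Field()
    title = Field()
    content = Field()

The JSON above would then come from a feed-export run such as scrapy crawl scrapy_demo -o output.json (the output file name here is just an example).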

Perfect, thank you so much. Works like a charm. – keithp