How to scrape a web page with the Scrapy framework?

I am a beginner at web scraping. I have started learning the Scrapy framework.

I have gone through Scrapy's basic tutorial. Now I am trying to scrape this page. According to this tutorial, to get the whole HTML page one has to write the following code:

import scrapy


class ClothesSpider(scrapy.Spider):
    name = "clothes"

    start_urls = [
        'https://www.chumbak.com/women-apparel/GY1/c/',
    ]

    def parse(self, response):
        # save the raw response body exactly as the server sent it
        filename = 'clothes.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

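This code runs fine, but I am not getting the expected result. When I open clothes.html, the HTML is not the same as what I see when I inspect the page in the browser. Many things are missing from clothes.html.

I don't understand what is wrong here. Please help me move forward. Any help would be appreciated.

Thanks.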

Source

2017-12-18 Amit

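The browser's inspection tool does not show you the HTML. It shows the DOM as it exists at that moment, and the page has probably been modified by JavaScript. If you use View Source (Ctrl+U in Firefox or Chrome), it will look the same as what Scrapy gets. – Thomas

Many pages today are dynamic and render themselves in the browser. Consider using a headless browser. – AndreyF

@Thomas, thanks for the help. Is there a way to get the result as modified by JS with Scrapy? – Amit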

This page uses JavaScript to put data on the page. Using DevTools in Chrome/Firefox (tab Network, filter XHR) you can see the URLs that JavaScript uses to get this data from the server, and you can try to get the data from those URLs too.
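Once such a URL is found, its response is plain JSON, so no HTML parsing is needed. As a rough sketch of that step, using only the standard library: the key names `products` and `image_url` and the image base URL are the ones used in the spider in this answer, while the sample payload and the helper name `extract_image_urls` are invented for illustration and may not match the real response exactly.

```python
import json

# Image base URL as used in the spider in this answer (assumed still valid).
IMAGE_BASE = 'https://media.chumbak.com/media/catalog/product/small_image/260x455'


def extract_image_urls(json_text, base=IMAGE_BASE):
    """Return the full image URL of every product in one JSON page."""
    data = json.loads(json_text)
    # .get() with a default avoids a KeyError on a page without products
    return [base + product['image_url'] for product in data.get('products', [])]


# Invented sample payload; the real response contains many more fields.
sample = json.dumps({
    'products': [
        {'title': 'Example dress', 'image_url': '/a/b/example-dress.jpg'},
        {'title': 'Example top', 'image_url': '/c/d/example-top.jpg'},
    ]
})

for url in extract_image_urls(sample):
    print(url)
```

Keeping the JSON handling in a plain function like this makes it easy to check against a saved response before wiring it into a spider.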

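The code below generates the URLs of 10 pages of JSON data, downloads each one and saves it in a separate file, then builds the full URLs of the images and downloads them into the subfolder full. Scrapy also saves the yielded data about the downloaded images in output.json.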

#!/usr/bin/env python3

import scrapy
#from scrapy.commands.view import open_in_browser
import json

class MySpider(scrapy.Spider):

    name = 'myspider'

    #allowed_domains = []

    #start_urls = ['https://www.chumbak.com/women-apparel/GY1/c/']

    #start_urls = [
    #    'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=1',
    #    'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=2',
    #    'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page=3',
    #]

    def start_requests(self):
        # request the first 10 pages of the JSON API
        pages = 10
        url_template = 'https://api-cdn.chumbak.com/v1/category/474/products/?count_per_page=24&page={}'

        for page in range(1, pages+1):
            url = url_template.format(page)
            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)

        #open_in_browser(response)

        # get page number: the text after the last '=' in the URL
        # (split, not strip, so that page 10 gives '10' and not '0')
        page_number = response.url.split('=')[-1]

        # save JSON in a separate file for every page
        filename = 'page-{}.json'.format(page_number)
        with open(filename, 'wb') as f:
            f.write(response.body)

        # convert JSON into Python's dictionary
        data = json.loads(response.text)

        # get urls for images
        for product in data['products']:
            #print('title:', product['title'])
            #print('url:', product['url'])
            #print('image_url:', product['image_url'])

            # create full url to image
            image_url = 'https://media.chumbak.com/media/catalog/product/small_image/260x455' + product['image_url']
            # yield it as an item and the images pipeline will download it
            yield {'image_urls': [image_url]}

        # download files
        #for href in response.css('img::attr(href)').extract():
        #    url = response.urljoin(href)
        #    yield {'file_urls': [url]}

        # download images and convert to JPG
        #for src in response.css('img::attr(src)').extract():
        #    url = response.urljoin(src)
        #    yield {'image_urls': [url]}

# --- it runs without a project and saves the items in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in CSV, JSON or XML
    # (Scrapy 2.1+ replaces these two settings with the `FEEDS` setting)
    'FEED_FORMAT': 'json',  # 'csv', 'json', 'xml'
    'FEED_URI': 'output.json', # 'output.csv', 'output.json', 'output.xml'

    # download files to `FILES_STORE/full`
    # it needs `yield {'file_urls': [url]}` in `parse()`
    #'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    #'FILES_STORE': '/path/to/valid/dir',

    # download images to `IMAGES_STORE/full` and convert to JPG
    # it needs `yield {'image_urls': [url]}` in `parse()`
    # (ImagesPipeline requires the Pillow package)
    #'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    #'IMAGES_STORE': '/path/to/valid/dir',
    'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    'IMAGES_STORE': '.',
})
c.crawl(MySpider)
c.start()
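
The page number used for the file names is read from the end of the URL, which relies on `page` being the last query parameter. If that might not hold, a slightly more defensive sketch is to parse the query string with the standard library. The helper name `page_number` and its `default` argument are hypothetical, not part of Scrapy.

```python
from urllib.parse import urlparse, parse_qs


def page_number(url, default=1):
    """Read the `page` query parameter of a URL as an int."""
    query = parse_qs(urlparse(url).query)
    values = query.get('page')
    # parse_qs returns a list of values for each parameter name
    return int(values[0]) if values else default
```

Inside `parse()` this could be called as `page_number(response.url)` when building the file name.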

Source

2017-12-18 23:47:58 furas
