WebCrawler, only some items have a discounted price - IndexError

I'm new to programming and am trying to build my first small web crawler in Python.

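Goal: crawl a product listing page, scrape the brand name, article name, original price, and new price, and save them to a CSV file.

Status: I have managed to get the brand name, article name, and original price and put them into lists in the correct order (e.g. for 10 products). Since every item has a brand name, description, and price, my code writes them to the CSV in the correct order.

Code: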

    import bs4
    from urllib.request import urlopen as uReq 
    from bs4 import BeautifulSoup as soup 

    myUrl = 'https://www.zalando.de/rucksaecke-herren/' 

    #open connection, grabbing page, saving in page_html and closing connection 
    uClient = uReq(myUrl) 
    page_html = uClient.read() 
    uClient.close() 

    #Datatype, html parser
    page_soup = soup(page_html, "html.parser") 

    #grabbing information 
    brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"}) 
    articale_Names = page_soup.findAll ("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"}) 
    original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"}) 
    new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"}) 

    #opening a csv file and printing its header 
    filename = "XXX.csv" 
    file = open(filename, "w") 
    headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n" 
    file.write(headers) 

    #How many brands on page? 
    products_on_page = len(brand_Names) 

    #Looping through all brands, articles, prices and writing the text into the CSV
    for i in range(products_on_page):
      brand = brand_Names[i].text
      articale_Name = articale_Names[i].text
      price = original_Prices[i].text
      new_Price = new_Prices[i].text
      file.write(brand + "," + articale_Name + "," + price.replace(",",".") + "," + new_Price.replace(",",".") +"\n")

    #closing CSV 
    file.close()

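Problem: I'm struggling to get the discounted prices into my CSV in the right place. Not every item has a discount, and I currently see two problems with my code:

- I use .findAll to look for the information on the website. Since there are fewer discounted products than total products, my new_Prices list contains fewer prices (e.g. 3 prices for 10 products). If I could add them to the list, I assume they would show up in the first 3 rows. How can I make sure the new_Prices are added to the right products?
- I get the error "IndexError: list index out of range". I assume it happens because I'm looping through 10 products, but new_Prices reaches its end sooner than my other lists. Does that make sense? Is my assumption correct?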

Any help is much appreciated.

Thanks,

Thorsten

Source

2017-11-05 Thorstein Torento

Please don't post screenshots of your code; copy the relevant code into a code block instead. – bgse

Please post an example input as well. – Guilherme

@bgse Updated with the code as a block. –

Some items don't have a 'div.z-nvg-cognac_promotionalPrice-3GRE7' tag, so you can't reliably use list indexes.
However, you can select all the container tags ('div.z-nvg-cognac_infoContainer-MvytX') and use find to select the tags within each item.
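To see the idea in isolation, here is a minimal sketch that runs on a small inline HTML snippet instead of the live page. The markup and the class names (item, brand, price, promo) are made up for illustration and are not Zalando's; only the second of the three products carries a discounted price.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three products, only product B is discounted.
html = """
<div class="item"><div class="brand">A</div><div class="price">50</div></div>
<div class="item"><div class="brand">B</div><div class="price">30</div><div class="promo">25</div></div>
<div class="item"><div class="brand">C</div><div class="price">20</div></div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# A flat search finds one promo tag for three products, so indexing
# it alongside the other lists would fail or shift prices to wrong rows.
flat_promos = page_soup.find_all(class_="promo")

rows = []
for item in page_soup.find_all(class_="item"):
    # Searching inside the container keeps each price with its product.
    promo = item.find(class_="promo")  # None when there is no discount
    rows.append([
        item.find(class_="brand").text,
        item.find(class_="price").text,
        promo.text if promo is not None else "",
    ])

print(len(flat_promos))
print(rows)
```

Because every lookup happens inside one container, the discounted price stays on the row of product B and the other two rows simply get an empty value.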

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import csv

my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
# download the page; the with statement closes the connection afterwards
with urlopen(my_url) as client:
    page_html = client.read().decode(errors='ignore')
page_soup = soup(page_html, "html.parser")

headers = ["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"]
filename = "test.csv"
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)

    # one container per product, so all fields of a product stay together
    items = page_soup.find_all(class_='z-nvg-cognac_infoContainer-MvytX')
    for item in items:
        brand_names = item.find(class_="z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn").text
        articale_names = item.find(class_="z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn").text
        original_prices = item.find(class_="z-nvg-cognac_originalPrice-2Oy4G").text
        new_prices = item.find(class_="z-nvg-cognac_promotionalPrice-3GRE7")
        # items without a discount have no promotional price tag
        if new_prices is not None:
            new_prices = new_prices.text
        writer.writerow([brand_names, articale_names, original_prices, new_prices])

If you want to get more than 24 items per page, you need to use a client that runs JS, such as selenium.

from selenium import webdriver 
from bs4 import BeautifulSoup as soup 
import csv 

my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/' 
driver = webdriver.Firefox() 
driver.get(my_url) 
page_html = driver.page_source 
driver.quit() 
page_soup = soup(page_html, "html.parser") 
...

Notes:
The naming convention for functions and variables is lowercase with underscores.
When reading or writing csv files, it's best to use the csv lib.
When handling files, you can use the with statement.
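As a small illustration of the csv note, the sketch below writes two made-up rows to an in-memory buffer (standing in for a file opened with newline='') and reads them back. The sample brands and prices are invented. It shows that the csv module quotes fields containing a comma, so prices written with a decimal comma do not need to be rewritten with replace(",", "."), and that None is written as an empty field.

```python
import csv
import io

# Made-up sample rows; the second product has no discounted price.
rows = [
    ["Nike", "Backpack", "49,95 €", "39,95 €"],
    ["Adidas", "Gym bag", "29,95 €", None],
]

buffer = io.StringIO()  # stands in for open(filename, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed)
```

The missing discount comes back as an empty string, which keeps every row at four columns.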

Source

2017-11-09 20:50:56

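Hi @t.m.adam, thanks a lot for your comments and suggestions! I got there eventually, but your code looks a lot cleaner! One thing I noticed: there are more than 24 items on the page, so the page may have changed. Strangely, when I run the crawler it only picks up 24 items. Why is that? –

Yes, the rest of the items are loaded by js. You can test this by disabling js in your browser and visiting the page. You can get all the items with 'selenium', or sometimes through an ajax API. I'll post an example when I have some free time. –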

Hi @t.m.adam, great! Thanks! Out of interest, is there a reason to set a page up that way (load 24 items, then load the rest via JS)? Thanks, T –
