Scrapy pipeline SQL syntax error

I am pulling URLs from a MySQL database and using them as the start_urls to scrape, which in turn picks up new links from the scraped pages. When I set the pipeline to INSERT both the start_url and the newly scraped URL into a new DB, or set it to UPDATE the existing DB with the newly scraped URL using the start_url as the WHERE criterion, I get a SQL syntax error.

When I INSERT just one of the two values, I don't get an error. Here is pipeline.py:

import MySQLdb
import MySQLdb.cursors
import hashlib
from scrapy import log
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
from youtubephase2.items import Youtubephase2Item

class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:

            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['newurl'], item['start_url']))
            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s""",(item['newurl'], item['start_url']))
            self.cursor.execute("""INSERT INTO TestResults (NewURL, StartURL) VALUES (%s, %s)""",(item['newurl'], item['start_url']))
            self.conn.commit()


        except MySQLdb.Error, e:
            log.msg("Error %d: %s" % (e.args[0], e.args[1]))

        return item

The pipeline.py above shows the three self.cursor.execute statements I tried. Here is spider.py:

import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import Youtubephase2Item

class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT resultURL FROM SearchResults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                yield Request(row[0], self.parse, meta=dict(start_url=row[0]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['newurl'] = sel.xpath('@href').extract()
            item['start_url'] = response.meta['start_url']
            yield item

The top SQL execute statement returns this error:

2017-04-13 18:29:34 [scrapy.core.scraper] ERROR: Error processing {'newurl': [u'http://www.tagband.co.uk/'], 
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'} 
Traceback (most recent call last): 
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks 
current.result = callback(current.result, *args, **kw) 
File "/root/scraping/youtubephase2/youtubephase2/pipelines.py", line 18, in process_item 
self.cursor.execute("""UPDATE SearchResults SET AffiliateURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['affiliateurl'], item['start_url'])) 
File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 159, in execute 
query = query % db.literal(args) 
TypeError: not enough arguments for format string
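
This TypeError can be reproduced without a database. The traceback shows the driver building the final query with Python's % operator (query = query % db.literal(args)), and the first statement has four %s placeholders but only two parameters. A minimal sketch, using plain pre-quoted strings in place of the driver's escaped values; the helper name count_placeholders is hypothetical:

```python
# Stand-in for the driver's "query % args" step shown in the traceback.
# No MySQL connection is needed to see the failure.

def count_placeholders(query):
    """Count %s placeholders in a query string (hypothetical helper)."""
    return query.count("%s")

bad_query = "UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s VALUES (%s, %s)"
good_query = "UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s"
params = ("'http://www.tagband.co.uk/'",
          "'https://www.youtube.com/watch?v=UqguztfQPho'")

try:
    bad_query % params            # four placeholders, two values
    error_message = None
except TypeError as exc:
    error_message = str(exc)      # same message as in the traceback

# With the stray VALUES clause removed, placeholders and parameters match.
rendered = good_query % params
```

UPDATE takes its values in the SET clause, so the trailing VALUES (%s, %s) is what adds the two unmatched placeholders.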

The middle SQL execute statement returns this error:

2017-04-13 18:33:18 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') WHERE ResultURL = 'https://www.youtube.com/watch?v=UqguztfQPho'' at line 1 
2017-04-13 18:33:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho> 
{'newurl': [u'http://www.tagband.co.uk/'], 
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}

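The last SQL execute statement returns the same error as the middle one, even though it uses INSERT into a new database. It seems to be adding an extra single quote. The last one works when I INSERT only one of the items into the database.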

2017-04-13 18:36:40 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'https://www.youtube.com/watch?v=UqguztfQPho')' at line 1 
2017-04-13 18:36:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho> 
{'newurl': [u'http://www.tagband.co.uk/'], 
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'}

Sorry for the long post. I was trying to be thorough.

Source

2017-04-13 SDailey

I figured this out. The problem was that I was passing a list to the MySQL execute call in the pipeline.

I created a pipeline that converts the list to a string using "".join(item['newurl']) and returns the item before it reaches the MySQL pipeline.
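
A minimal sketch of such a pipeline, not the author's actual code: the class name JoinNewUrlPipeline is hypothetical, and it assumes item['newurl'] is either a list of strings or already a string.

```python
class JoinNewUrlPipeline(object):
    """Hypothetical pipeline: flatten item['newurl'] from a list to a string."""

    def process_item(self, item, spider):
        value = item.get('newurl')
        if isinstance(value, (list, tuple)):
            # extract() returns a list; join it so the SQL driver gets a scalar.
            item['newurl'] = "".join(value)
        return item


# Usage with a plain dict standing in for the Scrapy item.
item = {
    'newurl': [u'http://www.tagband.co.uk/'],
    'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho',
}
result = JoinNewUrlPipeline().process_item(item, spider=None)
```

For this to run before MySQLStorePipeline, it would need a lower order number than that pipeline in the ITEM_PIPELINES setting.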

A better way might be to change the line item['newurl'] = sel.xpath('@href').extract() in spider.py so that it extracts the first item in the list or converts it to text, but the pipeline worked for me.
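
The spider-side alternative can be illustrated without Scrapy installed. This sketch relies on Scrapy's extract_first(), which returns the first match, or None when nothing matched; the function first_or_none is a hypothetical plain-Python stand-in for that behavior, applied to the list seen in the log output.

```python
def first_or_none(values, default=None):
    """Hypothetical stand-in for the semantics of extract_first()."""
    return values[0] if values else default

extracted = [u'http://www.tagband.co.uk/']   # what extract() produced in the log
newurl = first_or_none(extracted)            # a single string, usable as a SQL parameter
missing = first_or_none([])                  # no match: None rather than an IndexError
```

In the spider this would correspond to item['newurl'] = sel.xpath('@href').extract_first(). Links without an href would then reach the pipeline as None, so the pipeline may need to skip or drop those items.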

Source

2017-04-13 20:33:03 SDailey

Yes, there is an idiomatic way to select the first element: item['newurl'] = sel.xpath('@href').extract_first() –

Well, I feel stupid. I have used that before and did not realize it would be the easy solution in this case. Thanks. – SDailey

Don't feel stupid. If you could not find this information, it means the selector documentation could be improved (assuming you read https://docs.scrapy.org/en/latest/topics/selectors.html) –
