Python: parse HTML and produce a tabular text file
Problem: I want to parse the HTML code and produce a tabular text file like this:
East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...
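What I get instead: only East Counties appears in the txt file, so the for loop is failing to output each new region. My attempt follows the HTML code.
HTML code: the code can be found on this html page. An excerpt corresponding to the table above: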
<h2>
East Counties</h2>
<table>
<thead>
<tr>
<th>
<span id="listRegions_lvFiles_0_titleLAName_0">Local authority</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleUpdate_0">Last update</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleEstablishments_0">Number of businesses</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleCulture_0">Download</span>
</th>
</tr>
</thead>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_0">Babergh</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_0">04/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_0"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_0">876</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_0" title="Babergh: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a>
</td>
</tr>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_1">Basildon</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_1">06/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_1"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_1">1,134</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_1" title="Basildon: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a>
</td>
</tr>
My attempt:
from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup

url = 'http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')

regions = []
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]:  # Skip the first 6 h2 elements
        region = h2.text.strip()  # Text of each h2 without surrounding whitespace
        regions.append(str(region))
        f.write(region + '\n')
        for tr in soup.find_all('tr')[1:]:  # Skip headers
            tds = tr.find_all('td')
            if len(tds) == 0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                        (str(tds[0].text)[1:-1], link, places) + '\n')
How can I fix this?
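One possible direction, offered as a sketch rather than a definitive fix: the attempt looks up rows with soup.find_all('tr'), which searches the whole document, so the rows are never tied to a particular heading. Pairing each h2 with the table that follows it, and then searching for rows only inside that table, keeps every region with its own rows. The code below is written for Python 3 and runs on an inline sample instead of the live page. It assumes each region's table is the next h2 or table element after its heading, as in the excerpt. The function name region_lines and the second region ("Example Region") are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline sample mirroring the excerpt above. "Example Region" and its row
# are hypothetical, added only so that the grouping is visible.
SAMPLE_HTML = """
<h2>
East Counties</h2>
<table>
<thead><tr><th><span>Local authority</span></th></tr></thead>
<tr>
<td><span>Babergh</span></td>
<td><span>04/05/2017 </span> at <span> 12:00</span></td>
<td><span>876</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a></td>
</tr>
<tr>
<td><span>Basildon</span></td>
<td><span>06/05/2017 </span> at <span> 12:00</span></td>
<td><span>1,134</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a></td>
</tr>
</table>
<h2>Example Region</h2>
<table>
<tr>
<td><span>Example Town</span></td>
<td><span>01/01/2017 </span> at <span> 12:00</span></td>
<td><span>10</span></td>
<td><a href="http://example.com/example.xml">English language</a></td>
</tr>
</table>
"""


def region_lines(html):
    """Return output lines: each region heading followed by its own rows."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for h2 in soup.find_all('h2'):
        # Next h2 or table in document order; if another h2 comes first,
        # this heading has no table of its own and is skipped.
        nxt = h2.find_next(['h2', 'table'])
        if nxt is None or nxt.name != 'table':
            continue
        lines.append(h2.get_text(strip=True))
        for tr in nxt.find_all('tr'):  # rows of this table only
            tds = tr.find_all('td')
            a = tr.find('a', href=True)
            if len(tds) < 3 or a is None:
                continue  # header row or unexpected layout
            name = tds[0].get_text(strip=True)
            places = int(tds[2].get_text(strip=True).replace(',', ''))
            # Read the link from the href attribute instead of slicing str(a).
            lines.append('%s, %s, %d' % (name, a['href'], places))
    return lines


if __name__ == '__main__':
    print('\n'.join(region_lines(SAMPLE_HTML)))
```

To produce the file, the returned lines can be joined with '\n' and written to Regions_and_files.txt. On the real page the html argument would be the downloaded markup, and headings that are not followed by a table are skipped by the name check, which may make the [6:] slice unnecessary. That has not been verified against the live page.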
Would you nest the 'h2' search over 'table'? – FaCoffee
That got me past the impasse. Thanks a lot! – FaCoffee