Web scraping a table to JSON

I think my Python skills are falling well short here. I'm trying to scrape oil production data from an open data website and convert it to JSON format.

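So far I have run through the HTML table tags to build a list of headers and a list of row data. What I'm struggling with is nesting some of this data inside a single JSON record.

Right now I get all of the JSON either per row or per header: one header with all of the column data beneath it, then the next header with its column data, and so on.

If possible, I'd like to assign the headers and one row of data to a single record. The next record would repeat the headers but hold the row 2 data, the record after that the headers and row 3, and so on. The other option makes sense if you look at the source table: there should be one record per oil field, and a field can have multiple rows, one per production year/month.

If possible, I'd like to capture all of that information in one JSON record per field, with the multiple year/month rows nested inside that record.

Essentially, I think it needs a bit more Python and some loops over the distinct HTML table cells, but working that out is just beyond my Python ability. I hope this makes sense; I've tried to be reasonably descriptive.

Desired JSON record (year concatenated with the month field):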

{
    "Field (Discovery)": "asset1",
    "Oil - saleable": [
        {"yearmonth": "201701", "unit": "mmboe", "value": "1234.456"},
        {"yearmonth": "201702", "unit": "mmboe", "value": "124.46"}
    ],
    "Gas - saleable": [
        {"yearmonth": "201701", "unit": "bill Sm3", "value": "1234.456"},
        {"yearmonth": "201702", "unit": "mill Sm3", "value": "14.456"}
    ],
    "NGL - saleable": [
        {"yearmonth": "201704", "unit": "mill Sm3", "value": "1.456"},
        {"yearmonth": "201706", "unit": "bill Sm3", "value": "14.456"}
    ],
    "Condensate - saleable": [
        {"yearmonth": "201701", "unit": "mill Sm3", "value": "23.60"},
        {"yearmonth": "201608", "unit": "mill Sm3", "value": "4.4"}
    ],
    "NPDID information carrier": "43765"
}

Current Python row

[ 
"\u00c5SGARD", 
"2017", 
"8", 
"0.19441", 
"0.81545", 
"0.26954", 
"0.00000", 
"1.27940", 
"0.07432", 
"43765" 
]
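
A minimal sketch of the first option described above (one record per table row), assuming the header strings and the row are already available as plain Python lists. The header names are copied from the question; `row_to_record` is a hypothetical helper for illustration, not part of the original script.

```python
import json

# Header names as they appear in the question (the "div" text comes from
# the separator the script passes to getText).
HEADERS = [
    "Field (Discovery)", "Year", "Month",
    "Oil - saleable div[mill Sm3]", "Gas - saleable div[bill Sm3]",
    "NGL - saleable div[mill Sm3]", "Condensate - saleable div[mill Sm3]",
    "Oil equivalents - saleable div[mill Sm3]",
    "Water - wellbores div[mill Sm3]", "NPDID information carrier",
]


def row_to_record(headers, row):
    """Pair each header with the cell in the same position (hypothetical helper)."""
    if len(headers) != len(row):
        raise ValueError("header/row length mismatch")
    return dict(zip(headers, row))


# The row shown in the question.
row = ["\u00c5SGARD", "2017", "8", "0.19441", "0.81545", "0.26954",
       "0.00000", "1.27940", "0.07432", "43765"]

record = row_to_record(HEADERS, row)
print(json.dumps(record, indent=4, ensure_ascii=False))
```

Applying the same helper to every chunk of ten cells would give one flat JSON record per table row.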

Desired record per header (all of the column data for that field)

{
"Field (Discovery)":[cell1,cell2,cell3,cell4 etc],
"Year":[cell1,cell2,cell3,cell4 etc],
"Month":[cell1,cell2,cell3,cell4 etc],
"Oil - saleable div[mill Sm3]":[cell1,cell2,cell3,cell4 etc],
"Gas - saleable div[bill Sm3]":[cell1,cell2,cell3,cell4 etc],
"NGL - saleable div[mill Sm3]":[cell1,cell2,cell3,cell4 etc],
"Condensate - saleable div[mill Sm3]":[cell1,cell2,cell3,cell4 etc],
"Oil equivalents - saleable div[mill Sm3]":[cell1,cell2,cell3,cell4 etc],
"Water - wellbores div[mill Sm3]":[cell1,cell2,cell3,cell4 etc],
"NPDID information carrier":[cell1,cell2,cell3,cell4 etc]
}


JSON script

import requests 
from bs4 import BeautifulSoup 
import json 
import boto3 
import botocore 
from datetime import datetime 
from collections import OrderedDict 

starttime = datetime.now() 

# User-Agent detail to reduce the chance of scraping bot detection
user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 '
              'Safari/537.36')

header = {'User-Agent': user_agent}

# Webpage connection (the URL is split across lines for readability)
html = ("http://factpages.npd.no/ReportServer?/FactPages/TableView/"
        "field_production_monthly&rs:Command=Render&rc:Toolbar=false"
        "&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en")

r=requests.get(html, headers=header) 
c=r.content 
soup=BeautifulSoup(c,"html.parser") 


rows = soup.findAll('td', { 
'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr', 'a81cr', 
'a85cr','a89cr', 'a93c', 'a97cr']}) 

headers = soup.findAll('td', { 
'class': ['a20c', 'a24c', 'a28c', 'a32c', 'a36c', 'a40c', 'a44c', 
'a48c','a52c', 'a56c']}) 

headers_list = [item.getText('div') for item in headers] 

rows_list=[item.getText('div') for item in rows] 

final=[rows_list[item:item+10] for item in range(0,len(rows_list),10)] 

# Collect every column's cells under its header
row_header = {}
for item in final:
    for indices in range(0, 10):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])


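# Serialize the dictionary to a JSON string
result = json.dumps(row_header, indent=4, sort_keys=True,
                    ensure_ascii=False)
with open('data.txt', 'wt', encoding='utf-8') as outfile:
    # Dump the dictionary itself; passing the `result` string here
    # would JSON-encode it a second time
    json.dump(row_header, outfile, indent=4, sort_keys=True,
              separators=(',', ': '), ensure_ascii=False)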


#Time 
runtime = datetime.now() - starttime 
print(runtime)


2017-11-14 Chris

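It's hard to say without seeing the HTML, but I'd guess that parsing the table would work better along the lines of the question "python BeautifulSoup parsing table". That way you can make sure that the headers and the values line up.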
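
As a rough sketch of that idea, the table can be read one `<tr>` at a time so each cell keeps its position relative to the header row. The markup in `SAMPLE_HTML` is a hypothetical stand-in (the real report page may be structured differently), and `parse_table` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real report page (hypothetical markup).
SAMPLE_HTML = """
<table>
  <tr><td>Field (Discovery)</td><td>Year</td><td>Month</td></tr>
  <tr><td>FIELD_A</td><td>2017</td><td>8</td></tr>
  <tr><td>FIELD_A</td><td>2017</td><td>7</td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows), reading the first table one <tr> at a time."""
    soup = BeautifulSoup(html, "html.parser")
    table_rows = soup.find("table").find_all("tr")
    cells = [[td.get_text(strip=True) for td in tr.find_all("td")]
             for tr in table_rows]
    headers, rows = cells[0], cells[1:]
    # Keep only rows whose width matches the header row.
    rows = [r for r in rows if len(r) == len(headers)]
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
records = [dict(zip(headers, r)) for r in rows]
print(records)
```

Because every row is paired with the headers by position, a missing or extra cell shows up as a dropped row rather than a silent shift of values into the wrong column.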

I think simply handing over the answer would be a disservice to your learning, so I hope the following helps instead.

Suppose we have a list of items [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)] and we want to turn it into a dictionary of lists, {'a': [1, 2, 3], 'b': [1, 2, 3]}. We can build that dictionary like this:

d = {} 
for v1, v2 in items: 
    d.setdefault(v1, []).append(v2)

setdefault here is basically a shortcut for "if the key already has a list, use it; otherwise create an empty one first".
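
To make the shortcut concrete, here is a small comparison of three equivalent ways to group the pairs. The `defaultdict` variant is an alternative from the standard library that the answer does not mention; it is included only for comparison.

```python
from collections import defaultdict

items = [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)]

# Long form: create the list the first time a key is seen.
explicit = {}
for key, value in items:
    if key not in explicit:
        explicit[key] = []
    explicit[key].append(value)

# Same result with setdefault, as in the answer.
with_setdefault = {}
for key, value in items:
    with_setdefault.setdefault(key, []).append(value)

# Same result with defaultdict, which builds the empty list on first access.
with_defaultdict = defaultdict(list)
for key, value in items:
    with_defaultdict[key].append(value)

assert explicit == with_setdefault == dict(with_defaultdict)
print(explicit)  # {'a': [1, 2, 3], 'b': [1, 2, 3]}
```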

After this, d.items() holds the grouped values. The next step is to incorporate the headers. Start with a single column, Oil - saleable, and build up from there.
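
A minimal sketch of that first step, assuming each row has been trimmed to its first four cells (field, year, month, oil). The first row uses the values from the question; the other rows, the `FIELD_B` name, and the zero-padded `yearmonth` format are invented for illustration.

```python
import json

# field, year, month, oil value. Only the first row comes from the question;
# the remaining rows are made-up illustration values.
rows = [
    ["\u00c5SGARD", "2017", "8", "0.19441"],
    ["\u00c5SGARD", "2017", "7", "0.20000"],
    ["FIELD_B", "2017", "8", "0.50000"],
]

oil_by_field = {}
for field, year, month, oil in rows:
    entry = {
        "yearmonth": f"{year}{int(month):02d}",  # "2017" and "8" become "201708"
        "unit": "mill Sm3",
        "value": oil,
    }
    # Same grouping idea as above: one list of entries per field name.
    oil_by_field.setdefault(field, []).append(entry)

records = [{"Field (Discovery)": field, "Oil - saleable": entries}
           for field, entries in oil_by_field.items()]
print(json.dumps(records, indent=4, ensure_ascii=False))
```

Once this works for one column, the remaining measure columns can be added to each field's record in the same way.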


2017-11-14 20:01:55 typingduck

Thanks for the help; I appreciate the guidance rather than a direct answer. I'll study a bit to see whether I can implement some kind of BS dictionary later. – Chris

If it helps, feel free to ask here for clarification :) – typingduck

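I understand nested dictionaries a bit better now, but I'm not entirely sure whether I've gotten a little lost. Basically, to build the records below one at a time, I need to loop over each row and, inside that loop, nest multiple rows per 'fieldname' record according to 'yearmonth'. [code](https://pyfiddle.io/fiddle/909beab3-d34a-4099-b8fc-04eaecfcd85b/?i=true) – Chris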
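
One possible shape for that loop, offered as a sketch rather than a definitive solution: each row is paired with the headers, and every measure column is appended under its field's record together with the `yearmonth` and the unit taken from the header text. `split_header` and `nest_by_field` are hypothetical helper names. The header strings and the first row are taken from the question; the second row is invented for illustration.

```python
import json
import re

HEADERS = [
    "Field (Discovery)", "Year", "Month",
    "Oil - saleable div[mill Sm3]", "Gas - saleable div[bill Sm3]",
    "NGL - saleable div[mill Sm3]", "Condensate - saleable div[mill Sm3]",
    "Oil equivalents - saleable div[mill Sm3]",
    "Water - wellbores div[mill Sm3]", "NPDID information carrier",
]


def split_header(header):
    """Split 'Oil - saleable div[mill Sm3]' into ('Oil - saleable', 'mill Sm3')."""
    match = re.match(r"(.+?) div\[(.+)\]$", header)
    if match is None:
        raise ValueError(f"unexpected header format: {header!r}")
    return match.group(1), match.group(2)


def nest_by_field(headers, rows):
    """Build one record per field, nesting a yearmonth entry per measure."""
    fields = {}
    for row in rows:
        cells = dict(zip(headers, row))
        name = cells["Field (Discovery)"]
        record = fields.setdefault(name, {
            "Field (Discovery)": name,
            "NPDID information carrier": cells["NPDID information carrier"],
        })
        yearmonth = f"{cells['Year']}{int(cells['Month']):02d}"
        for header in headers[3:9]:  # the six measure columns
            label, unit = split_header(header)
            record.setdefault(label, []).append(
                {"yearmonth": yearmonth, "unit": unit, "value": cells[header]})
    return list(fields.values())


rows = [
    # Row shown in the question.
    ["\u00c5SGARD", "2017", "8", "0.19441", "0.81545", "0.26954",
     "0.00000", "1.27940", "0.07432", "43765"],
    # Invented second month for the same field, for illustration only.
    ["\u00c5SGARD", "2017", "7", "0.2", "0.8", "0.3", "0.0", "1.3", "0.1", "43765"],
]

nested = nest_by_field(HEADERS, rows)
print(json.dumps(nested, indent=4, ensure_ascii=False))
```

Keying the intermediate dictionary on the field name is what lets multiple year/month rows accumulate inside a single record.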
