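XPath for img src within an element

How do I modify the code below to select the source of the images inside a description element that contains HTML? At the moment it gets the entire text inside the element, and I don't know how to modify it to get the src of every img tag.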

print(description.xpath("//img/@src"))

Gives me 'None'.

The XML structure:

<guides> 
<guide> 
    <id>guide 1</id> 
    <group> 
    <id></id> 
    <type></type> 
    <name></name> 
    </group> 
    <pages> 
     <page> 
      <id>page 1</id> 
      <name></name> 
      <description>&lt;p&gt;Some text. &lt;br /&gt;&lt;img 
      width=&quot;81&quot; 
      src=&quot;http://www.example.com/img.jpg&quot; 
      alt=&quot;wave&quot; height=&quot;63&quot; style=&quot;float: 
       right;&quot; /&gt;&lt;/p&gt;</description> 
      <boxes> 
       <box> 
        <id></id> 
        <name></name> 
        <type></type> 
        <map_id></map_id> 
        <column></column> 
        <position></position> 
        <hidden></hidden> 
        <created></created> 
        <updated></updated> 
        <assets> 
         <asset> 
          <id></id> 
          <name></name> 
          <type></type> 
         <description>&lt;img src=&quot;https://www.example.com/image.jpg&quot; alt=&quot;image&quot; height=&quot;42&quot; width=&quot;42&quot;&gt;</description> 
          <url/> 
          <owner> 
           <id></id> 
           <email></email> 
           <first_name></first_name> 
           <last_name></last_name> 
          </owner> 
         </asset> 
        </assets> 
       </box> 
      </boxes> 
     </page> 
    </pages> 
</guide>
</guides>

Source

2017-11-03 podusmonens

The content of the description element is HTML. There are various ways to parse it; one of them is lxml's html module.

>>> description.text 
'<img src="https://www.example.com/image.jpg" alt="image" height="42" width="42">' 
>>> from lxml import html 
>>> img = html.fromstring(description.text) 
>>> img.attrib['src'] 
'https://www.example.com/image.jpg'
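Note that attrib['src'] on the parsed result only works when the img is the outermost element of the fragment, as in the asset description. In the page description above the img sits inside a p, so the parsed root would have no src attribute. Below is a minimal sketch that collects the src of every img wherever it sits. It uses the standard library's html.parser instead of lxml (a swapped-in technique, chosen only so the example has no dependencies); ImgSrcCollector and img_sources are hypothetical names.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def img_sources(fragment):
    """Return all img src values found in an HTML fragment (hypothetical helper)."""
    collector = ImgSrcCollector()
    collector.feed(fragment or "")  # tolerate None, e.g. an empty description
    collector.close()
    return collector.sources


if __name__ == "__main__":
    print(img_sources('<p>Some text. <br /><img width="81" '
                      'src="http://www.example.com/img.jpg" alt="wave" /></p>'))
```

Called as img_sources(description.text), this returns a list, so a description with no image yields an empty list rather than raising.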

Edit, in response to a comment:

>>> from lxml import etree, html 
>>> tree = etree.parse('temp.xml') 
>>> for guide in tree.xpath('guide'): 
...  '---', guide.xpath('id')[0].text 
...  for pages in guide.xpath('.//pages'): 
...   for page in pages: 
...    '------', page.xpath('id')[0].text 
...    for description in page.xpath('.//asset/description'): 
...     '---------', html.fromstring(description.text).attrib['src'] 
... 
('---', 'guide 1') 
('------', 'page 1') 
('---------', 'https://www.example.com/image.jpg')

Edit: exception handling.

Replace

'---------', html.fromstring(description.text).attrib['src']

with

try: 
    '---------', html.fromstring(description.text).attrib['src'] 

except KeyError: 
    '--------- No image URL present'

Edit, in response to the 9 Nov comment: for an XML file whose 2nd guide element contains no HTML at all:

from lxml import etree, html

tree = etree.parse('guides.xml')
for guide in tree.xpath('guide'):
    print('---', guide.xpath('id')[0].text)
    for pages in guide.xpath('.//pages'):
        for page in pages:
            print('------', page.xpath('id')[0].text)
            for description in page.xpath('.//asset/description'):
                try:
                    print('---------', html.fromstring(description.text).attrib['src'])
                except (TypeError, KeyError):
                    # No text to parse (TypeError) or no src on the parsed element (KeyError)
                    print('--------- no src identifiable')

The output, where the 3rd guide's description contains HTML without a src attribute:

--- guide 1 
------ page 1 
--------- https://www.example.com/image.jpg 
--- guide 2 
------ page 1 
--------- no src identifiable 
--- guide 3 
------ page 1 
--------- no src identifiable 
--- guide 4 
------ page 1 
--------- https://www.example.com/image.jpg
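The two exception types above appear to correspond to two different situations: a description with no text at all, and a description whose HTML carries no src. Below is a minimal sketch of the same distinction without exceptions, using only the standard library (xml.etree.ElementTree and html.parser, not lxml). The three-guide document is made up and much flatter than the real structure, and first_img_src and report are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up sample: guide 1 has an image, guide 2 has an empty description,
# guide 3 has HTML without any src attribute.
SAMPLE = """<guides>
  <guide><id>guide 1</id><asset><description>&lt;img src=&quot;https://www.example.com/image.jpg&quot;&gt;</description></asset></guide>
  <guide><id>guide 2</id><asset><description/></asset></guide>
  <guide><id>guide 3</id><asset><description>&lt;p&gt;text only&lt;/p&gt;</description></asset></guide>
</guides>"""


class FirstImgSrc(HTMLParser):
    """Remember the src of the first <img> that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_img_src(text):
    """Return the first img src in an HTML string, or None (hypothetical helper)."""
    if not text:  # empty element: .text is None
        return None
    parser = FirstImgSrc()
    parser.feed(text)
    parser.close()
    return parser.src


def report(xml_text):
    """Build one line per guide and one per asset description."""
    root = ET.fromstring(xml_text)
    lines = []
    for guide in root.iter("guide"):
        lines.append("--- " + guide.findtext("id", default=""))
        for description in guide.iterfind(".//asset/description"):
            src = first_img_src(description.text)
            lines.append("--------- " + (src or "no src identifiable"))
    return lines


for line in report(SAMPLE):
    print(line)
```

Returning None for both cases keeps the loop free of try/except; if the two cases need different messages, the helper could return a distinct marker for an empty description.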

Source

2017-11-03 15:33:02

This returns a whole load of results. Is there a way to integrate it into a for loop? – podusmonens

Please see the edit. –

If one of the description elements doesn't contain an image URL, will it carry on parsing or will it stop? When I run it against my XML I get KeyError: 'src'. – podusmonens

You can try this:

>>> from lxml import etree 
>>> tree = etree.parse('temp.xml') 
>>> for guide in tree.xpath('guide'): 
...  '---', guide.xpath('id')[0].text 
...  for pages in guide.xpath('.//pages'): 
...   for page in pages: 
...    '------', page.xpath('id')[0].text 
...    for description in page.xpath('.//asset/description'): 
...     '---------', description.text

You can also try this solution at the end:

description.xpath("//img/@src")

Source

2017-11-03 10:50:05

This just returns an empty [] – podusmonens

I don't understand why description.text gives me the text, including the image URL, but description.xpath("//img/@src") gives [] – podusmonens
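On why description.xpath("//img/@src") comes back empty: in the XML above the markup inside description is escaped (&lt;img ...&gt;), so the XML parser treats it as character data rather than as child elements. There is no img element anywhere in the tree for XPath to match, and a path starting with // searches from the document root rather than from description in any case. Below is a small sketch on a cut-down sample, using the standard library's xml.etree.ElementTree only so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Cut-down, made-up sample with the same escaping as the asset description.
SAMPLE = (
    "<asset><description>"
    "&lt;img src=&quot;https://www.example.com/image.jpg&quot; alt=&quot;image&quot;&gt;"
    "</description></asset>"
)

root = ET.fromstring(SAMPLE)
description = root.find("description")

# The escaped markup is unescaped into one plain string ...
print(repr(description.text))
# ... and the element has no child elements at all,
print(len(description))
# so an element search for img over the whole tree finds nothing.
print(root.findall(".//img"))
```

That is why the text has to be handed to an HTML parser first, as in the answer above, before any img lookup can succeed.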
