2014-02-23 2 views

Collecting data from sub-links with Web Harvest

Is there a way to collect data from sub-links in Web Harvest?

Below is the XML segment I am using:

<loop item="item" index="i"> 
      <list><var name="products"/></list> 
      <body> 
       <xquery> 
        <xq-param name="item"><var name="item"/></xq-param> 
        <xq-expression><![CDATA[ 
          declare variable $item as node() external; 
          for $i in $item//div[1]/p/a[@trace='auction'][1] 
          let $url := data($i/@href) 

How can I now fetch the data from the page at this new URL, $url?

Please help. Thanks.

Answers

To include this information you need to make another request for each sub-link. I have put together a sample so you can follow it easily. Please have a look.

SCRIPT :

<?xml version="1.0" encoding="UTF-8"?> 
<config> 
    <var-def name="MainSite">http://www.appszoom.com/android_games/arcade_and_action</var-def> 
     <loop item="titles" index="i"> 
     <list> 
      <xpath expression="//li[@class='app captureLinkBox']/div/div/span/a"> 
       <html-to-xml> 
        <http url="${MainSite}"></http> 
       </html-to-xml> 
      </xpath> 
     </list> 
     <body> 
      <var-def name="titleURL"> 
        <xpath expression="data(/a/@href)"> 
         <var name="titles"/> 
        </xpath> 
      </var-def> 
      <file action="append" path="D:\navin.xml"> 
       <xquery> 
        <xq-param name="titles"><template>${titles}</template></xq-param> 
        <xq-param name="titleURLContent"> 
         <html-to-xml> 
          <http url="${titleURL}"></http> 
         </html-to-xml> 
        </xq-param> 
         <xq-expression> 
          <![CDATA[ 
          declare variable $titles as node() external; 
          declare variable $titleURLContent as node() external; 
          <game> 
           <title>{$titles/a/text()}</title> 
           <downloads>{$titleURLContent//*[@id="left-bar"]/p[2]/span/text()}</downloads> 
          </game> 
          ]]> 
         </xq-expression> 
       </xquery> 
      </file> 
     </body> 
    </loop> 
</config> 

OUTPUT: This runs fine for me. You did not provide your full code, but this should get you going.

<game> 
    <title>Clash of Clans</title> 
    <downloads>10,000,000 - 50,000,000</downloads> 
</game> 
<game> 
    <title>DEER HUNTER 2014</title> 
    <downloads>10,000,000 - 50,000,000</downloads> 
</game> 
<game> 
    <title>Subway Surfers</title> 
    <downloads>100,000,000 - 500,000,000</downloads> 
</game> 
<game> 
    <title>RoboCop™</title> 
    <downloads>5,000,000 - 10,000,000</downloads> 
</game> 
<game> 
    <title>DragonFlight for Kakao</title> 
    <downloads>10,000,000 - 50,000,000</downloads> 
</game> 
<game> 
    <title>Castle Clash</title> 
    <downloads>10,000,000 - 50,000,000</downloads> 
</game> 
<game> 
    <title>Sonic Dash</title> 
    <downloads>10,000,000 - 50,000,000</downloads> 
</game> 
<game> 
    <title>Injustice: Gods Among Us</title> 
    <downloads>1,000,000 - 5,000,000</downloads> 
</game> 
<game> 
    <title>Banana Kong</title> 
    <downloads>10,000,000 - 50,000,000</downloads> 
</game> 
<game> 
    <title>Temple Run 2</title> 
    <downloads>100,000,000 - 500,000,000</downloads> 
</game> 

Your way:

<config> 
    <loop item="item" index="i"> 
      <list><var name="products"/></list> 
      <body>   
       <var-def name="new_url"> 
       <xquery> 
        <xq-param name="item"><var name="item"/></xq-param> 
        <xq-expression><![CDATA[ 
          declare variable $item as node() external; 
          for $i in $item//div[1]/p/a[@trace='auction'][1] 
          let $url := data($i/@href) 
           return $url
        ]]></xq-expression> 
       </xquery> 
       </var-def> 

       <!-- now your new url is saved in webharvest variable new_url and you are free to run a 
       new webharvest http request using it --> 

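       <var-def name="new_page_content"> 
        <!-- html-to-xml cleans the raw HTML so that xpath/xquery can process it --> 
        <html-to-xml> 
         <http url="${new_url}"/> 
        </html-to-xml> 
       </var-def>     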

       <!-- now the content of the new page has been downloaded and saved in new variable 
       new_page_content and you are free to query it further should you want to --> 

       <var-def name="contact"> 
       <xpath expression="//a[contains(., 'contact')]/@href"> 
       <var name="new_page_content"/> 
       </xpath> 
       </var-def> 
      </body> 
    </loop>    
</config>
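
One thing to watch for when following sub-links: if the href values on the listing page are relative, the second http call needs an absolute URL. Below is a minimal sketch of how that could be handled. It assumes the `sys.fullUrl(pageUrl, link)` helper described in the Web-Harvest manual is available in your version. The `base_url` value and the variable names `absolute_url` and `sub_page` are hypothetical and not taken from the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- base_url is a hypothetical listing page, not taken from the question -->
    <var-def name="base_url">http://www.example.com/products</var-def>

    <loop item="link" index="i">
        <list>
            <!-- collect the href of every auction link on the listing page -->
            <xpath expression="//a[@trace='auction']/@href">
                <html-to-xml>
                    <http url="${base_url}"/>
                </html-to-xml>
            </xpath>
        </list>
        <body>
            <!-- assumption: sys.fullUrl resolves a relative href against the page URL -->
            <var-def name="absolute_url">
                <template>${sys.fullUrl(base_url, link)}</template>
            </var-def>

            <!-- fetch the sub-page and clean it so xpath/xquery can read it -->
            <var-def name="sub_page">
                <html-to-xml>
                    <http url="${absolute_url}"/>
                </html-to-xml>
            </var-def>
        </body>
    </loop>
</config>
```

From here `sub_page` can be queried with xpath or xquery in the same way as `new_page_content` above.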