Extracting URLs from a website using Web Harvest

I am trying to extract the URLs of a website that has no sitemap, using the Web Harvest tool.

I don't know much about Java or coding. Could someone help me use this tool?

I want to run it against a specific website (e.g. example.com) and extract every single URL on that website.

Source

2013-07-14 Caleb

You should go through the Web Harvest user manual at http://web-harvest.sourceforge.net/manual.php. It contains several examples.

Source

2013-07-17 05:39:06

Example.com is not a very good example, since it has only one link!

<?xml version="1.0" encoding="UTF-8"?> 

<config> 
     <!-- 1: provide inputs   --> 
     <script><![CDATA[ 
       url="http://stackoverflow.com/questions/17635763/trying-to-extract-urls-from-a-website-using-web-harvest"; 

       output_path = "C:/webharvest/"; 
       file_name = "urllist.txt";    
       output_file = output_path + file_name;     

      ]]></script> 

     <!-- 5 : save the resulting list in a variable  -->  
     <var-def name="urls"> 
      <!-- 4 : select only links (outputs a list variable)   -->  
      <xpath expression='//a/@href'> 
       <!-- 3 : convert it to XML, for querying   --> 
       <html-to-xml> 
        <!-- 2 : load the page  --> 
        <http url="${url}"/> 
       </html-to-xml> 
      </xpath> 
     </var-def> 

     <!-- 7: write to output file   --> 
     <file action="write" path="${output_file}"> 
      <!-- 6 : convert the list variable into a string with each link on a new line  --> 
      <text delimiter="${sys.cr}${sys.lf}"> 
      <var name="urls" /> 
      </text> 
     </file>    

</config>

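The configuration above is my code, with some comments. The comments are numbered in execution order, so read them from 1 to 7. :)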
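The configuration above collects the links from a single page, while the question asks for every URL on a site. Doing that means following the collected links and repeating the extraction on each page that belongs to the same host. The following is only a minimal sketch of that idea in plain Python rather than Web Harvest. The names `LinkParser`, `extract_links`, `fetch`, and `crawl`, as well as the `max_pages` limit, are assumptions made for illustration and are not part of Web Harvest.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag, similar to the XPath //a/@href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for all links in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def fetch(url):
    """Download a page as text (hypothetical helper: no retries, no robots.txt check)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a sorted list of every same-host URL that was discovered.
    max_pages is an assumed safety limit on the number of pages fetched.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        for link in extract_links(html, url):
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue  # ignore mailto:, javascript:, and similar links
            if parts.netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

Calling `crawl("http://example.com/")` would return the list of discovered URLs, and joining it with "\r\n" gives the same one-link-per-line output that the configuration writes to urllist.txt. The `fetch_page` parameter exists so that the download step can be replaced, for example with a function that adds a delay between requests.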

Source

2014-05-08 14:34:46 user3616725
