I have the same problem. I am using Nutch 2.3.1 with Solr 5.2.1. How do I make use of the number of rounds in Nutch 2.x?
crawl urls/ucuzcumSeed.txt ucuzcum http://localhost:8983/solr/ucuzcum/ 10
crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
I use only this command for the whole process. The problem is that I cannot fetch the entire website with it; I assume the numberOfRounds parameter is not taking effect. In the first round Nutch finds a single URL to fetch, generates, fetches, and parses it, and in the second round it should pick up more URLs. In my case, though, Nutch stops at the end of the first iteration, even though according to my command it should keep going. How can I crawl an entire website with Nutch?
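For context, my understanding is that one "round" of the Nutch 2.x crawl script is one generate/fetch/parse/updatedb cycle, and the outlinks written back by updatedb should become fetch candidates for the next round. A rough sketch of the per-round commands I would expect the script to run, assuming the standard bin/nutch sub-commands, my crawl ID ucuzcum, and a placeholder -topN (the real script may use different options):

# one-time: inject the seed list into the ucuzcum web table
bin/nutch inject urls/ucuzcumSeed.txt -crawlId ucuzcum

# repeated once per round (10 times for my command)
for round in $(seq 1 10); do
  bin/nutch generate -topN 50000 -crawlId ucuzcum   # select URLs that are due for fetching
  bin/nutch fetch    -all -crawlId ucuzcum          # download the generated batch
  bin/nutch parse    -all -crawlId ucuzcum          # parse content and extract outlinks
  bin/nutch updatedb -all -crawlId ucuzcum          # write new outlinks back to the web table
done

# indexing into Solr (solrindex or index, depending on the exact 2.x version)
# would then run against http://localhost:8983/solr/ucuzcum/

If new URLs stop appearing after the first cycle, my guess is that either generate finds nothing eligible in the second round or the extracted outlinks are being filtered out before updatedb stores them.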
My nutch-site.xml file is:
<property>
<name>http.agent.name</name>
<value>MerveCrawler</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-rege$
</property>
<property>
<name>http.content.limit</name>
<value>-1</value><!-- No limit -->
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
<description>If true, fetcher will log more verbosely.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>100000000000000000000000000000000000000000000</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>false</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>10</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the file
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the http.content.limit setting.
</description>
</property>
<property>
<name>http.timeout</name>
<value>100000000000000000000000000000000000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>generate.max.count</name>
<value>100000000</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
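One more thing I am unsure about: since plugin.includes enables urlfilter-regex, I assume every outlink is also passed through conf/regex-urlfilter.txt. From memory, the stock file looks roughly like the sketch below (my actual file may differ), and a restrictive rule there would silently drop URLs before they ever reach the next round:

# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing characters that usually denote queries or sessions
-[?*!@=]
# accept everything else
+.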
I checked whether the website has a robots.txt, but I did not find any limits there. What other reasons could there be for not being able to crawl the whole site? – mrvsta