2011-11-23 9 views
7

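What is a typical politeness factor for a web crawler?

Apart from always obeying robots.txt, both "Disallow:" and the non-standard "Crawl-delay:".

But if a site does not specify an explicit crawl delay, what should the default value be set to?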

Answer

14

The algorithm we use is as follows:

// If we are blocked by robots.txt,
// make sure it is obeyed.
// Our bot's user-agent string contains a link to an HTML page explaining this,
// plus an email address site owners can write to so that we never even consider their domain in the future.

// If we receive more than 5 consecutive responses with an HTTP response code of 500+ (or timeouts),
// then we assume that either the domain is under heavy load and does not need us adding to it,
// or the URLs we are crawling are completely wrong and causing problems.
// Either way, we suspend crawling of this domain for 4 hours.

// There is a non-standard parameter in robots.txt that defines a min crawl delay 
// If it exists then obey it. 
// 
// see: http://www.searchtools.com/robots/robots-txt-elements.html 
double PolitenssFromRobotsTxt = getRobotPolitness(); 


// Work-size politeness
// Large popular domains are designed to handle load, so we can use a
// smaller delay on these sites than for smaller domains (thus smaller domains hosted by
// mom-and-pop shops on the family PC under the desk in the office are crawled slowly).
//
// But the max delay here is 5 seconds:
//
// domainSize => Range 0 -> 10
//
double workSizeTime = std::min(exp(2.52166863221 + -0.530185027289 * log(domainSize)), 5.0);
// 
// You can find out how important we think your site is here:
//  http://www.opensiteexplorer.org
// Look at the Domain Authority and divide by 10.
// Note: This is not exactly the number we use, but the two numbers are highly correlated,
//  so it will usually give you a fair indication.



// Take into account the response time of the last request. 
// If the server is under heavy load and taking a long time to respond 
// then we slow down the requests. Note time-outs are handled above 
double responseTime = pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2); 

// Use the slower of the calculated times 
double result = std::max(workSizeTime, responseTime); 

//Never faster than the crawl-delay directive 
result = std::max(result, PolitenssFromRobotsTxt); 


// Set a minimum delay
// so we never hit a site more than once every tenth of a second.
result = std::max(result, 0.1);

// The maximum delay we have is 2 minutes.
result = std::min(result, 120.0);
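
For anyone who wants to try the numbers, here is a minimal sketch that wraps the delay rules above into a single function. The function name `crawlDelaySeconds` and its parameters are my own assumptions for illustration, not names from the original crawler. I am also assuming `lastResponseTime` and the robots.txt crawl delay are in seconds, and that a missing Crawl-delay is passed as 0.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (not from the original crawler) combining the rules above.
//   domainSize       : importance score in the range 0..10 (assumed non-negative)
//   lastResponseTime : duration of the previous request, assumed to be in seconds
//   robotsCrawlDelay : Crawl-delay from robots.txt in seconds, 0 if absent (assumption)
double crawlDelaySeconds(double domainSize, double lastResponseTime, double robotsCrawlDelay) {
    // Work-size politeness: bigger domains get a smaller delay, capped at 5 seconds.
    double workSizeTime = std::min(
        std::exp(2.52166863221 - 0.530185027289 * std::log(domainSize)), 5.0);

    // Slow down when the server took a long time to answer the last request.
    double base = 0.203137637588 + 0.724386103344 * lastResponseTime;
    double responseTime = base * base;

    // Use the slower of the two calculated times.
    double result = std::max(workSizeTime, responseTime);

    // Never faster than the crawl-delay directive.
    result = std::max(result, robotsCrawlDelay);

    // Clamp to the range [0.1 s, 120 s].
    result = std::max(result, 0.1);
    return std::min(result, 120.0);
}
```

With these constants, a top-scoring domain (`domainSize` of 10) that responds quickly comes out at roughly 3.7 seconds between requests, and any domain scoring below about 5.6 sits at the 5 second cap unless a slow response or a Crawl-delay directive pushes the delay higher.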