2017-02-22 8 views
0

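How to extract specific data from web-crawled data using a Pig script

In this sample I crawled data with Nutch 2.3.1, and I need to extract the title and URL from each page, wherever they appear in the crawled content.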

I used this command to load the data from HBase into Pig:

`data9 = load 'hbase://htest15_webpage' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:cnt', '-loadKey true');`

column=f:cnt, timestamp=1487743991250, value=<!DOCTYPE htm 
        l>\x0D\x0A<!--[if IE 7]>\x0D\x0A<html class="ie ie7" lang= 
        "en-US" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns 
        /fb#">\x0D\x0A<![endif]-->\x0D\x0A<!--[if IE 8]>\x0D\x0A<h 
        tml class="ie ie8" lang="en-US" prefix="og: http://ogp.me/ 
        ns# fb: http://ogp.me/ns/fb#">\x0D\x0A<![endif]-->\x0D\x0A 
        <!--[if !(IE 7) | !(IE 8) ]><!-->\x0D\x0A<html lang="en-U 
        S" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#" 
        >\x0D\x0A<!--<![endif]-->\x0D\x0A<head>\x0D\x0A <meta cha 
        rset="UTF-8" /> \x0D\x0A <meta name="viewport" content="w 
        idth=device-width" /> \x0D\x0A \x0D\x0A<title>www.hardhe 
        ro.com</title>\x0A\x0A<!-- SEO Ultimate (http://www.seodes 
        ignsolutions.com/wordpress-seo/) -->\x0A\x09<meta name="de 
        scription" content="hardhero.com blog that provides intere 
        sting articles on general topics and ideas." />\x0A\x09<me 
        ta name="keywords" content="Internet Blog,SEO Blog,Interne 
        t Marketing Guide" />\x0A\x09<meta property="og:type" cont 
        ent="blog" />\x0A\x09<meta property="og:title" content="ww 
        w.hardhero.com" />\x0A\x09<meta property="og:url" content= 
        "http://hardhero.com/" />\x0A\x09<meta property="og:site_n 
        ame" content="www.hardhero.com" />\x0A\x09<meta name="twit 
        ter:card" content="summary" />\x0A<!-- /SEO Ultimate -->\x 
        0A\x0A<link rel="alternate" type="application/rss+xml" tit 
        le="www.hardhero.com &raquo; Feed" href="http://hardhero.c 
        om/feed/" />\x0A<link rel="alternate" type="application/rs 
        s+xml" title="www.hardhero.com &raquo; Comments Feed" href 
        ="http://hardhero.com/comments/feed/" />\x0A\x09\x09<scrip 
        t type="text/javascript">\x0A\x09\x09\x09window._wpemojiSe 
        ttings = {"baseUrl":"http:\x5C/\x5C/s.w.org\x5C/images\x5C 
        /core\x5C/emoji\x5C/72x72\x5C/","ext":".png","source":{"co 
        ncatemoji":"http:\x5C/\x5C/hardhero.com\x5C/wp-includes\x5 
        C/js\x5C/wp-emoji-release.min.js?ver=4.2.11"}};\x0A\x09\x0 
        9\x09!function(a,b,c){function d(a){var c=b.createElement(
        "canvas"),d=c.getContext&&c.getContext("2d");return d&&d.f 
        illText?(d.textBaseline="top",d.font="600 32px Arial","fla 
        g"===a?(d.fillText(String.fromCharCode(55356,56812,55356,5 
        6807),0,0),c.toDataURL().length>3e3):(d.fillText(String.fr 
        omCharCode(55357,56835),0,0),0!==d.getImageData(16,16,1,1) 
        .data[0])):!1}function e(a){var c=b.createElement("script" 
       );c.src=a,c.type="text/javascript",b.getElementsByTagName(
        "head")[0].appendChild(c)}var f,g;c.supports={simple:d("si 
        mple"),flag:d("flag")},c.DOMReady=!1,c.readyCallback=funct 
        ion(){c.DOMReady=!0},c.supports.simple&&c.supports.flag||(
        g=function(){c.readyCallback()},b.addEventListener?(b.addE 
        ventListener("DOMContentLoaded",g,!1),a.addEventListener(" 
        load",g,!1)):(a.attachEvent("onload",g),b.attachEvent("onr 
        eadystatechange",function(){"complete"===b.readyState&&c.r 
        eadyCallback()})),f=c.source||{},f.concatemoji?e(f.concate 
        moji):f.wpemoji&&f.twemoji&&(e(f.twemoji),e(f.wpemoji)))}(
        window,document,window._wpemojiSettings);\x0A\x09\x09</scr 
        ipt>\x0A\x09\x09<style type="text/css">\x0Aimg.wp-smiley,\ 
        x0Aimg.emoji {\x0A\x09display: inline !important;\x0A\x09b 
        order: none !important;\x0A\x09box-shadow: none !important 
        ;\x0A\x09height: 1em !important;\x0A\x09width: 1em !import 
        ant;\x0A\x09margin: 0 .07em !important;\x0A\x09vertical-al 
        ign: -0.1em !important;\x0A\x09background: none !important 
        ;\x0A\x09padding: 0 !important;\x0A}\x0A</style>\x0A<link 
        rel='stylesheet' id='es-widget-css-css' href='http://hard 
        hero.com/wp-content/plugins/email-subscribers/widget/es-wi 
        dget.css?ver=4.2.11' type='text/css' media='all' />\x0A<li 
        nk rel='stylesheet' id='shootingstar-style-css' href='htt 
        p://hardhero.com/wp-content/themes/shootingstar/style.css? 
        ver=4.2.11' type='text/css' media='all' />\x0A<link rel='s 
        tylesheet' id='shootingstar-elegantfont-css' href='http:/ 
        /hardhero.com/wp-content/themes/shootingstar/css/elegantfo 


Answers

1

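It is really hard to reproduce your use case and provide a proper script, because the content you posted is line-wrapped and cannot be used as is. My general advice is to apply XPath to the text, using the UDF from the org.apache.pig.piggybank.evaluation.xml.XPath package. If the value contains valid HTML, you can run a set of XPath queries that return the title and the URL.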

For an example of working with Pig via XPath, see this link.
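
As a rough sketch of what this advice could look like against the table from the question: the script below assumes that piggybank.jar can be registered (its location here is a guess), and that the row key and `f:cnt` can both be read as chararray. The relation and field names (`pages`, `rowkey`, `content`, `title_xpath`, `title_regex`, `og_url`) are made up for illustration. The XPath UDF parses its input as XML, so it will probably return nothing for pages like the one in the question, which contains `&raquo;` and an unescaped `&&` inside a script. For that reason the sketch also shows a regex fallback using Pig's built-in REGEX_EXTRACT.

```pig
-- Assumption: piggybank.jar is reachable from the Pig client; adjust the path.
REGISTER piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();

-- Load the row key plus the raw page content, as in the question,
-- with an explicit schema so the fields can be referenced by name.
pages = LOAD 'hbase://htest15_webpage'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:cnt', '-loadKey true')
        AS (rowkey:chararray, content:chararray);

extracted = FOREACH pages GENERATE
    rowkey,
    -- Only useful when the content parses as well-formed XML.
    XPath(content, 'html/head/title') AS title_xpath,
    -- Regex fallback for ordinary HTML; (?is) = case-insensitive, dot matches newlines.
    REGEX_EXTRACT(content, '(?is).*?<title[^>]*>(.*?)</title>.*', 1) AS title_regex,
    REGEX_EXTRACT(content, '(?is).*?<meta[^>]*property="og:url"[^>]*content="([^"]*)".*', 1) AS og_url;

DUMP extracted;
```

Fields that do not match come back as null, so `title_xpath` and `title_regex` can be compared row by row to see which approach suits the crawled pages. Nutch 2.x normally uses the page URL in reversed-host form as the row key, so `rowkey` can stand in for the URL when a page has no `og:url` tag.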
+0

Thanks @Alex, let me try that and I will get back to you.