Getting clean text from text/html documents using BeautifulSoup

I have documents that have two content types: text/xml and text/html. I want to use BeautifulSoup to parse the documents and end up with a clean text version. The documents start out as tuples, so I've been using repr to turn them into something BeautifulSoup recognizes, and then using find_all to search for divs and pick out just the text/html bit of the document, like so:

soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

Then I turn text back into a string, save it to a variable, turn that back into a soup object, and call get_text on it, like so:

str_text = str(text)
soup_text = BeautifulSoup(str_text)
soup_text.get_text()

However, that then changes the encoding to unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17  
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8')

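I'm back to the unparsed markup.

I'd like to get to the point where I have clean text saved as a string, so that I can then find specific things in the text (for example, "puppies" in the text above).

Basically, I'm running around in circles here. Can anyone help? As always, thanks so much for any help you can give.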

Source

2012-03-18 spikem

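The encoding isn't broken. It's exactly what it's supposed to be. '\xa0' is the Unicode escape for a non-breaking space (U+00A0).

If you want to encode this (Unicode) string as ASCII, you can tell the codec to ignore any characters it doesn't understand: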

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 
>>> x.encode('ascii', 'ignore') 
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do, 9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while browsing their site, me: srsly, Erica: unless of course your writing is magic, me: My writing saves drowning puppies, Just plucks him right out and gives them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
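Note that ignoring the non-breaking spaces glues neighbouring words together ("PMErica" in the output above). A minimal sketch of two alternatives that keep the word boundaries is shown below. It assumes Python 3, where every str is already Unicode, and uses a shortened, illustrative sample of the text from the question rather than the full string.

```python
import unicodedata

# Shortened, illustrative sample of the text from the question (a Python 3 str).
text = "9:16 PM\xa0Erica: with images, \xa0me: My writing saves drowning puppies"

# Dropping non-ASCII characters removes the no-break spaces entirely,
# so "PM" and "Erica" end up glued together.
dropped = text.encode("ascii", "ignore").decode("ascii")

# Replacing U+00A0 with a regular space keeps the word boundaries.
replaced = text.replace("\xa0", " ")

# NFKC normalization also maps U+00A0 to an ordinary space.
normalized = unicodedata.normalize("NFKC", text)

print(dropped)
print(replaced)
print("puppies" in replaced)
```

Which variant fits depends on whether other non-ASCII characters in the real data need to survive; `replace` touches only the non-breaking space, while NFKC normalization also rewrites other compatibility characters.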

If you have the time, you should watch Ned Batchelder's recent video Pragmatic Unicode. It will make everything clear and simple!
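As for the goal of ending up with clean, searchable text: the leading "[", the trailing "]" and the ", " separators in the question's output appear to come from calling str on the list that find_all returns. A rough sketch that avoids that round trip is shown below. It assumes Python 3 with the bs4 package installed, and the html string is a made-up stand-in for the text/html part of the message, not the asker's actual data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the text/html part of the message.
html = (
    "<html><body>"
    "<div>9:16 PM\xa0Erica: with images</div>"
    "<div>\xa0me: My writing saves drowning puppies</div>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Call get_text() on each div instead of converting the result list
# to a string and parsing it a second time.
lines = [div.get_text() for div in soup.find_all("div")]

# Join the pieces and turn no-break spaces into regular spaces.
clean_text = "\n".join(lines).replace("\xa0", " ")

# The result is an ordinary string that can be searched directly.
print("puppies" in clean_text)
```

Because clean_text is a plain str, there is no need to encode it to UTF-8 before searching; encoding only matters when writing the text out as bytes.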

Source

2012-03-18 19:59:05 katrielalex

Yes, I just posted. "Broken" was a bit strong, so I'm editing it now. And thanks for the video. Are there any text resources I could read as well? (I know these are just a Google search away, but is there one in particular that you like?) – spikem

@spikem What do you expect? You have a string with non-ASCII characters (non-breaking spaces) in it. You can't just magic them away. – katrielalex

I don't think I asked for that, or expected them to be magicked away. I just don't know much about Unicode. – spikem
