I am using BeautifulSoup to scrape website info. Specifically, I want to gather information on patents from Google (title, inventors, abstract, etc.). I have a list of URLs, one for each patent, but BeautifulSoup is having trouble with some of the sites, giving me the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
Below is the error traceback:
Traceback (most recent call last):
    soup = BeautifulSoup(the_page, from_encoding='utf-8')
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90597)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99984)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99807)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9383)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:95945)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
I checked the encoding of the site, and it claims to be 'utf-8'. I specified that as the input encoding to BeautifulSoup as well. Below is my code:
import urllib, urllib2
from bs4 import BeautifulSoup

#url = 'https://www.google.com/patents/WO2001019016A1?cl=en'  # this one works
url = 'https://www.google.com/patents/WO2006016929A2?cl=en'   # this one doesn't work

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'somebody',
          'location': 'somewhere',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

print response.headers['content-type']
print response.headers.getencoding()

soup = BeautifulSoup(the_page, from_encoding='utf-8')
I have included two URLs: one results in the error and the other works fine (they are labeled as such in the comments). In both cases I could print the HTML to the terminal fine, but BeautifulSoup consistently crashed.
Any recommendations? This is my first use of BeautifulSoup.
You should encode the string in UTF-8:

soup = BeautifulSoup(the_page.encode('utf-8'))
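If that call itself raises a UnicodeDecodeError (on Python 2, calling .encode() on a plain byte string implicitly decodes it as ASCII first), a minimal alternative sketch is to decode the response yourself and replace the invalid byte sequences before parsing. The URL and User-Agent below are taken from the question; the rest is just one possible way to wire it up, not the only fix.

import urllib2
from bs4 import BeautifulSoup

# URL and User-Agent taken from the question above.
url = 'https://www.google.com/patents/WO2006016929A2?cl=en'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

req = urllib2.Request(url, None, headers)
the_page = urllib2.urlopen(req).read()

# Decode the raw bytes explicitly, substituting U+FFFD for any invalid
# UTF-8 sequences, so the parser never sees the offending 0xcc byte.
unicode_page = the_page.decode('utf-8', 'replace')

# BeautifulSoup accepts unicode directly; no from_encoding is needed.
soup = BeautifulSoup(unicode_page)
print soup.title.get_text()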