I am using BeautifulSoup to scrape website info. Specifically, I want to gather information on patents from Google (title, inventors, abstract, etc.). I have a list of URLs, one for each patent, but BeautifulSoup is having trouble with some of the sites, giving me the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
Below is the error traceback:
Traceback (most recent call last):
    soup = BeautifulSoup(the_page, from_encoding='utf-8')
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90597)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99984)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99807)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9383)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:95945)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
I checked the encoding of the site, and it claims to be 'utf-8'. I specified that as the input encoding to BeautifulSoup as well. Below is my code:
import urllib, urllib2
from bs4 import BeautifulSoup

#url = 'https://www.google.com/patents/WO2001019016A1?cl=en'  # this one works
url = 'https://www.google.com/patents/WO2006016929A2?cl=en'   # this one doesn't work

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'somebody',
          'location': 'somewhere',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

print response.headers['content-type']
print response.headers.getencoding()

soup = BeautifulSoup(the_page, from_encoding='utf-8')
I have included two URLs: one results in the error and the other works fine (they are labeled as such in the comments). In both cases I could print the HTML to the terminal fine, but BeautifulSoup consistently crashed.
Any recommendations? This is my first use of BeautifulSoup.
You should encode the string in UTF-8:

soup = BeautifulSoup(the_page.encode('utf-8'))
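If that call itself raises a UnicodeDecodeError (on Python 2, calling .encode() on a plain byte string implicitly decodes it as ASCII first), a minimal alternative sketch is to decode the response yourself and replace the invalid byte sequences before parsing. The URL and User-Agent below are taken from the question; the rest is just one possible way to wire it up, not the only fix.

import urllib2
from bs4 import BeautifulSoup

# URL and User-Agent taken from the question above.
url = 'https://www.google.com/patents/WO2006016929A2?cl=en'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

req = urllib2.Request(url, None, headers)
the_page = urllib2.urlopen(req).read()

# Decode the raw bytes explicitly, substituting U+FFFD for any invalid
# UTF-8 sequences, so the parser never sees the offending 0xcc byte.
unicode_page = the_page.decode('utf-8', 'replace')

# BeautifulSoup accepts unicode directly; no from_encoding is needed.
soup = BeautifulSoup(unicode_page)
print soup.title.get_text()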