Python 3.6 character encoding problem - V2EX



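I'm planning to write a crawler that monitors a web page and, whenever the page is updated, collects the new content and emails it to me. I got stuck right at the start...

Environment + IDE: Win10, Python 3.6.4, VS Code
URL to monitor: http://www.wh-ccic.com.cn/node_13613.htm
The content I need is the images for each month, and all the images on the pages http://www.wh-ccic.com.cn/content/2018-05/08/content_443454.htm and http://www.wh-ccic.com.cn/content/2018-05/08/content_443453.htm, stored in one folder per month.

Problem: the Chinese text in the extracted month labels comes out garbled, e.g.: 201805

I checked the page source and it does declare charset=, and I'm on Python 3.6, so I'm puzzled about why the text is garbled. Testing the XPath in the Chrome console works fine:

I then searched Baidu and Google at length. Most results say it is an encoding problem, and reading article after article about encodings made my head spin, but none of the suggested methods fixed it. So I'm posting here to see whether anyone has run into the same problem.

What I've tried:

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
Converting the encoding: text.encode('utf-8').decode('unicode_escape')

PS: when I print the text returned by requests.get(), all the Chinese is garbled.

Here is the test demo: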

import requests
'''
import re
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
'''
from lxml import html

url = 'http://www.wh-ccic.com.cn/node_13613.htm'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
base_url = 'http://www.wh-ccic.com.cn'
page = requests.get(url, headers=headers)
tree = html.fromstring(page.text)
print(page.text)
all_a = tree.xpath('.//*[@class="STYLE13"]/a')
for a in all_a:
    # print(a.attrs['href'])
    # href = a.attrs['href']
    # title = a.text.replace(u'\xe5', u' ')
    href = a.attrib['href']
    title = a.text
    if '\\u' in title:
        title = title.encode('utf-8').decode('unicode_escape')
    print(title)

11 replies 2018-05-11 20:34:17 +08:00

1

lifeishort

May 11, 2018 via iPhone

2

Right after the line that starts with page = requests, add one more line: page.encoding = 'utf-8'

2

fushall

May 11, 2018

1

The reply above is right.

3

Sylv

May 11, 2018

2

requests detects the encoding from the response headers.
Although this page declares a charset in its source, it does not declare one in the 'Content-Type' header:
Content-Type: text/html

So requests fell back to the default encoding 'ISO-8859-1':
>>> page.encoding
'ISO-8859-1'

That is why the text came out garbled.

The fix is to manually set the encoding that requests uses to decode this page:
>>> page.encoding = 'utf-8'
>>> tree = html.fromstring(page.text)
>>> title = tree.xpath('.//*[@class="STYLE13"]/a')[0].text
>>> title
'2018 年 05 期'
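
To see why the wrong default produces this kind of garbling, here is a small sketch that needs no network access. The byte string stands in for page.content, and repair_mojibake is a hypothetical helper name used only for illustration, not part of requests.

```python
# Simulate decoding UTF-8 bytes with the wrong codec (ISO-8859-1).
raw = '2018 年 05 期'.encode('utf-8')   # stands in for page.content

garbled = raw.decode('iso-8859-1')      # what page.text holds with the wrong default
correct = raw.decode('utf-8')           # what page.text holds once the encoding is 'utf-8'


def repair_mojibake(text, wrong='iso-8859-1', right='utf-8'):
    """Undo a wrong decode: re-encode with the wrong codec, then decode with the right one."""
    return text.encode(wrong).decode(right)


print(garbled == correct)                   # False
print(repair_mojibake(garbled) == correct)  # True
```

The round trip works here because ISO-8859-1 maps every byte to exactly one character, so no information is lost. Setting page.encoding before reading page.text is still the simpler fix.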

4

Sylv

May 11, 2018

2

Here is what the requests documentation says about encodings:

Encodings

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding.

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

http://docs.python-requests.org/en/latest/user/advanced/#encodings
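
The rule quoted above can be approximated with the standard library alone. This is a simplified sketch of the header check, not the actual requests code; encoding_from_content_type is a hypothetical name chosen for this example.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Return the charset implied by a Content-Type header value, or None.

    Simplified approximation of the behaviour described in the quoted docs:
    an explicit charset wins; a text/* type without charset means ISO-8859-1.
    """
    if not content_type:
        return None
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_param('charset')
    if charset:
        return charset
    if msg.get_content_maintype() == 'text':
        return 'ISO-8859-1'
    return None


print(encoding_from_content_type('text/html'))                  # ISO-8859-1
print(encoding_from_content_type('text/html; charset=utf-8'))   # utf-8
```

The first call corresponds to the page in this thread: the header says only text/html, so the fallback applies even though the HTML source itself declares a charset.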

5

sjmcefc2

May 11, 2018

1

I've run into encoding problems too, and I still don't understand the difference between 2 and 3. I just find 2 easier to use.

6

Hopetree

May 11, 2018

1

Reply #1 is the correct answer.

7

janxin

May 11, 2018

1

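@sjmcefc2 2 is definitely the harder one to use... That said, the OP's problem is mainly that requests insists on following the standard and does not handle some non-standard corner cases (which, admittedly, are mostly the fault of half-competent programmers). https://github.com/requests/requests/issues/1737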
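
One way to cover this corner case yourself is to look for the charset the page declares in its own markup when the header gives none. The sketch below is an assumption about how that could be done with a regular expression; charset_from_meta and the 2048-byte scan window are choices made for this example, not anything requests does for you.

```python
import re

# Finds charset=... inside a <meta> tag, e.g.
#   <meta charset="utf-8">
#   <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset=["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def charset_from_meta(body, scan_bytes=2048):
    """Return the charset declared in the first scan_bytes of raw HTML, or None."""
    match = META_CHARSET.search(body[:scan_bytes])
    return match.group(1).decode('ascii') if match else None


sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
print(charset_from_meta(sample))  # utf-8
```

With a requests response this could be applied as page.encoding = charset_from_meta(page.content) or page.encoding, before page.text is read.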

8

yonoho

May 11, 2018

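Many sites have this problem. For sites in China, defaulting to utf-8 is usually fine. When you really can't tell the encoding, you can also try codecs one at a time with page.content.decode($codec).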
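
Trying codecs one by one can be wrapped in a small loop. This is a minimal sketch; the function name and the candidate list are assumptions for illustration and should be adapted to the sites being crawled.

```python
def decode_with_candidates(data, candidates=('utf-8', 'gb18030', 'big5')):
    """Try each codec in order and return (text, codec) for the first that decodes cleanly."""
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate codecs could decode the data')


# Bytes encoded as GBK are not valid UTF-8, so the loop moves on to gb18030.
text, codec = decode_with_candidates('中文'.encode('gbk'))
print(codec)  # gb18030
```

Order matters: strict codecs such as utf-8 should come first, because a permissive codec like ISO-8859-1 accepts any byte sequence and would "succeed" with garbled text.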

9

Meli55a

OP

May 11, 2018

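@Sylv Since the Chinese in the returned text was already garbled, I also suspected that requests was the problem, but I didn't know how to fix it and couldn't find detailed API documentation. I even tried passing an encoding argument to the get method, only to find there is no such parameter. It never occurred to me to work around it by simply adding one more line...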

10

sjmcefc2

May 11, 2018

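@janxin In 2, what you read in seems to be encoded bytes, while 3 seems to default to unicode, so calling chardet on it fails. My problem is even stranger: utf-8 and gbk mixed together, with several encodings inside a single line... I have never found a really clean solution.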

11

lifeishort

May 11, 2018 via iPhone

@sjmcefc2 When you don't know the encoding, try
resp.encoding=resp.apparent_encoding
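
If you would rather not overwrite an encoding the server stated explicitly, the same idea can be applied conditionally. The helper below is a hypothetical sketch: it treats a missing value or the ISO-8859-1 fallback as "the header told us nothing" and only then prefers the detected encoding. That heuristic is an assumption and would misfire on a page that really is ISO-8859-1 but is detected as something else.

```python
def pick_encoding(header_encoding, apparent_encoding):
    """Prefer the detected encoding only when the header value is missing or the ISO-8859-1 fallback."""
    if header_encoding is None or header_encoding.lower() == 'iso-8859-1':
        return apparent_encoding or header_encoding
    return header_encoding


print(pick_encoding('ISO-8859-1', 'utf-8'))  # utf-8
print(pick_encoding('gbk', 'utf-8'))         # gbk
```

With a requests response this would be used as resp.encoding = pick_encoding(resp.encoding, resp.apparent_encoding) before reading resp.text.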
