Purpose

A roundup of commonly used libraries and APIs.

requests

bug

requests.exceptions.InvalidHeader: Invalid return character or leading space in header: Host

The cause is a leading space in a header value pasted from the browser:

"Host": " xxxx.com",

Removing the leading space gives the correct header:

"Host": "xxxx.com",
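
Stray whitespace is easy to miss when headers are pasted from the browser's developer tools. A minimal sketch of a helper that strips it before the headers are handed to requests; the name clean_headers is hypothetical and not part of requests, and the sample values are made up:

```python
def clean_headers(raw_headers):
    """Return a copy of the headers with surrounding whitespace removed.

    clean_headers is a hypothetical helper for illustration,
    not part of the requests API.
    """
    return {name.strip(): value.strip() for name, value in raw_headers.items()}


# Headers as they might look after being pasted from the browser
pasted = {"Host": " xxxx.com", "Accept": "text/html "}
headers = clean_headers(pasted)
print(headers)
```

The cleaned dictionary can then be passed as headers= to requests.get or to a session.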

session

Bypassing the system proxy

Bypass the proxy for specific domains

import os
import requests

# Hosts listed in NO_PROXY are requested directly, bypassing any configured proxy
os.environ['NO_PROXY'] = 'stackoverflow.com'

response = requests.get('http://www.stackoverflow.com')

Set the proxies to None

import requests
 
proxies = {
  "http": None,
  "https": None,
}
 
requests.get("http://example.org", proxies=proxies)

Set trust_env = False

import requests
 
session = requests.Session()
session.trust_env = False
 
response = session.get('http://www.stackoverflow.com')

BeautifulSoup

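Introduction

BeautifulSoup is commonly used to parse HTML text, extract the DOM elements of interest, and manipulate them (add, modify, delete).

Import

from bs4 import BeautifulSoup

Common APIs

Adding DOM elements

BeautifulSoup makes adding DOM elements straightforward.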

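from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body></body></html>", "html.parser")
month = "2021-01"  # placeholder value for illustration

# Create a new DOM tag
new_table = soup.new_tag("table")
# Set the attributes of the new tag
new_table.attrs = {"src": f"{month}.html", "height": "602", "width": "996"}
# Append a child row to the new tag, then insert the tag into the document
new_table.append(soup.new_tag("tr"))
soup.body.append(new_table)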

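Finding HTML tags directly

For simple lookups, tags can be accessed directly as attributes of the soup object. For example, the h1 tag under the body tag can be found like this:

soup.body.h1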

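find and find_all

In general, find is used when only one matching element is expected, and find_all when several matching elements are needed. Apart from the return type (find returns a single element or None, find_all returns a list of elements), the two are used in the same way.

Find an element with a specific class name:

soup.find(class_="table_container")

Find an element with a specific attribute value and return the text inside the tag:

row.find(attrs={"data-stat": dataStat}).text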
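
The difference in return types is easiest to see on a small document. A minimal sketch, assuming bs4 is installed; the sample HTML and the data-stat values are invented for illustration:

```python
from bs4 import BeautifulSoup

# Sample HTML invented for this illustration
html = """
<body>
  <h1>Stats</h1>
  <div class="table_container">
    <table>
      <tr><td data-stat="goals">3</td><td data-stat="assists">1</td></tr>
      <tr><td data-stat="goals">5</td><td data-stat="assists">2</td></tr>
    </table>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first matching tag
title = soup.body.h1.text

# find returns the first match, or None when nothing matches
container = soup.find(class_="table_container")
missing = soup.find(class_="no_such_class")

# find_all returns a list-like collection of every match
rows = container.find_all("tr")
goals = [row.find(attrs={"data-stat": "goals"}).text for row in rows]

print(title, goals, missing)
```

Because find can return None, it is worth checking the result before reading .text from it.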

selenium

Using selenium on a headless server

xpath

Introduction

XPath is a very handy selector for DOM elements.

Basic usage

Run the following in the browser console:

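// Select the href attribute of every <a> element directly under an h4 tag and print the values to the console
// Right-click the result and copy the object to copy the array, which can be assigned directly to a Python object
$x("//h4/a/@href").map(item => item.value)

// Select every href attribute inside the element whose id is rongqi_2
// Right-click the result and copy the object to copy the array, which can be assigned directly to a Python object
$x("//*[@id='rongqi_2']//@href").map(item => item.value)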
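
A similar query can be run from Python. A minimal sketch using the standard library's xml.etree.ElementTree, which only supports a limited XPath subset and requires well-formed markup. It cannot select attributes such as @href directly, so the elements are selected first and the attribute is read afterwards. The sample page is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup invented for this illustration
PAGE = """
<html><body>
  <h4><a href="/post/1">First</a></h4>
  <h4><a href="/post/2">Second</a></h4>
  <div id="rongqi_2"><p><a href="/inner/1">Inner</a></p></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Roughly //h4/a/@href: select the <a> elements, then read the attribute
h4_links = [a.get("href") for a in root.findall(".//h4/a")]

# Roughly //*[@id='rongqi_2']//@href: descendants of the container that carry an href
container_links = [
    el.get("href") for el in root.findall(".//*[@id='rongqi_2']//*[@href]")
]

print(h4_links)
print(container_links)
```

Real-world HTML is often not well-formed, in which case a dedicated HTML parser with XPath support is the more practical choice.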

References

XPath selectors

URL decoding (unquoting)

from urllib.parse import unquote
unquote("xxx")
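
To make the effect concrete, here is a small sketch of unquote on a percent-encoded string; the sample strings are made up for illustration:

```python
from urllib.parse import quote, unquote, unquote_plus

# Percent-encoded text as it might appear in a URL
encoded = "a%20b%26c%3D1"
decoded = unquote(encoded)
print(decoded)  # a b&c=1

# unquote leaves "+" untouched; unquote_plus also turns "+" into a space,
# which matches how HTML form data is encoded
plus_kept = unquote("a+b")
plus_decoded = unquote_plus("a+b")

# quote is the inverse operation, so a round trip returns the original text
round_trip = unquote(quote("a b&c=1"))
```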

scrapy

Asynchrony and multithreading in scrapy

browser_cookie3

Reads the cookies stored locally by Chrome.

GitHub repository

PermissionError Cookies

In the Chrome shortcut, open Properties > Shortcut > Target and append --disable-features=LockProfileCookieDatabase to the target.

Broken decryption in Chrome

I ran into this problem with Chrome, but reading Firefox's cookies works fine.

ednow

Summary of Python web scraping techniques
