Python Web Scraping, Week 1

My body knows neither hunger nor cold, so heaven has never failed me; if my learning makes no progress, how can I face heaven?

Installing lxml

1. First install wheel: pip install wheel
2. Download the matching .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml (do not rename the file!).
   The "cp" tag in the filename is the Python version: cp27 means Python 2.7. Pick the download that matches your Python version.
3. Install it with pip install followed by the complete filename, including the .whl suffix.
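
To confirm the wheel installed correctly, here is a minimal check from the Python interpreter (the version tuple printed depends on the wheel you downloaded):

# Quick sanity check that lxml is importable after installing the wheel
from lxml import etree
print(etree.LXML_VERSION)  # a version tuple such as (3, x, y, z), depending on the wheel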

Example 1

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and decode the response bytes to text
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
soup = BeautifulSoup(resp, 'lxml')

# Select the <title> tag and print its text
titles = soup.select('title')
for title in titles:
    print(title.get_text())

Example 2

from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and decode the response bytes to text
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
soup = BeautifulSoup(resp, 'lxml')

# Select every <img> tag and print its src attribute
images = soup.select('img')
for image in images:
    print(image.get('src'))

Example 3

from bs4 import BeautifulSoup
import requests

url = 'http://zhuanzhuan.58.com/detail/789842404801118212z.shtml'

def getInfo(url):
    # Download the listing page and parse it with lxml
    wpData = requests.get(url)
    soup = BeautifulSoup(wpData.text, 'lxml')
    # Pull out the title, current price, and quality tags via CSS selectors
    title = soup.title.text
    price = soup.select('.price_now > i')
    tags = soup.select('.qual_label')
    # Only the quality tags are returned; title and price are extracted for illustration
    return tags

data = getInfo(url)
for tag in data:
    print(tag.text)

Passing HTTP headers

from bs4 import BeautifulSoup
import requests

url = 'http://www.tripadvisor.cn/Saves#525792'

# Send a browser User-Agent plus a logged-in Cookie so the server returns the
# same page a real browser session would see (values truncated here)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWeb....36',
    'Cookie': 'TASSK=eNs%2BEc2TG.....0N9BrZpKwsTH%2BAwg%3D...; NPID='
}

wpData = requests.get(url, headers=headers)
soup = BeautifulSoup(wpData.text, 'lxml')
titles = soup.select('div.title')  # titles of the saved items
print(soup)  # print the full page to verify the cookie returned the logged-in content

Building multi-page URLs with a request delay

from bs4 import BeautifulSoup
import requests
import time

# Build the paginated attraction-list URLs (the offset steps by 30: 30, 60, ..., 900)
urls = ['http://www.tripadvisor.cn/...-oa{}-New_Y...rk.html#ATTRACTION_LIST'.format(str(i)) for i in range(30, 930, 30)]

for url in urls:
    time.sleep(2)  # pause 2 seconds between requests so we don't hammer the server
    print(url)

Asynchronous loading

In Chrome's developer tools (Network panel, XHR filter), look for request URLs that follow a regular pattern; a sketch of fetching one such endpoint is shown below.
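
Once a patterned XHR URL is found, it can usually be requested directly and returns JSON. The sketch below is only an illustration: the endpoint and its page parameter are hypothetical placeholders, not taken from these notes.

import requests
import time

# Hypothetical XHR endpoint spotted in the Chrome Network panel (XHR filter);
# the URL and its page parameter are placeholders for illustration only.
xhr_url = 'http://example.com/api/list?page={}'

for page in range(1, 4):
    resp = requests.get(xhr_url.format(page))
    data = resp.json()  # XHR endpoints typically return JSON
    print(data)         # inspect the structure, then pick out the fields you need
    time.sleep(2)       # same politeness delay as above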

Tips

  • If the desktop (PC) site does not expose the data directly, try the mobile site instead; it often returns the same data in a simpler form (see the sketch below).
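
Many sites serve a separate mobile page when they see a mobile User-Agent (or on an m. subdomain). The sketch below sends a mobile User-Agent as one way to reach the mobile version; the URL is a placeholder and the User-Agent string is just one common example.

import requests
from bs4 import BeautifulSoup

# Pretend to be a mobile browser; this UA string is one common example, not the only option
mobile_headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) '
                  'AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1'
}

resp = requests.get('http://example.com/some-page', headers=mobile_headers)  # placeholder URL
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title)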