Foreign-Trade Independent Site (Japan)

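Simulating a Zhihu login with Python

Posted 2016-11-20, updated 2018-04-25

I am neither hungry nor cold, so heaven has never failed me; if my learning makes no progress, how can I face heaven?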

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: LoveNight

import requests
import time
import json
import os
import re
import sys
import subprocess
from bs4 import BeautifulSoup as BS


class ZhiHuClient(object):

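    """Helper class for connecting to Zhihu; it maintains one Session.
    2015.11.11

    Usage:

    client = ZhiHuClient()

    # Call login() once on first use to generate the cookie file;
    # later runs can skip this step.
    client.login("username", "password")

    # Use this session for further requests (see the requests library docs).
    session = client.getSession()
    """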

    # The URL parameter is the account type
    TYPE_PHONE_NUM = "phone_num"
    TYPE_EMAIL = "email"
    loginURL = r"http://www.zhihu.com/login/{0}"
    homeURL = r"http://www.zhihu.com"
    captchaURL = r"http://www.zhihu.com/captcha.gif"

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
    }

    captchaFile = os.path.join(sys.path[0], "captcha.gif")
    cookieFile = os.path.join(sys.path[0], "cookie")

    def __init__(self):
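        os.chdir(sys.path[0])  # make the script's directory the working directory

        self.__session = requests.Session()
        # Reference the class attribute through self so a later class rename cannot break it;
        # update() keeps the default headers that requests sets up
        self.__session.headers.update(self.headers)
        # If a cookie file already exists, log in with it directly
        self.__cookie = self.__loadCookie()
        if self.__cookie:
            print("Cookie file found, logging in with the saved cookie")
            self.__session.cookies.update(self.__cookie)
            soup = BS(self.open(r"http://www.zhihu.com/").text, "html.parser")
            name = soup.find("span", class_="name")
            if name is not None:
                print("Logged in as: %s" % name.getText())
            else:
                print("Could not find the account name; the cookie may have expired")
        else:
            print("No cookie file found, call login() once to log in!")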

    # Log in
    def login(self, username, password):
        """
        Response when the captcha is wrong:
        {'errcode': 1991829, 'r': 1, 'data': {'captcha': '请提交正确的验证码 :('}, 'msg': '请提交正确的验证码 :('}
        (the message means "please submit the correct captcha")
        Response on success:
        {'r': 0, 'msg': '登陆成功'}
        (the message means "login successful")
        """
        self.__username = username
        self.__password = password
        self.__loginURL = self.loginURL.format(self.__getUsernameType())
        # Open any page to obtain the _xsrf token required for login
        html = self.open(self.homeURL).text
        soup = BS(html, "html.parser")
        _xsrf = soup.find("input", {"name": "_xsrf"})["value"]
        # Download the captcha image
        while True:
            captcha = self.open(self.captchaURL).content
            with open(self.captchaFile, "wb") as output:
                output.write(captcha)
            # The captcha is read by a human
            print("=" * 50)
            print("The captcha image has been opened, please read it!")
            # Passing the file path to the shell opens the image viewer on Windows only
            subprocess.call(self.captchaFile, shell=True)
            captcha = input("Enter the captcha: ")
            os.remove(self.captchaFile)
            # Send the POST request
            data = {
                "_xsrf": _xsrf,
                "password": self.__password,
                "remember_me": "true",
                self.__getUsernameType(): self.__username,
                "captcha": captcha
            }
            res = self.__session.post(self.__loginURL, data=data)
            print("=" * 50)
            # print(res.text)  # print the raw response, for debugging
            if res.json()["r"] == 0:
                print("Login succeeded")
                self.__saveCookie()
                break
            else:
                print("Login failed")
                print("Error message --->", res.json()["msg"])

    def __getUsernameType(self):
        """Determine the username type.
        In testing, the site treats an all-digit name as phone_num and anything else as email.
        """
        if self.__username.isdigit():
            return self.TYPE_PHONE_NUM
        return self.TYPE_EMAIL

    def __saveCookie(self):
        """Serialize the cookies to a file,
        i.e. convert the dict to a string and save it.
        """
        with open(self.cookieFile, "w") as output:
            cookies = self.__session.cookies.get_dict()
            json.dump(cookies, output)
            print("=" * 50)
            print("Cookie file created in the script directory:", self.cookieFile)

    def __loadCookie(self):
        """Read the cookie file and return the deserialized dict, or None if there is no file."""
        if os.path.exists(self.cookieFile):
            print("=" * 50)
            with open(self.cookieFile, "r") as f:
                cookie = json.load(f)
                return cookie
        return None

    def open(self, url, delay=0, timeout=10):
        """Open a page and return the Response object."""
        if delay:
            time.sleep(delay)
        return self.__session.get(url, timeout=timeout)

    def getSession(self):
        return self.__session

if __name__ == '__main__':
    client = ZhiHuClient()

    # Call login() once on first use to generate the cookie file;
    # later runs can skip this step.
    client.login("username", "password")

    # Use this session for further requests (see the requests library docs).
    session = client.getSession()
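
The cookie persistence done by `__saveCookie` and `__loadCookie` can be tried on its own, without touching the network. A minimal sketch, assuming the cookies are a plain dict of strings; `save_cookies`, `load_cookies`, the temporary path and the sample values are hypothetical names made up for illustration, not part of the original class:

```python
import json
import os
import tempfile


def save_cookies(path, cookies):
    """Serialize a dict of cookie name/value pairs to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookies, f)


def load_cookies(path):
    """Return the saved cookie dict, or None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round trip through a temporary directory (the cookie values are invented)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "cookie")

missing = load_cookies(demo_path)                  # no file yet
save_cookies(demo_path, {"z_c0": "token", "_xsrf": "abc"})
restored = load_cookies(demo_path)                 # same dict back

print(missing, restored)
```

A dict restored this way can be handed to `session.cookies.update(...)`, which is what the class does in `__init__`.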

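The difference between the MySQL collations utf8_unicode_ci and utf8_general_ci

Posted 2016-11-07, updated 2018-03-29

One day, out of nowhere, you remember someone: she once made you look forward to tomorrow, yet she never appeared in any of your tomorrows.

Overview

MySQL supports more than 70 collations for more than 30 character sets. The character sets and their default collations can be listed with the SHOW CHARACTER SET statement.
ci stands for case insensitive: a and A are treated as equal in comparisons.
bin stands for binary: a and A are treated as different.
For example, if you run:

SELECT * FROM `table` WHERE txt = 'a';

then under utf8_bin the row where txt = 'A' is not returned, whereas under utf8_general_ci it is.
utf8_general_ci is case-insensitive, which is what you want for usernames and email addresses at registration.
utf8_general_cs would be the case-sensitive counterpart (stock MySQL does not appear to ship a collation by that name; utf8_bin is the usual case-sensitive choice). Using a case-sensitive collation for usernames and email addresses causes problems, because Alice and alice become two different accounts.
utf8_bin: every character of a string is compared and stored by its binary value. It is case-sensitive and can also hold binary content.

In a nutshell

Under general, ß = s is true, whereas under unicode, ß = ss is true. The two differ mainly for German and French, so for Chinese-language applications general is the usual choice because it is faster. Use unicode only if you need more accurate comparison and sorting for German and French text.
utf8_unicode_ci is more accurate; utf8_general_ci is faster.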
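
The difference between binary and case-insensitive comparison can be imitated in plain Python. This is only an analogy, not MySQL's actual algorithm: `equal_bin` and `equal_ci` are hypothetical helper names, and Python's `str.casefold()` is used as a rough stand-in for a `_ci` collation (real MySQL `_ci` collations also ignore accents, which `casefold()` does not).

```python
def equal_bin(a, b):
    """Like a _bin collation: compare the exact characters."""
    return a == b


def equal_ci(a, b):
    """Rough stand-in for a _ci collation: compare after full case folding."""
    return a.casefold() == b.casefold()


print(equal_bin("a", "A"))   # False
print(equal_ci("a", "A"))    # True
print(equal_ci("ß", "ss"))   # True, the same pairing utf8_unicode_ci makes
print(equal_ci("ß", "s"))    # False, unlike utf8_general_ci
```

Full case folding happens to expand ß to ss, which is why it lines up with the unicode rule and not with the general one.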

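MySQL optimization

Posted 2016-11-05, updated 2018-04-25

I foresaw the future, and you were not in it.

Slow queries

Check whether the slow query log is enabled: show variables like 'slow_query_log';
Also log queries that do not use an index: set global log_queries_not_using_indexes=on;
Check the slow query threshold: show variables like 'long_query_time';
Set the slow query threshold (in seconds): set global long_query_time=2;
Enable the slow query log: set global slow_query_log=on;
Find the slow query log location: show variables like 'slow%';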

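Database monitoring

QPS (queries per second)
TPS (transactions per second)
Concurrency: Threads_running
Master-slave replication
Server resources and disk space
The read_only parameter
Maximum connections: max_connections
Current connections: Threads_connected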
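
QPS and TPS are not stored by MySQL as ready-made values; one common way to derive them is to sample the cumulative counters from SHOW GLOBAL STATUS twice and divide the difference by the elapsed time. A minimal sketch, assuming QPS is taken from the `Questions` counter and TPS from `Com_commit` plus `Com_rollback`; the function names and the sample numbers are made up for illustration:

```python
def rate_per_second(before, after, elapsed_seconds):
    """Average increase per second of a cumulative counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (after - before) / elapsed_seconds


def qps(sample1, sample2, elapsed_seconds):
    """Queries per second between two status samples."""
    return rate_per_second(sample1["Questions"], sample2["Questions"], elapsed_seconds)


def tps(sample1, sample2, elapsed_seconds):
    """Transactions per second: commits plus rollbacks between two samples."""
    before = sample1["Com_commit"] + sample1["Com_rollback"]
    after = sample2["Com_commit"] + sample2["Com_rollback"]
    return rate_per_second(before, after, elapsed_seconds)


# Two invented samples taken 10 seconds apart
first = {"Questions": 1000, "Com_commit": 200, "Com_rollback": 10}
second = {"Questions": 1600, "Com_commit": 290, "Com_rollback": 20}

print(qps(first, second, 10))  # 60.0
print(tps(first, second, 10))  # 10.0
```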

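CentOS system parameter tuning

Kernel parameters: /etc/sysctl.conf

net.core.somaxconn = 65535
net.core.netdev_max_backlog = 65535
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1 (this option was removed in Linux 4.12 and is known to break clients behind NAT, so use it with care)
Memory
kernel.shmmax = 42955555 (set this large enough for a single shared memory segment to hold the entire InnoDB buffer pool)
Resource limits: /etc/security/limits.conf (open-file limits)
Disk I/O scheduler: /sys/block/devname/queue/scheduler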

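Multiprocessing and multithreading in Python

Posted 2016-10-26, updated 2018-03-27

I once heard someone say that when you can no longer have something, the only thing you can do is make sure you do not forget.

### Overview

As many of you have heard, modern operating systems such as Mac OS X, UNIX, Linux and Windows all support "multitasking".

What does "multitasking" mean? Simply put, the operating system can run several tasks at the same time. For example, you browse the web while listening to an MP3 and rushing through homework in Word: that is multitasking, with at least three tasks running at once. Many more tasks run quietly in the background; they simply do not show on the desktop.

Multi-core CPUs are now commonplace, but even the single-core CPUs of the past could run multiple tasks. Since a CPU executes code sequentially, how does a single-core CPU do that?

The answer is that the operating system lets the tasks take turns: task 1 runs for 0.01 seconds, then it switches to task 2 for 0.01 seconds, then to task 3 for 0.01 seconds, and so on. Strictly speaking the tasks alternate, but the CPU is so fast that they all appear to run at the same time.

True parallel execution is only possible on a multi-core CPU. Because there are usually far more tasks than cores, though, the operating system still schedules many tasks in turn on each core.

To the operating system, one task is one process (Process). Opening a browser starts a browser process, opening Notepad starts a Notepad process, opening two Notepads starts two Notepad processes, and opening Word starts a Word process.

Some processes do more than one thing at a time. Word, for example, can handle typing, spell checking and printing simultaneously. To do several things at once inside one process, it must run several "subtasks" concurrently; these subtasks within a process are called threads (Thread).

Since every process has to do at least one thing, a process has at least one thread. A complex process such as Word can have several threads running at the same time. Multithreading works the same way as multiprocessing: the operating system switches rapidly among the threads so that each runs briefly in turn, which looks simultaneous. Truly simultaneous execution of multiple threads, again, requires a multi-core CPU.

All the Python programs we have written so far are single-task processes, that is, they have only one thread. What if we want to run several tasks at once?

There are two solutions:

One is to start multiple processes. Each process has only one thread, but together the processes can run several tasks.

The other is to start one process and start multiple threads inside it, so that the threads together run several tasks.

There is also a third option: start multiple processes, each with multiple threads, so that even more tasks run at once. This model is more complex and is rarely used in practice.

To sum up, there are three ways to implement multitasking:

Multi-process mode;
Multi-thread mode;
Multi-process plus multi-thread mode.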
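
The first two modes can be sketched with the standard library's `concurrent.futures` module. This is a minimal illustration, not the only way to do it; `square`, `run_with_threads` and `run_with_processes` are hypothetical names chosen for the example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def square(n):
    """A stand-in task; real work would be I/O or a heavy computation."""
    return n * n


def run_with_threads(numbers, workers=3):
    """Multi-thread mode: one process, several threads sharing its memory."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, numbers))


def run_with_processes(numbers, workers=3):
    """Multi-process mode: several processes, each with its own interpreter."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, numbers))


thread_results = run_with_threads([1, 2, 3])
print(thread_results)  # [1, 4, 9]

# run_with_processes([1, 2, 3]) returns the same list; call it from under an
# `if __name__ == "__main__":` guard so that child processes can import this
# module safely on platforms that spawn a fresh interpreter.
```

In the standard CPython interpreter, threads take turns executing Python code because of the global interpreter lock, so threads suit I/O-bound work while processes suit CPU-bound work.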

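Crawling and storing large amounts of data with Python

Posted 2016-10-26, updated 2018-03-27

May you have endless wine in your dreams and wake to live out this life gloriously drunk. Take good care of your black hair, your picky stomach and your smiling eyes. I have forgiven my former self, the way one forgives an overambitious idiot, sympathizes with a clumsy warrior, and lets go of an unspeakable secret.

### MongoDB

Installing MongoDB itself is not covered here. To install pymongo:

1. pip install wheel
2. Download the matching .whl file from https://pypi.python.org/pypi/pymongo#downloads
3. pip install <path to the downloaded .whl file>

Inserting data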

import  pymongo
client = pymongo.MongoClient('localhost',27017)
dbname = client['dbname']
tbname = dbname['tbname']
path = 'demo.txt'
with open(path,'r') as f:
    lines = f.readlines()
    for index,line in enumerate(lines):
        data = {
            'index':index,
            'line':line,
            'words':len(line.split())
        }
        tbname.insert_one(data)

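Reading data

import pymongo
client = pymongo.MongoClient('localhost', 27017)
dbname = client['dbname']
tbname = dbname['tbname']  # the collection must be looked up before it is queried
for item in tbname.find():
    print(item)

# Or, with a filter (only documents whose line has no words)
for item in tbname.find({'words': 0}):
    print(item)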

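Large-scale crawling

# A crawl usually covers several list pages, each with many content pages underneath
# The URLs to crawl are stored in the database beforehand
# Typical error handling:
# filter out malformed URLs
# the element you want to scrape may be missing from the page
# page access errors such as 404 or 500 can make the crawler raise an exception
# A monitor script tracks how much data or how many URLs have been crawled so far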
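
Two of the error-handling points above, filtering out bad URLs and reacting to status codes such as 404 and 500, can be sketched with the standard library alone. The helper names and the skip/retry policy are assumptions made for illustration, not something the original crawler defines:

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Accept only absolute http(s) URLs that have a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def filter_urls(urls):
    """Drop malformed URLs and duplicates while keeping the original order."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if is_valid_url(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def classify_status(status_code):
    """Assumed policy: parse 2xx pages, retry 5xx later, skip everything else."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code >= 500:
        return "retry"
    return "skip"


candidates = [
    "http://example.com/a",
    "javascript:void(0)",
    "http://example.com/a ",
    "",
    "https://example.com/b",
]
clean = filter_urls(candidates)
print(clean)
print(classify_status(200), classify_status(404), classify_status(500))
```

In a real crawler the status code would come from the response object (for example `requests.get(url).status_code`), and a "retry" URL would be written back to the queue instead of raising.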

Example code

# Header
from bs4 import BeautifulSoup
import requests
import time
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db58 = client['db58']
tb58 = db58['tb58']
tb58item = db58['tb58item']

# Read the item links on one list page
def get_links_from(channel, pages):
    list_view = '{}/pn{}'.format(channel, str(pages))
    wb_data = requests.get(list_view)
    time.sleep(1)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    if soup.find('td', 't'):
        for link in soup.select('td.t > a.t'):
            item_link = link.get('href').split('?')[0]
            # insert_one() expects a document (dict), not a bare string
            tb58.insert_one({'url': item_link})
            print(item_link)
    else:
        pass

# Walk through multiple pages
def get_all_links_from(channel):
    for num in range(1, 101):
        get_links_from(channel, num)

# Monitor script (a separate file): shows the amount of data in the database in real time.
# It assumes the code above is saved as page_parsing.py.
import time
from page_parsing import tb58item

while True:
    # Cursor.count() was removed in PyMongo 4; count_documents({}) counts every document
    print(tb58item.count_documents({}))
    time.sleep(5)

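Some abstract notes on Python

Posted 2016-10-25, updated 2022-08-03

Beyond the petty scrambling of everyday life, my heart is also home to the stars and the sea.

About imports

from urllib.request import urlopen
# Look up the request module in the urllib package and import only the urlopen function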

Using range

# From 1 up to 5 (5 not included)
for i in range(1,5):
    print(i)

# From 1 up to 5 in steps of 2 (5 not included)
for i in range(1,5,2):
    print(i)

# From 0 up to 5 (5 not included)
for i in range(5):
    print(i)
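
To see at a glance which values each call produces, the three ranges can be turned into lists. A small sketch; the variable names are only for illustration:

```python
# Materialize each range as a list to see exactly which values it yields
one_to_four = list(range(1, 5))
every_other = list(range(1, 5, 2))
zero_to_four = list(range(5))

print(one_to_four)   # [1, 2, 3, 4]
print(every_other)   # [1, 3]
print(zero_to_four)  # [0, 1, 2, 3, 4]
```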

Working with time

# Format the current time
import time
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))

# Pause for 2 seconds between iterations
import time
for i in range(7):
    time.sleep(2)
    print(i)

# Current timestamp
import time
print(time.time())

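Formatting with format

# requests, time, BeautifulSoup and the tb58 collection are set up as in the crawler example
def get_links_from(channel,pages):
    # format() fills each {} placeholder in order: the channel URL, then the page number
    list_view = '{}/pn{}'.format(channel,str(pages))
    wb_data = requests.get(list_view)
    time.sleep(1)
    soup = BeautifulSoup(wb_data.text,'lxml')
    if soup.find('td','t'):
        for link in soup.select('td.t > a.t'):
            item_link = link.get('href').split('?')[0]
            # insert_one() expects a document (dict), not a bare string
            tb58.insert_one({'url': item_link})
            print(item_link)
    else:
        pass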

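if/else structure

number = 23
guess = int(input('Enter an integer: '))  # wait for an integer to be entered
if guess == number:
    print('Congratulations, you guessed it.')  # a new block starts here
    print('(But you do not win any prize!)')  # the new block ends here
elif guess < number:
    print('No, your guess is a little low')  # another block
else:
    print('No, your guess is a little high')
print('Done')
# The last statement always runs after the if statement finishes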

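pass

# 1. An empty statement: do nothing
# 2. Keeps the syntax complete
# 3. Keeps the meaning complete
# Taking the if statement as an example, in C, C++ or Java:
# if (true)
#     ; // do nothing
# else
# {
#     // do something
# }
# And in Python:
if True:
    pass  # do nothing
else:
    print('do something')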

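The map function

# map() takes two arguments, a function and a sequence; it applies the function
# to each element of the sequence in turn.
# In Python 2 the results came back as a new list; in Python 3, map() returns an iterator instead
def f(x):
    return x * x
a = map(f, [1, 2, 3, 4, 5, 6, 7, 8, 9])
print(list(a))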

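Importing functions and variables

# Two .py files in the same folder
# demo.py defines a function and a variable
def haha():
    print('haha')
numbers = [1,2,3,4,5]  # renamed from list so the built-in list type is not shadowed

# test.py uses that function and variable
import demo
demo.haha()
print(demo.numbers)

# You can also import just the function or the variable
from demo import haha
haha()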

Some BeautifulSoup usage

Posted 2016-10-24, updated 2022-04-11

The cheapest thing in the world is the tenderness of a man who has accomplished nothing.

.strings and .stripped_strings

# If a tag contains more than one string, loop over them with .strings.
# The strings may include a lot of spaces or blank lines; .stripped_strings
# removes the extra whitespace.
from bs4 import BeautifulSoup
import requests
url = 'http://www.tripadvisor.cn/Attractions-g60763-Activities-New_York_City_New_York.html'
wbData = requests.get(url)
response = BeautifulSoup(wbData.text,'lxml')
titles = response.select('div.property_title > a[target="_blank"]')
images = response.select('img[width="160"]')  # attribute selectors use =, not :
cates = response.select('div.p13n_reasoning_v2')
for title,image,cate in zip(titles,images,cates):
    data = {
        'title':title.get_text(),
        'image':image.get('src'),
        'cate':list(cate.stripped_strings)
    }
    print(data)

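Python string operations

Posted 2016-10-24, updated 2018-03-27

You call that living? That is merely not being dead.

### Stripping whitespace

# strip removes whitespace from both ends
# lstrip removes whitespace from the left end
# rstrip removes whitespace from the right end
s = '  yang guo qi   '
s = s.strip()
print(s)

### Stripping special characters

s = '  yang guo qi  , '
s = s.strip().strip(',')
print(s)

### Getting the length

s = '  yang guo qi  , '
length = len(s)
print(length)

### Converting case

s = '  yang guo qi  , '
s = s.upper()
print(s)

s = '  Yang guo qi  , '
s = s.lower()
print(s)

### Reversing a string

s = '  Yang guo qi  , '
s = s[::-1]
print(s)

### Finding a substring

s = '  Yang guo qi  , '
d = 'qi'
print(s.find(d))
# Returns the index of the first match, or -1 if the substring is not found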

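### Slicing strings

s = '0123456789'
print(s[0:3])  # the first through third characters
print(s[:])  # the whole string
print(s[6:])  # from the seventh character to the end
print(s[:-3])  # from the start up to, but not including, the third-to-last character
print(s[2])  # the third character
print(s[-1])  # the last character
print(s[::-1])  # a reversed copy of the string
print(s[-3:-1])  # from the third-to-last character up to, but not including, the last
print(s[-3:])  # from the third-to-last character to the end
print(s[:-5:-3])  # reverse slice: start at the last character, step back 3 at a time, stop before index -5; gives '96'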
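
The puzzling last slice, `[:-5:-3]`, becomes clearer once the indices it visits are spelled out. A small sketch using the built-in `slice.indices()` method; `slice_indices` and `digits` are hypothetical names used only here:

```python
digits = "0123456789"


def slice_indices(text, start, stop, step):
    """Return the concrete indices visited by text[start:stop:step]."""
    return list(range(*slice(start, stop, step).indices(len(text))))


# With a negative step the start defaults to the last index (9);
# stop -5 means index 5, which is excluded, and each step moves back by 3.
visited = slice_indices(digits, None, -5, -3)
result = digits[:-5:-3]

print(visited)  # [9, 6]
print(result)   # 96
```

So the slice walks backwards from the end, picks indices 9 and 6, and stops because the next index, 3, lies past the stop position.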

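### Splitting strings

s = 'Yang,guo qi'
d = s.split(',')
print(d)

sStr1 = 'ab,cde,fgh,ijk'
sStr2 = ','
# Keep everything after the first separator
sStr1 = sStr1[sStr1.find(sStr2) + 1:]
print(sStr1)

### Formatting

# 1. One placeholder
s = 'this is my {}'.format('dog')
print(s)
# 2. Several placeholders, filled in order
s = 'this is my {} and {}'.format('dog','cat')
print(s)
# 3. A list comprehension: format() runs once for each i in range(30, 930, 30),
#    and the results are collected into a list
urls = ['oa{}-New_York'.format(str(i)) for i in range(30,930,30)]
print(urls)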
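
The third snippet is a list comprehension. Written out as an ordinary loop it does exactly the same thing, as this small sketch shows (`urls_loop` is a name introduced only for the comparison):

```python
# The one-line list comprehension from the example above
urls = ['oa{}-New_York'.format(str(i)) for i in range(30, 930, 30)]

# The same logic as an explicit loop
urls_loop = []
for i in range(30, 930, 30):              # 30, 60, ..., 900
    urls_loop.append('oa{}-New_York'.format(i))

print(urls == urls_loop)                  # True
print(urls[0], urls[-1], len(urls))       # oa30-New_York oa900-New_York 30
```

The `str(i)` call is optional, since `format()` converts the number to text on its own.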

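Python web scraping, week one

Posted 2016-10-24, updated 2018-04-25

I am neither hungry nor cold, so heaven has never failed me; if my learning makes no progress, how can I face heaven?

Installing lxml

1. First run pip install wheel
2. Download the matching .whl file here; do not rename the file!
   http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
   The number after cp is the Python version (27 means 2.7); pick the one that matches your Python version.
3. pip install <full file name, including the extension>

Example 1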

from bs4 import BeautifulSoup
from urllib.request import  urlopen
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
Soup = BeautifulSoup(resp,'lxml')
images = Soup.select('title')
for image in images:
    print(image.get_text())

Example 2

from bs4 import BeautifulSoup
from urllib.request import  urlopen
resp = urlopen('http://www.tooopen.com/img/87.aspx').read().decode('utf-8')
Soup = BeautifulSoup(resp,'lxml')
images = Soup.select('img')
for image in images:
    print(image.get('src'))

Example 3

from bs4 import BeautifulSoup
import requests
url = 'http://zhuanzhuan.58.com/detail/789842404801118212z.shtml'
def getInfo(url):
    wpData = requests.get(url)
    soup = BeautifulSoup(wpData.text,'lxml')
    title = soup.title.text
    price = soup.select('.price_now > i')
    tags = soup.select('.qual_label')
    return tags

data = getInfo(url)
for tag in data:
    print(tag.text)

Sending HTTP headers

from bs4 import BeautifulSoup
import  requests
url = 'http://www.tripadvisor.cn/Saves#525792'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWeb....36',
    'Cookie':'TASSK=eNs%2BEc2TG.....0N9BrZpKwsTH%2BAwg%3D...; NPID='
}

wpData = requests.get(url,headers = headers)
soup = BeautifulSoup(wpData.text,'lxml')
titles = soup.select('div.title')
print(soup)

Building multi-page URLs and throttling requests

from bs4 import BeautifulSoup
import  requests
import time
urls = ['http://www.tripadvisor.cn/...-oa{}-New_Y...rk.html#ATTRACTION_LIST'.format(str(i)) for i in range(30,930,30)]
for url in urls:
    time.sleep(2)
    print(url)

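Asynchronous loading

In Chrome's developer tools, look for XHR request URLs that follow a regular pattern

Tips

If the data cannot be fetched directly from the desktop site, try the mobile site; it often exposes the same data directly.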

How to print SQL in Laravel

Posted 2016-10-24, updated 2018-04-08

You too are a small leaf trembling on my chest. The wind of life has blown you here.

Method 1

//in our case,
 DB::table('users')->toSql(); 
//return
select * from users

Method 2

Route::get('haha',function (){
    Event::listen('illuminate.query', function($query, $params, $time, $conn)
    {
        dd(array($query, $params, $time, $conn));
    });

    \RainLab\User\Models\User::whereBetween('created_at', 
    	['2016-05-26 00:00:00', '2016-06-27 00:00:00']
    	)->get(['email']);
});

Method 3

DB::enableQueryLog();
   $users = \RainLab\User\Models\User::whereBetween('created_at', ['2016-05-26 00:00:00', '2016-06-27 00:00:00'])->get();
   dd(DB::getQueryLog());

Method 4

function trace_sql($dump = false)
{
    \DB::listen(function ($event) use($dump) {
        if ($dump) {
            dump($event->sql);
            dump($event->bindings);
        }
        info($event->sql);
        info($event->bindings);
    });
}

The last method

// Install the debug bar
composer require barryvdh/laravel-debugbar