Python: Fetching All Passenger Train Numbers and Their Origin/Terminus Stations Nationwide (Updated)
The core of this project is 12306's query API: given a departure station, a destination station, and a travel date, it returns every matching train. So by enumerating all (departure, destination) pairs, we can collect the train numbers and related information for every passenger train in the country.
1. Getting the station codes
12306 keeps all of its station code information in this JavaScript file:
https://kyfw.12306.cn/otn/resources/js/framework/station_name.js
Records are separated by the @ character, and the fields within each record are separated by the | character, for example:
@bjb|北京北|VAP|beijingbei|bjb|0
that is:
bjb 北京北 VAP beijingbei bjb 0
where bjb is the station's short code, 北京北 is the station name, VAP is the station telecode, beijingbei is the full pinyin, bjb is the pinyin abbreviation, and 0 is the station index.
These fields can be extracted easily with a regular expression:
@([a-z]*)\|(.*?)\|([A-Z]*)\|([a-z]*)\|([a-z]*)\|([0-9]*)
The extracted result looks like this:
bjb 北京北 VAP beijingbei bjb 0
bjd 北京东 BOP beijingdong bjd 1
bji 北京 BJP beijing bj 2
bjn 北京南 VNP beijingnan bjn 3
bjx 北京西 BXP beijingxi bjx 4
gzn 广州南 IZQ guangzhounan gzn 5
cqb 重庆北 CUW chongqingbei cqb 6
cqi 重庆 CQW chongqing cq 7
cqn 重庆南 CRW chongqingnan cqn 8
gzd 广州东 GGQ guangzhoudong gzd 9
The column we care about most is the third one: the station telecode.
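As a quick sanity check, the pattern can be tried on the single record from earlier, using only the standard library:

import re

pattern = r'@([a-z]*)\|(.*?)\|([A-Z]*)\|([a-z]*)\|([a-z]*)\|([0-9]*)'
record = '@bjb|北京北|VAP|beijingbei|bjb|0'
print(re.findall(pattern, record))
# [('bjb', '北京北', 'VAP', 'beijingbei', 'bjb', '0')]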
The complete code follows; here I write the results straight into a MySQL database:
"""
用来获取全国火车站的名字、编码等信息,直接存储到数据库。
"""
import re
import urllib.request
import ssl
import mysql.connector
ssl._create_default_https_context = ssl._create_unverified_context
if __name__ == '__main__':
cnx = mysql.connector.connect(
user='root', password='xxxxx', database='12306')
cursor = cnx.cursor()
# 数据库插入命令
add_train = 'INSERT INTO station (bianma,mingzi,daima,pinyin,suoxie,xuhao) VALUES (%s,%s,%s,%s,%s,%s)'
# 含有全国火车站名字、编码等信息的javascript文件
url = 'https://kyfw.12306.cn/otn/resources/js/framework/station_name.js'
con = urllib.request.urlopen(url)
js = con.read().decode('utf-8')
# 使用正则表达式进行分割
r = re.findall(
'@([a-z]*)\|(.*?)\|([A-Z]*)\|([a-z]*)\|([a-z]*)\|([0-9]*)', js)
for line in r:
cursor.execute(add_train, line)
cnx.commit()
cursor.close()
cnx.close()
print('done')
This yields roughly 2,400 stations, so a full traversal would require about 2,400 × 2,400 API calls, which is clearly impractical. But every train necessarily passes through at least two "major" stations, which gives us an opening: a look at the station data shows that 12306 already grades its stations, so it suffices to keep only roughly the first 500 of them. Note also that when searching, 12306 treats different stations in the same city (for example 南京 and 南京南) as equivalent rather than distinguishing them, so a further batch of stations can be pruned by hand. After pruning I was left with 462 stations, which is acceptable.
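One way to produce the pruned list, station_lite.txt, used in part two (a minimal sketch: it assumes the station table from part one and that the xuhao column, i.e. the order of records in station_name.js, reflects 12306's grading; the final city-level deduplication is still done by hand):

"""
Write the first N stations, ordered by xuhao, to station_lite.txt.
The output is tab-separated with the telecode in the third field,
which is the layout fill_queue() in part two expects.
"""
import mysql.connector

N = 500  # keep roughly the first 500 stations

cnx = mysql.connector.connect(
    user='root', password='xxxxx', database='12306')
cursor = cnx.cursor()
# 'xuhao + 0' forces a numeric sort even if the column is stored as text
cursor.execute(
    'SELECT bianma, mingzi, daima FROM station ORDER BY xuhao + 0 LIMIT %s',
    (N,))
with open('station_lite.txt', 'w') as f:
    for bianma, mingzi, daima in cursor:
        f.write('\t'.join((bianma, mingzi, daima)) + '\n')
cursor.close()
cnx.close()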
2. Getting the train numbers and related information
This part is really just two nested loops: the outer loop runs over departure stations and the inner loop over destination stations.
Each API call returns JSON, which Python's built-in json module parses easily, so all that remains is to store each query's train information in the database.
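For concreteness, here is the shape of the response that the script below relies on. The field names are the ones the script reads; the values in this sample are invented purely for illustration and are not real API output:

import json

# a trimmed, hypothetical response: real payloads carry many more fields
sample = '''
{"data": [{"queryLeftNewDTO": {
    "train_no":               "240000G1010I",
    "station_train_code":     "G101",
    "start_station_telecode": "VNP",
    "end_station_telecode":   "AOH",
    "seat_feature":           "O3M393",
    "seat_types":             "OM9",
    "train_seat_feature":     "3"}}]}
'''

j = json.loads(sample)
for train in j['data']:
    info = train['queryLeftNewDTO']
    print(info['station_train_code'],
          info['start_station_telecode'], '->', info['end_station_telecode'])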
The complete code:
"""
获取全国客运火车车次、始发站、终点站等信息。
实际就是采用遍历始发站、终点站,使用12306的API搜索车次,将得到的车次存入数据库。
"""
import urllib.request
import ssl
import json
import socket
import random
import queue
import threading
import mysql.connector
mutex = threading.Lock() # 多线程获取出发站。目的站锁
socket.setdefaulttimeout(5) # 5秒超时
# 12306证书问题,禁止证书检测
ssl._create_default_https_context = ssl._create_unverified_context
# 数据库插入命令
add_train = 'INSERT IGNORE INTO train_info (start_station_telecode,end_station_telecode,seat_feature,seat_types,train_no,station_train_code,train_seat_feature) VALUES (%s,%s,%s,%s,%s,%s,%s)'
headers = [{ # 随机选择headers发送GET请求
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36'
},
{
'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0'
},
{
'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)'
},
{
'User-Agent': 'Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50'
},
{
'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1'
}]
queue_start = queue.Queue() # 出发站队列
queue_end = queue.Queue() # 目的站队列
def fill_queue(queue): # 填充队列
# station_lite.txt为去除重复站点,只包括一级、二级站点的车站编码等的文档
with open('station_lite.txt', 'r') as f:
for line in f:
queue.put(line.split('\t')[2]) # 车站编码
print('fill queue success')
def init(): # 填充出发站。目的站队列
fill_queue(queue_start)
fill_queue(queue_end)
print('init success')
def start(): # 主程序,即两层循环
date = '2015-07-31' # 设定查询的时间
opener = urllib.request.build_opener()
start_station = queue_start.get()
while(not queue_start.empty()): # 出发站为大循环
with mutex: # 多线程锁
if(not queue_end.empty()): # 考虑目的站队列用完的情况
end_station = queue_end.get()
else:
fill_queue(queue_end)
start_station = queue_start.get()
cnx.commit() # 每完成一个大循环,数据库commit一次
get_train_info(opener, start_station, end_station, date)
# 获取特定出发站、目的站、日期的查询结果
def get_train_info(opener, start_station, end_station, date):
print('dealing with:', start_station, end_station, end=' ')
get = urllib.request.Request('https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=' + date + '&leftTicketDTO.from_station=' +
start_station + '&leftTicketDTO.to_station=' + end_station + '&purpose_codes=ADULT', headers=headers[random.randrange(5)], method='GET')
try:
con = opener.open(get).read().decode('utf-8')
except Exception as e: # 出现网络问题则再次调用,直到得到需要的信息
print(e)
get_train_info(opener, start_station, end_station, date)
return
j = json.loads(con) # API接口返回JSON格式列车信息,使用json模块处理
print('found', len(j['data'])) # 显示查询到了多少趟列车
for train in j['data']:
train = train['queryLeftNewDTO']
cursor.execute(add_train, (train['start_station_telecode'], train['end_station_telecode'], train['seat_feature'], train[
'seat_types'], train['train_no'], train['station_train_code'], train['train_seat_feature'])) # 数据库操作
# cnx.commit()
if __name__ == '__main__':
init()
cnx = mysql.connector.connect(
user='root', password='12325963', database='12306')
cursor = cnx.cursor()
threads = []
for i in range(1): # 这个查询API如果访问过于频繁会封IP,但这里还是保留了多线程功能,应对可能出现的情况
d = threading.Thread(target=start)
threads.append(d)
for d in threads:
d.start()
for d in threads:
d.join()
print('done')
Notes:
- station_lite.txt is the station list that remains after my manual pruning
- The code keeps the ability to call the API concurrently from multiple threads, but calling this API too often gets your IP banned, so use that capability with caution
- Under excellent network conditions a call can finish in about 0.1 s; in that case consider adding time.sleep() to slow the calls down
- The IGNORE in the MySQL insert statement skips rows whose primary key already exists, so deduplication happens at insert time (see the schema sketch after this list)
- Even with a good connection a full traversal takes at least around 12 hours in theory, so mind the network environment
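For INSERT IGNORE to deduplicate, train_info needs a primary (or unique) key. A minimal sketch of such a schema, assuming train_no uniquely identifies a train; the column list matches the insert statement above, but the types are my guesses, not the original table definition:

"""
A guessed schema for train_info (illustrative types, not the original).
"""
import mysql.connector

create_train_info = '''
CREATE TABLE IF NOT EXISTS train_info (
    start_station_telecode VARCHAR(8),
    end_station_telecode   VARCHAR(8),
    seat_feature           VARCHAR(32),
    seat_types             VARCHAR(32),
    train_no               VARCHAR(16) NOT NULL,
    station_train_code     VARCHAR(16),
    train_seat_feature     VARCHAR(32),
    PRIMARY KEY (train_no)
)
'''

cnx = mysql.connector.connect(user='root', password='xxxxx', database='12306')
cursor = cnx.cursor()
cursor.execute(create_train_info)
cnx.commit()
cursor.close()
cnx.close()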
Possible improvements:
- Once 12306 bans an IP, requests fail with a 403 error. There is no dedicated handling for 403 yet, so the retry logic will loop forever on a banned IP (see the sketch after this list)
- The only speed-up I can think of so far is routing requests through proxy IPs, which I have not tried yet
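A minimal sketch of both ideas using only the standard library: give up immediately on a 403 instead of retrying, and build the opener with a ProxyHandler. The fetch() helper and the proxy address are my own illustrations, not part of the original script:

import urllib.request
import urllib.error

# route all requests through a proxy; the address below is a placeholder
proxy = urllib.request.ProxyHandler({'https': 'http://127.0.0.1:8118'})
opener = urllib.request.build_opener(proxy)


def fetch(opener, url, retries=3):
    # Retry transient errors a bounded number of times, but give up
    # immediately on a 403 so a banned IP cannot cause an infinite loop.
    for attempt in range(retries):
        try:
            return opener.open(url).read().decode('utf-8')
        except urllib.error.HTTPError as e:
            if e.code == 403:  # IP banned: switch proxies instead of retrying
                raise
            print('HTTP error, retrying:', e)
        except Exception as e:  # timeouts and other transient failures
            print('network error, retrying:', e)
    raise RuntimeError('gave up after %d attempts' % retries)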