How the project started

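My family wanted to listen to Teochew opera (潮剧) recordings, but the major music sites carry very little of it. I went looking on less popular sites and found that 9ku (九酷) has a fairly large collection. Its interface is crude and it offers no download feature, so I tried fetching the audio with a crawler.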

Working through the problems

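My first attempt was the most basic kind of download: request the page URL directly with requests. The page source did not contain the audio's playback URL, so I opened the browser's developer tools, filtered the network requests down to the audio file, and requested that file directly. That returned a 403. Comparing headers showed that my custom headers were missing Referer, which indicates that the site requires requests to come from its own domain and blocks cross-site ones. To rule out other header-related problems, I filled in the headers so that they match what the browser sends.

Since the audio URL was not in the page source, it was presumably loaded via AJAX. Filtering for XHR requests turned up playlist.php, whose response is JSON containing the audio URLs. A GET request to it returned an empty body, and opening the URL directly in the browser showed a blank page as well. A closer look at the request details revealed that it is a POST request; sending a GET produces a different response.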
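The two findings above (a Referer from the site's own player page, and POST rather than GET for playlist.php) can be sketched as a pair of small helpers. This is a minimal sketch, not the code used in the post: the names `build_headers` and `fetch_playlist` are hypothetical, the shortened header set is an assumption (the site may require more of the browser's headers), and `session` stands for any object with a requests-style `.post()` method, such as `requests.Session()`.

```python
PLAYLIST_URL = "https://www.9ku.com/playlist.php"


def build_headers(song_id):
    """Headers that make the request look like it came from the site's player page.

    Assumption: this reduced set is enough; the post itself copies every
    header the browser sent.
    """
    return {
        "Referer": f"https://www.9ku.com/play/{song_id}.htm",
        "Origin": "https://www.9ku.com",
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0",
    }


def fetch_playlist(session, song_ids):
    """POST the song IDs to playlist.php and return the decoded JSON list."""
    # The endpoint expects a form field named 'ids' with comma-separated IDs.
    payload = {"ids": ",".join(str(s) for s in song_ids)}
    response = session.post(
        PLAYLIST_URL,
        headers=build_headers(song_ids[0]),
        data=payload,
    )
    # Raise on a 403 (for example a rejected Referer) instead of continuing.
    response.raise_for_status()
    return response.json()
```

With requests installed, a call might look like `fetch_playlist(requests.Session(), [934408, 934409])`. Passing the session in keeps the helper easy to try against a stub before touching the real site.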

Implementation

import os
import requests
import re

# Create the download directory if it does not exist, then work inside it
file = "D:\\Desktop\\潮剧下载"
if not os.path.exists(file):
    os.mkdir(file)
os.chdir(file)

# Cookies copied from the browser session on 9ku.com
cookies = {
'tt': 'ok',
'cc': 'ok',
'pp': 'ok',
'_gid': 'GA1.2.192858228.1660697434',
'Hm_lvt_2698d85c1eaff072c02e1fb3b945eaff': '1660697836',
'jpvolume': '0.738',
'tmp_addplay': '',
'Hm_lvt_a5de315acb973b8e6da83458c9e456d3': '1660181157,1660697423,1660722528,1660786572',
'_gat_gtag_UA_83639769_5': '1',
'ff': 'ok',
'l_music': '934435/934408/42620/934478/934477/859925/888053/934351/934520/934284/860206/934504l_end',
'visit_nums': '2',
'shows': 'ok',
'jk_ifplay': '1',
'_ga_4BBS961T5P': 'GS1.1.1660786581.8.1.1660786600.0.0.0',
'_ga': 'GA1.2.83816721.1660181158',
'Hm_lpvt_a5de315acb973b8e6da83458c9e456d3': '1660786600',
'jk_addplay': '934408/934409/934410/934411/934412/934413/934414/934415/934416/934417/934418/934419/934420/934421/934422/934423/934424/934425/934426/934427/934428/934431/934432/934433/934435/934436/934437/934438/934439/934440/934441/934442/934443/934444/934445/934446/934447/934448/934449/934450/934504/934284/934539/934534/934351/934451/934505/934257/934520/934268/934276/934312/934401/934429/934430/934434/934460/934462/934473/934476/934481/934484/934499/934245/934503/934506/934510/934527/934274/934275/934546/934319/934396/934400/934402/934403/934404/934405/934406/934407/934452/934453/934454/934455/934456/934457/934458/934459/934461/934463/934464/934465/934466/934467/934468/934469/934470/934471/934472/934474/934475/934477/934478/934479/934480/934482/934483/934485/934486/934487/934488/934489/934490/934491/934492/934237/934493/934238/934494/934239',
}

# Headers copied from the browser; Referer is the one whose absence caused the 403
headers = {
'Accept': '*/*',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Origin': 'https://www.9ku.com',
'Referer': 'https://www.9ku.com/play/934435.htm',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.81 Safari/537.36 Edg/104.0.1293.54',
'X-Requested-With': 'XMLHttpRequest',
'sec-ch-ua': '"Chromium";v="104", " Not A;Brand";v="99", "Microsoft Edge";v="104"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
}

# Form body for playlist.php: the comma-separated IDs of the tracks to look up
data = {
'ids': '934408,934409,934410,934411,934412,934413,934414,934415,934416,934417,934418,934419,934420,934421,934422,934423,934424,934425,934426,934427,934428,934431,934432,934433,934435,934436,934437,934438,934439,934440,934441,934442,934443,934444,934445,934446,934447,934448,934449,934450,934504,934284,934539,934534,934351,934451,934505,934257,934520,934268,934276,934312,934401,934429,934430,934434,934460,934462,934473,934476,934481,934484,934499,934245,934503,934506,934510,934527,934274,934275,934546,934319,934396,934400,934402,934403,934404,934405,934406,934407,934452,934453,934454,934455,934456,934457,934458,934459,934461,934463,934464,934465,934466,934467,934468,934469,934470,934471,934472,934474,934475,934477,934478,934479,934480,934482,934483,934485,934486,934487,934488,934489,934490,934491,934492,934237,934493,934238,934494,934239',
}

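# playlist.php only answers POST; the JSON response lists each track's URL and title
response = requests.post('https://www.9ku.com/playlist.php', cookies=cookies, headers=headers, data=data)
for i in response.json():
    url = i['wma']
    title = i['mname']
    # print(url, title)
    print(title + " is downloading")
    # The audio request needs the same headers (notably Referer) to avoid a 403
    response_final = requests.get(url, headers=headers)
    with open(title + '.mp3', 'wb') as f:
        f.write(response_final.content)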
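The download loop writes whatever comes back under the raw title. A slightly more defensive variant is sketched below, assuming two things the post does not confirm: that some titles may contain characters Windows rejects in file names, and that a failed request should stop the download rather than save an error page as an .mp3. The names `safe_filename` and `save_track` are hypothetical, and `session` is any object with a requests-style `.get()` method.

```python
import os
import re


def safe_filename(title, extension=".mp3"):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return (cleaned or "untitled") + extension


def save_track(session, item, directory, headers=None):
    """Download one playlist entry and return the path it was written to.

    `item` is one dict from the playlist JSON, with 'wma' (audio URL)
    and 'mname' (title) keys, as used in the loop above.
    """
    response = session.get(item["wma"], headers=headers)
    # Fail on 403/404 instead of writing an error page to disk.
    response.raise_for_status()
    path = os.path.join(directory, safe_filename(item["mname"]))
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

In the script above, this would replace the loop body with something like `save_track(requests.Session(), i, ".", headers=headers)`.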

Author: magic stone
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit magic stone as the source when reposting.