How to Crawl CSDN's Site-Wide Hot-List Title Keywords in Python with the Scrapy Framework
Published: 2024-11-26  Author: 千家信息网 editor
In this article I'll share how to crawl the title keywords of CSDN's site-wide hot list in Python with the Scrapy framework. I hope you get something out of it; let's dig in.
Environment setup
Install Scrapy
pip install scrapy -i https://pypi.douban.com/simple
Install Selenium
pip install selenium -i https://pypi.douban.com/simple
Install jieba
pip install jieba -i https://pypi.douban.com/simple
IDE: PyCharm
ChromeDriver (download the version matching your browser): google chrome driver download page
Check your Chrome version and download the corresponding chromedriver.
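If you want to confirm that Selenium, Chrome, and the driver work together, here is a minimal sanity-check sketch; the driver path is just the example path reused from the spider later in this article, so adjust it to your own setup:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--headless')
# The executable_path below is an example; point it at your own chromedriver.
driver = webdriver.Chrome(executable_path="E:\\chromedriver_win32\\chromedriver.exe",
                          chrome_options=opts)
# 'browserVersion' (or 'version' on older Selenium) should match your installed Chrome.
print(driver.capabilities.get('browserVersion') or driver.capabilities.get('version'))
driver.quit()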
Implementation
Let's get started.
Create the project
Create the project with the scrapy command:
scrapy startproject csdn_hot_words
The project structure is the standard one produced by scrapy startproject; a rough sketch of the layout used in this article follows.
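For reference, this is roughly how the project looks once the keyword-extraction helper and the main program from later sections are added (placing main.py at the project root is my assumption):

csdn_hot_words/
├── scrapy.cfg
├── main.py                      # entry point added at the end of this article (location assumed)
└── csdn_hot_words/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── spiders/
    │   └── csdn.py              # the spider built below
    └── tools/
        └── analyse_sentence.py  # jieba keyword-extraction helper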
Define the Item
Following the earlier logic, the item's main field is a dictionary mapping each title keyword to its occurrence count. The code is as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()
Keyword extraction tool
Keywords are extracted with jieba's extract_tags (TF-IDF based), taking up to three keywords per title.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    :
# @File    : analyse_sentence.py
import jieba.analyse


def get_key_word(sentence):
    result_dic = {}
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, flag in words_lis:  # flag is the TF-IDF weight (unused here)
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic
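A quick way to sanity-check the helper is to append a small test block to analyse_sentence.py; the title below is made up for illustration, and the exact keywords depend on jieba's built-in TF-IDF dictionary:

if __name__ == '__main__':
    # Hypothetical title, for illustration only.
    demo_title = 'Python通过Scrapy框架爬取CSDN热榜标题'
    print(get_key_word(demo_title))
    # prints something like {'Scrapy': 1, 'CSDN': 1, '热榜': 1}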
Build the spider
The spider initializes a headless Chrome instance in __init__; the downloader middleware will use it to render the dynamically loaded hot-list page.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    :
# @File    : csdn.py
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # run Chrome in headless mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item
Code notes
1. Chrome runs in headless mode, so no browser window is opened; everything executes in the background.
2. Set executable_path to the location of your chromedriver executable.
3. In parse, the XPath (see my earlier articles) extracts each title, the keyword extractor is called on it, and an item object is built; a small stand-alone sketch for testing the XPath follows this list.
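To try the XPath on its own, here is a minimal sketch using scrapy.Selector; the HTML fragment is a simplified stand-in for the real hot-list markup:

from scrapy import Selector

html = '''
<div class="hosetitem-title">
    <a href="https://blog.csdn.net/example">某篇上榜文章的标题</a>
</div>
'''
sel = Selector(text=html)
print(sel.xpath("//div[@class='hosetitem-title']/a/text()").getall())
# ['某篇上榜文章的标题']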
Build the middleware
The downloader middleware drives the Selenium browser: it injects a JavaScript snippet that scrolls the page step by step so the lazily loaded list entries render, then returns the rendered page as an HtmlResponse (returning a Response from process_request makes Scrapy skip its default downloader for that request). Full middleware code:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

from selenium.webdriver.chrome.options import Options

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Scroll the page in steps so the lazily loaded hot-list entries render.
        js = '''
            let height = 0
            let interval = setInterval(() => {
                window.scrollTo({
                    top: height,
                    behavior: "smooth"
                });
                height += 500
            }, 500);
            setTimeout(() => {
                clearInterval(interval)
            }, 20000);
        '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('Timeout while loading page: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            # Close the browser window once the (single) request has been handled.
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Custom pipeline
The pipeline collects every item, and when the spider closes it aggregates the keyword counts, sorts them by frequency, and writes the result to a file. Code as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CsdnHotWordsPipeline:

    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []

    def process_item(self, item, spider):
        self.all_words.append(item)
        return item

    def close_spider(self, spider):
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        word_count_sort = sorted(key_word_dic.items(),
                                 key=lambda x: x[1],
                                 reverse=True)
        for word in word_count_sort:
            self.file.write('{},{}\n'.format(word[0], word[1]))
        self.file.close()
Settings
A few adjustments are needed: register the custom spider and downloader middlewares and the pipeline, set a User-Agent, turn off robots.txt obedience and cookies, and add a download delay. The adjusted settings:
# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'csdn_hot_words'

SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 30
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Run the main program
You can run the crawl with the scrapy command directly, but to make the logs easier to watch, a small main program is added.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 22:41
# @Author  : 至尊宝
# @Site    :
# @File    : main.py
from scrapy import cmdline

cmdline.execute('scrapy crawl csdn'.split())
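An alternative (my own sketch, not part of the original write-up) is to start the crawl programmatically with Scrapy's CrawlerProcess, which also keeps the logs in the console:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # Load settings.py from the project and run the spider registered as 'csdn'.
    process = CrawlerProcess(get_project_settings())
    process.crawl('csdn')
    process.start()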
Results
Partial execution log (screenshot omitted).
When the crawl finishes, result.txt contains the final word counts.
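Each line of result.txt is a "word,count" pair in descending order of frequency, so a few lines of Python are enough to peek at the top entries; a small sketch:

with open('result.txt', encoding='utf-8') as f:
    for line in list(f)[:10]:  # top 10 hot words
        word, count = line.rstrip('\n').rsplit(',', 1)
        print(word, count)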
Having read this article, you should now have a good idea of how to crawl CSDN's site-wide hot-list title keywords in Python with Scrapy. If you'd like to learn more, follow the industry news channel. Thanks for reading!