Processing and Displaying National Bureau of Statistics Data with pandas

The main data structure of pandas: the Series object. A Series is a one-dimensional, array-like object made up of a set of data plus an associated set of data labels (the index), and it can hold data of any type:

```
0    Python
1      Java
2       C++
```

(left column: index; right column: data). Creating a Series object: pandas uses the `Series()` constructor to create a Series object; through this object you can call the corresponding methods and attributes to process the data. ...
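The creation step described above can be sketched as follows; the example values mirror the index/data table shown in the summary:

```python
import pandas as pd

# Create a Series from a list; pandas assigns a default integer index 0..n-1
s = pd.Series(["Python", "Java", "C++"])
print(s)
```

`s.index` and `s.values` expose the two components (labels and data) separately.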

January 29, 2026 · ☕☕ 6 min · 📄 2.7k characters · Python Web Scraping

Scraping Data with Complex Structure

Fetching basic information for a single PubMed article: https://pubmed.ncbi.nlm.nih.gov/33883728/

```python
import requests
from lxml import etree

url = "https://pubmed.ncbi.nlm.nih.gov/33883728/"
r = requests.get(url).text
html = etree.HTML(r)

# title
title = html.xpath('//*[@id="full-view-heading"]/h1/text()')[0].strip()
print(title)

# authors
authors = html.xpath('//*[@id="full-view-heading"]/div[2]/div/div/span/a/text()')
authors = ','.join(authors)
print(authors)

# PubMed ID
pmID = html.xpath('//*[@id="full-view-identifiers"]/li[1]/span/strong/text()')[0]
print(pmID)

# journal name
mag = html.xpath('//*[@id="full-view-journal-trigger"]/text()')[0].strip()
print(mag)

# publication info and year
info = html.xpath('//*[@id="full-view-heading"]/div[1]/div[2]/span[2]/text()')[0].split(';')
year = info[0][:4]
info = info[1]
print(info)
print(year)

# abstract
abstract = html.xpath('//*[@id="eng-abstract"]/p/text()')[0].strip()
print(abstract)

# keywords (not present on every article page)
try:
    kw = html.xpath('/html/body/div[5]/main/div[2]/p/text()')[1].strip()
    print(kw)
except IndexError:
    pass
```

Fetching basic information for multiple PubMed articles — obtaining the articles' links: the search results page shows ten articles per page by default, so start by scraping the link of a single article ...

January 29, 2026 · ☕ 3 min · 📄 1.4k characters · Python Web Scraping

Scraping Page Data

A simple template:

```python
import requests
from bs4 import BeautifulSoup

myHeader = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
url = ""

def getOne(url):
    r = requests.get(url, headers=myHeader).content.decode('utf-8')
    soup = BeautifulSoup(r, 'html.parser')
    t = soup.find_all()
```

Fetching the content of a single family letter (translated text). Target site: http://ewenyan.com/articles/zgfjs/1.html ...

January 28, 2026 · ☕☕ 5 min · 📄 2.1k characters · Python Web Scraping

Scraping Dynamic Data

Scraping weather data for a single city. Identify the target page: https://www.weather.com.cn/ Analyze the page data:

```python
import requests
from bs4 import BeautifulSoup

myHeader = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
url = "https://www.weather.com.cn/weather1d/101010100.shtml"
r = requests.get(url, headers=myHeader)
html = r.content.decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
print(soup.find('div', class_='tem'))
```
...

January 28, 2026 · ☕☕☕☕☕ 75 min · 📄 37k characters · Python Web Scraping

Drawing Word Clouds

The third-party wordcloud library. Install: `pip install wordcloud`; or via a mirror: `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple wordcloud`. Docs: https://amueller.github.io/word_cloud/index.html (`wordcloud.WordCloud()`). Case 1: scraping the Government Work Report and drawing a word cloud:

```python
import urllib.request
from bs4 import BeautifulSoup
from wordcloud import WordCloud

url = "https://www.gov.cn/zhuanti/2021lhzfgzbg/index.htm"
response = urllib.request.urlopen(url)
html = response.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", class_="zhj-bbqw-cont").text
w = WordCloud(font_path="/Fonts/simhei.ttf").generate(content)
w.to_file("政府工作报告y1.png")
```
...

January 28, 2026 · ☕☕ 6 min · 📄 2.6k characters · Python Web Scraping

The Scrapy Crawler Framework

Introduction to the Scrapy crawler framework. Official site: https://www.scrapy.org/ Docs: https://docs.scrapy.net.cn/en/latest/ A fast, powerful web-crawling framework. Installing Scrapy: `pip install scrapy`, then `scrapy -h` to check the installation. Structure of the Scrapy framework: Scrapy is not a library of functions but a crawler framework. ...

January 27, 2026 · ☕☕ 4 min · 📄 1.8k characters · Python Web Scraping

Getting Started with the Re Library

Regular expressions (regular expression, regex, RE). A regular expression is an expression that concisely describes a set of strings; it is a tool for capturing the "conciseness" and "features" of strings; it can be used to test whether a given string belongs to such a set. ...
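The idea of "concisely expressing a set of strings" can be sketched with Python's built-in re module; the pattern here is an illustrative assumption:

```python
import re

# 'PY+' concisely describes the infinite set {'PY', 'PYY', 'PYYY', ...}
pattern = re.compile(r'PY+')

print(pattern.fullmatch('PYYY') is not None)   # in the set: True
print(pattern.fullmatch('P') is not None)      # no trailing Y: False
```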

January 26, 2026 · ☕☕ 7 min · 📄 3.3k characters · Python Web Scraping

Information Marking and Extraction Methods

The three forms of information marking. Marking information: marked-up information forms an organizational structure, adding a dimension to the information; the markup structure is as valuable as the information itself; marked-up information can be used for communication, storage, or display; marked-up information is easier for programs to understand and use. ...
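A small illustration of how markup adds a dimension to raw information, using JSON as one common marking form (the record itself is made up for this sketch):

```python
import json

# The raw data "Requests, library, Python" gains structure once marked up:
record = '{"name": "Requests", "kind": "library", "language": "Python"}'
info = json.loads(record)

# A program can now address each field by its label instead of by position
print(info["name"])
```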

January 26, 2026 · ☕☕ 5 min · 📄 2.0k characters · Python Web Scraping

Getting Started with the Beautiful Soup Library

Getting started with the Beautiful Soup library. Official site: https://www.crummy.com/software/BeautifulSoup/ You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects. Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful: ...
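A minimal, self-contained sketch of the "get some data out of it" idea, parsing an inline HTML fragment instead of a live page (the fragment is made up):

```python
from bs4 import BeautifulSoup

# Parse a small fragment; no network access needed
html = "<html><body><p class='title'>Demo</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p", class_="title").text)   # Demo
```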

January 25, 2026 · ☕☕ 5 min · 📄 2.3k characters · Python Web Scraping

Getting Started with the Requests Library

https://python-requests.org/ Getting started with the Requests library. Install: `pip install requests`. Basic usage:

```python
import requests

r = requests.get("http://www.baidu.com")
r.status_code        # 200
r.encoding = 'utf-8'
r.text
```
...

January 25, 2026 · ☕☕☕☕ 11 min · 📄 5.2k characters · Python Web Scraping