Web Scraping Basics | xvsf

Fetching Data with Complex Structure

Fetching basic information for a single PubMed article

https://pubmed.ncbi.nlm.nih.gov/33883728/

```python
import requests
from lxml import etree

url = "https://pubmed.ncbi.nlm.nih.gov/33883728/"
r = requests.get(url).text
html = etree.HTML(r)

# Article title
title = html.xpath('//*[@id="full-view-heading"]/h1/text()')[0].strip()
print(title)

# Author names, joined into one comma-separated string
authors = html.xpath('//*[@id="full-view-heading"]/div[2]/div/div/span/a/text()')
authors = ','.join(authors)
print(authors)

# PubMed ID
pmID = html.xpath('//*[@id="full-view-identifiers"]/li[1]/span/strong/text()')[0]
print(pmID)

# Journal name
mag = html.xpath('//*[@id="full-view-journal-trigger"]/text()')[0].strip()
print(mag)

# Citation string: the part before ';' starts with the year, the rest follows it
citation = html.xpath('//*[@id="full-view-heading"]/div[1]/div[2]/span[2]/text()')[0].split(';')
year = citation[0][:4]
info = citation[1]
print(info)
print(year)

# Abstract (first paragraph only)
abstract = html.xpath('//*[@id="eng-abstract"]/p/text()')[0].strip()
print(abstract)

# Keywords are not present on every article page, so a missing node is skipped
try:
    kw = html.xpath('/html/body/div[5]/main/div[2]/p/text()')[1].strip()
    print(kw)
except IndexError:
    pass
```

Fetching basic information for multiple PubMed articles

Getting the article links: the search page lists ten articles by default. Start by scraping the link of a single article ...
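The link-collection step described above might look roughly like the sketch below. The excerpt is cut off before the actual code, so everything here is an assumption: the sample markup is hand-written, the `docsum-title` class, the second article ID, and the helper name `extract_article_links` are hypothetical, and the real search page may be structured differently.

```python
from urllib.parse import urljoin

from lxml import etree

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"

# Hand-written stand-in for a search results page (assumed markup, made-up second ID).
SAMPLE_SEARCH_HTML = """
<html><body>
  <article><a class="docsum-title" href="/33883728/">First result</a></article>
  <article><a class="docsum-title" href="/12345678/">Second result</a></article>
</body></html>
"""


def extract_article_links(page_source, base_url=BASE_URL):
    """Return absolute article URLs found in the page source of a search page."""
    html = etree.HTML(page_source)
    # Collect the href attribute of every result-title link (assumed class name).
    hrefs = html.xpath('//a[@class="docsum-title"]/@href')
    # The hrefs are site-relative, so resolve them against the site root.
    return [urljoin(base_url, href) for href in hrefs]


links = extract_article_links(SAMPLE_SEARCH_HTML)
print(links[0])
```

Each URL in `links` could then be passed to the single-article code above in a loop.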

Scraping Page Data

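Simple template

```python
import requests
from bs4 import BeautifulSoup

myHeader = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
url = ""  # fill in the target page

def getOne(url):
    # Send a browser-like User-Agent and decode the raw bytes as UTF-8
    r = requests.get(url, headers=myHeader).content.decode('utf-8')
    soup = BeautifulSoup(r, 'html.parser')
    # Placeholder: pass the tag name and attributes to search for
    t = soup.find_all()
```

Fetching the content (【译文】, the translation) of a single family letter

Target site: http://ewenyan.com/articles/zgfjs/1.html ...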

Introduction to the re Library

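Regular expressions (regular expression, regex, RE)

A regular expression is an expression that concisely describes a set of strings.
A regular expression is a tool that captures the ideas of "conciseness" and "features" for strings.
A regular expression can be used to decide whether a string belongs to a given feature class ...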
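As a small illustration of one expression standing for a set of strings, the sketch below uses Python's standard `re` module. The pattern and the helper name `belongs_to_set` are chosen here for illustration and are not taken from the post.

```python
import re

# One expression that stands for the five strings
# "PN", "PYN", "PYTN", "PYTHN" and "PYTHON".
PATTERN = re.compile(r"P(Y|YT|YTH|YTHO)?N")


def belongs_to_set(text):
    """Return True if the whole string is one of the strings the pattern describes."""
    return PATTERN.fullmatch(text) is not None


for candidate in ["PN", "PYTHON", "PYTHONS", "PTN"]:
    print(candidate, belongs_to_set(candidate))
```

`fullmatch` is used rather than `search` so that a string only counts as a member when the entire string fits the pattern.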

Information Markup and Extraction Methods

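The three forms of information markup

Information markup

Marked-up information forms an organizational structure, which adds dimensions to the information.
The markup structure is as valuable as the information itself.
Marked-up information can be used for communication, storage, or display.
Marked-up information is easier for programs to understand and use ...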
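To make the last two points concrete, here is a minimal sketch using JSON as the markup. The excerpt is cut off before it names the three forms, so choosing JSON is an assumption made for this example, and the record contents are made up.

```python
import json

# A made-up record; the keys are the "marks" that give each value its meaning.
record = {
    "title": "Example article",
    "authors": ["A. Author", "B. Author"],
    "year": 2021,
}

# Serialized, the marked-up information is plain text that can be stored or sent.
encoded = json.dumps(record, ensure_ascii=False)

# Because the structure travels with the data, a program can recover each field by name.
decoded = json.loads(encoded)
print(decoded["authors"][0])
print(decoded["year"] + 1)
```

Without the keys, the same values would be an unlabeled sequence and the program would have to guess which one is the year.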

Introduction to the Beautiful Soup Library

Official site: https://www.crummy.com/software/BeautifulSoup/

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it’s been saving programmers hours or days of work on quick-turnaround screen scraping projects.

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it “Find all the links”, or “Find all the links of class externalLink”, or “Find all the links whose urls match "foo.com"”, or “Find the table heading that’s got bold text, then give me that text.” ...
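The first three requests quoted above could be written roughly as follows. This is a sketch on a hand-written HTML snippet, not code from the post; the sample links are made up, and Python's built-in `html.parser` is used so that no extra parser needs to be installed.

```python
import re

from bs4 import BeautifulSoup

# Hand-written sample page with three made-up links.
SAMPLE_HTML = """
<html><body>
  <a href="http://foo.com/a">one</a>
  <a class="externalLink" href="http://example.org/b">two</a>
  <a href="/local">three</a>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "Find all the links"
all_links = [a["href"] for a in soup.find_all("a")]

# "Find all the links of class externalLink"
external_links = [a["href"] for a in soup.find_all("a", class_="externalLink")]

# "Find all the links whose urls match foo.com"
foo_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"foo\.com"))]

print(all_links)
print(external_links)
print(foo_links)
```

`class_` has a trailing underscore because `class` is a reserved word in Python.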

Introduction to the Requests Library

https://python-requests.org/

Installation: pip install requests

Basic usage

```python
import requests

r = requests.get("http://www.baidu.com")
r.status_code  # 200
r.encoding = 'utf-8'
r.text
```

...
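In a script, the same calls are often wrapped in a small function that also handles failures. The sketch below is one possible shape, not the post's own code; the function name `fetch_text`, the 10-second timeout, and the choice to return `None` on failure are assumptions made for this example.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded body of a page, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of checking status_code by hand.
        r.raise_for_status()
        # Use the encoding guessed from the body rather than trusting the headers alone.
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, bad status codes and malformed URLs.
        return None
```

Calling `fetch_text("http://www.baidu.com")` would return the page text when the request succeeds. A URL without a scheme is rejected by Requests before any network traffic happens, so the function returns `None` for it.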