Data Extraction

PubMed: retrieving basic information for a single article

https://pubmed.ncbi.nlm.nih.gov/33883728/

```python
import requests
from lxml import etree

url = "https://pubmed.ncbi.nlm.nih.gov/33883728/"
r = requests.get(url).text
html = etree.HTML(r)

# Article title
title = html.xpath('//*[@id="full-view-heading"]/h1/text()')[0].strip()
print(title)

# Author names, joined into one comma-separated string
authors = html.xpath('//*[@id="full-view-heading"]/div[2]/div/div/span/a/text()')
authors = ','.join(authors)
print(authors)

# PubMed ID
pmID = html.xpath('//*[@id="full-view-identifiers"]/li[1]/span/strong/text()')[0]
print(pmID)

# Journal name
mag = html.xpath('//*[@id="full-view-journal-trigger"]/text()')[0].strip()
print(mag)

# Citation string: the part before ';' starts with the year,
# the part after ';' holds the remaining citation details
info = html.xpath('//*[@id="full-view-heading"]/div[1]/div[2]/span[2]/text()')[0].split(';')
year = info[0][:4]
info = info[1]
print(info)
print(year)

# Abstract (first paragraph only)
abstract = html.xpath('//*[@id="eng-abstract"]/p/text()')[0].strip()
print(abstract)

# Keywords are optional: skip them when the page has no keyword paragraph
try:
    kw = html.xpath('/html/body/div[5]/main/div[2]/p/text()')[1].strip()
    print(kw)
except IndexError:
    pass
```

PubMed: retrieving basic information for multiple articles

Getting the article links: the search results page lists ten articles by default. Start by scraping the link of a single article ...
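As a rough illustration of collecting article links from a search results page, here is a minimal sketch. It makes several assumptions that are not in the original text: that each result title is an `<a>` tag with the class `docsum-title` and a relative `href` such as `/33883728/`, and that the helper names `ResultLinkParser` and `extract_article_links` are hypothetical. It uses only the standard library and parses a hand-written sample instead of fetching a live page, so the markup and the second PMID are placeholders; the same selection could equally be written as an `lxml` XPath query.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://pubmed.ncbi.nlm.nih.gov/"


class ResultLinkParser(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, link_class="docsum-title"):
        super().__init__()
        self.link_class = link_class  # assumed class name of result links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.link_class in classes and href:
            # Turn the relative link into an absolute article URL
            self.links.append(urljoin(BASE_URL, href))


def extract_article_links(page_html):
    """Return absolute article URLs found in a search results page."""
    parser = ResultLinkParser()
    parser.feed(page_html)
    return parser.links


# Hand-written sample standing in for a search results page (placeholder data)
SAMPLE_HTML = """
<div class="search-results">
  <a class="docsum-title" href="/33883728/">First article</a>
  <a class="other" href="/help/">Help</a>
  <a class="docsum-title" href="/12345678/">Second article</a>
</div>
"""

links = extract_article_links(SAMPLE_HTML)
print(links[0])    # link of the first article
print(len(links))  # number of article links found
```

Each URL returned this way could then be passed to the single-article extraction code, one article at a time.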

Retrieving Data with Complex Structure

Information Markup and Extraction Methods