Bypass em tags when extracting contents of class name using Parsel selector

Question:

I'm trying to extract the contents of an element selected by its class name. How do I extract all of the contents, including the text inside the 'em' tags and the text after them?

I tried the following and these were the results:

Trial 1:

from parsel import Selector
from selenium import webdriver

driver = webdriver.Chrome(options=options)  # options is configured elsewhere
sel = Selector(text=driver.page_source)
sel.xpath("//*[@class='st']").extract()

Output 1:

>> <span class="st"><span class="f">Nov 26, 2018 - </span>First #<em>GDPR fine</em> awarded in Germany. 330,000 user data stolen. Usernames and passwords stored in plaintext. €20,000 <em>fine</em>. Why "so low"?</span>

Trial 2:

driver = webdriver.Chrome(options=options)
sel = Selector(text = driver.page_source)
sel.xpath("//*[@class ='st']/text()").extract()

Output 2:

>> First #

Ideally, the output I want to get is:

>> Nov 26, 2018 - First #GDPR fine awarded in Germany. 330,000 user data stolen. Usernames and passwords stored in plaintext. €20,000 fine. Why "so low"?

Answer 1:

I eventually found a way to solve the problem, though not an elegant one. I would still welcome a more elegant solution.

I pulled in the contents of the class name using:

 driver = webdriver.Chrome(options=options)
 sel = Selector(text = driver.page_source)
 content = sel.xpath("//*[@class ='st']").extract()

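I then defined a function that strips the HTML tags and keeps only the text:

import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.result = []

    def handle_data(self, d):
        # Called for every run of text between tags
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

# Module-level helper; it must not be indented inside the class
def html_to_text(markup):
    s = HTMLTextExtractor()
    s.feed(markup)
    s.close()  # flush any text still buffered by the parser
    return s.get_text()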

Looping through the contents in the list and stripping the html one at a time gave me the result I wanted:

  m = []
  for w in content:
      z = html_to_text(w)
      m.append(z)
  • Posted on 2019-03-27 20:29