Select li elements from ul with xpath

Question:

I'm starting out with XPath in lxml on Python 3, and I can't get the right syntax to select all the li elements of a ul together with their content. I'm working with this structure:

<body>
 <div> ..... </div>
 <div> ..... </div>
 <div id="div-A">
  <div id="subdiv-1">
   <form> ... </form>
   <div> ..... </div>
   <div> ..... </div>
   <ul>
    <li>
     <div id="div-1">
      <div> ..... </div>
      <div> ..... </div>
      <div id="subdiv-1">
       <a class="name">
        <span>
          ....text1....
        </span>
       </a>
      </div>
      <div id="subdiv-2">
       <div class="class-1">
        <div class="subClass-1">
         <div> ....text2.... </div>
        </div>
        <span class="subClass-2">
         ....text3....
        </span>
       </div>
      </div>
     </div>
    </li>
    ... x23...
   </ul>
  </div>
 </div>
</body>

My goal is to get text1, text2 and text3.

So first, I try to get all the li elements with their content:

content = html_response.content
fixed_content = fromstring(content)  # parse the HTML and correct malformed HTML
items = fixed_content.xpath('//ul/li/*')

Then I pass items to a function with a for loop that iterates over the 23 li elements. There I try to get the texts:

for item in items:
 text1 = item.xpath('/div[@id="div-1"]/div[@id="subdiv-1"]/a[@class="name"]/span').text_content()
 text2 = item.xpath('/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-1"]/div').text_content()
 text3 = item.xpath('/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-2"]/span[@class="subClass-2"]').text_content()

But in every case I get an empty result with no content. What am I doing wrong?

Regards.


Answer 1:

Try the code below to get the required output:

items = fixed_content.xpath('//ul/li//span | //ul/li//div[@class="subClass-1"]')
for item in items:
    item.text_content().strip()

The output is

'....text1....'
'....text2....'
'....text3....'

or

items = fixed_content.xpath('//ul/li') 
for item in items:
    text1 = item.xpath('.//a[@class="name"]/span')[0].text_content().strip()
    text2 = item.xpath('.//div[@class="subClass-1"]')[0].text_content().strip()
    text3 = item.xpath('.//span[@class="subClass-2"]')[0].text_content().strip()

Use the second form if you want each text in its own variable.
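
For anyone wondering why the original attempt came back empty, here is my reading of it, as a sketch rather than a definitive diagnosis. A path that starts with "/" is absolute, so item.xpath('/div[...]') is evaluated from the document root and ignores item. Also, xpath() returns a list of elements, so text_content() has to be called on an element taken from that list. And '//ul/li/*' selects the children of each li (the div-1 element), not the li itself. The sample markup and the first_text helper below are hypothetical, made up only to mirror the structure in the question:

```python
from lxml.html import fromstring

# Hypothetical sample: two <li> items shaped like the structure in the question.
ITEM = """
<li>
 <div id="div-1">
  <div id="subdiv-1"><a class="name"><span> text1-{n} </span></a></div>
  <div id="subdiv-2">
   <div class="class-1">
    <div class="subClass-1"><div> text2-{n} </div></div>
    <span class="subClass-2"> text3-{n} </span>
   </div>
  </div>
 </div>
</li>
"""
SAMPLE = "<html><body><ul>" + ITEM.format(n="a") + ITEM.format(n="b") + "</ul></body></html>"


def first_text(node, path):
    """Hypothetical helper: stripped text of the first match, or None if no match."""
    matches = node.xpath(path)  # a node-set query always yields a list
    return matches[0].text_content().strip() if matches else None


tree = fromstring(SAMPLE)
items = tree.xpath('//ul/li')  # select the <li> elements themselves

# A leading "/" restarts at the document root, whose element is <html>,
# so this matches nothing even though it is called on an <li>.
absolute_matches = items[0].xpath('/div[@id="div-1"]')

# Paths starting with "." are evaluated relative to each <li>.
rows = [
    (
        first_text(li, './/a[@class="name"]/span'),
        first_text(li, './/div[@class="subClass-1"]'),
        first_text(li, './/span[@class="subClass-2"]'),
    )
    for li in items
]

print(absolute_matches)
print(rows)
```

Returning None for a missing match is just one choice; indexing with [0] as in the answer above is fine when every li is known to contain all three elements.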


Answer 2:

Your XPath queries seem to give the wanted output for me: I get the text1, text2 and text3 results when I write them out completely. With the string() function you can select the inner text value of the matched element:

//ul/li/div[@id="div-1"]/div[@id="subdiv-1"]/a[@class="name"]/span/string(),
//ul/li/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-1"]/div/string(),
//ul/li/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/span[@class="subClass-2"]/string()

Does writing them out and using the string() method not provide the expected text1-3 values for you?
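
One caveat for lxml specifically: it evaluates XPath 1.0 (through libxml2), where string() is a function that wraps an expression. The trailing ".../string()" step is XPath 2.0 syntax, and as far as I know lxml rejects it as an invalid expression. A minimal sketch of the XPath 1.0 form, assuming a hypothetical one-item sample shaped like the question's structure:

```python
from lxml.html import fromstring

# Hypothetical sample with a single <li>, mirroring the question's structure.
SAMPLE = """
<html><body><ul><li><div id="div-1">
 <div id="subdiv-1"><a class="name"><span> text1 </span></a></div>
 <div id="subdiv-2"><div class="class-1">
  <div class="subClass-1"><div> text2 </div></div>
  <span class="subClass-2"> text3 </span>
 </div></div>
</div></li></ul></body></html>
"""

tree = fromstring(SAMPLE)
base = '//ul/li/div[@id="div-1"]'

# string(expr) returns the text of the FIRST node that expr matches.
text1 = tree.xpath(
    'string(' + base + '/div[@id="subdiv-1"]/a[@class="name"]/span)'
).strip()

# normalize-space(expr) does the same and also trims and collapses whitespace.
text2 = tree.xpath(
    'normalize-space(' + base
    + '/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-1"]/div)'
)
text3 = tree.xpath(
    'normalize-space(' + base
    + '/div[@id="subdiv-2"]/div[@class="class-1"]/span[@class="subClass-2"])'
)

print(text1, text2, text3)
```

Because string() only looks at the first matching node, with 23 li elements you would still loop over the items and evaluate a relative expression such as string(.//a[@class="name"]/span) on each one.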


Answer 3:

[i.strip() for i in tree.xpath('//ul//div[@class="subClass-1"]//text()|//ul//span//text()') if i.strip()]
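
The one-liner above collects every non-blank text node under the two selectors into a single flat list, in document order. Here is a runnable sketch of it, assuming tree is the parsed document (fixed_content in the question) and using a hypothetical two-item sample. The grouping by three at the end is my own addition and only holds if every li contributes exactly three strings:

```python
from lxml.html import fromstring

# Hypothetical sample: two <li> items shaped like the structure in the question.
ITEM = """
<li><div id="div-1">
 <div id="subdiv-1"><a class="name"><span> text1-{n} </span></a></div>
 <div id="subdiv-2"><div class="class-1">
  <div class="subClass-1"><div> text2-{n} </div></div>
  <span class="subClass-2"> text3-{n} </span>
 </div></div>
</div></li>
"""
SAMPLE = "<html><body><ul>" + ITEM.format(n="a") + ITEM.format(n="b") + "</ul></body></html>"

tree = fromstring(SAMPLE)

# Union of the text nodes under div.subClass-1 and under any span inside the ul;
# whitespace-only nodes are dropped by the "if i.strip()" filter.
flat = [
    i.strip()
    for i in tree.xpath('//ul//div[@class="subClass-1"]//text()|//ul//span//text()')
    if i.strip()
]

# Assumption: exactly three texts per <li>, so slice the flat list in threes.
grouped = [flat[i:i + 3] for i in range(0, len(flat), 3)]

print(flat)
print(grouped)
```

If some items can be missing a field, the per-li loop from the first answer is the safer route, since a flat list gives no way to tell which li a string came from.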
  • Posted on 2018-09-02 12:25