When using BeautifulSoup, the data I need sits at a different index in some search results

Question:

A website's markup places certain information at a different index within a container from one search result to the next.

I am scraping pieces of data from search results, and in a few cases the position (index number) of the data I need is different.

The exact text I need to scrape from the HTML below is "7XB21".

<dl class="last">
    <dt>Part Code:</dt>
    <dd>
        "7XB21"
        <span class="separator">,</span>
    </dd>
    <dt>Weight:</dt>
    <dd>97</dd>
</dl>

This is easy to do with the Python code below, which gives me the result I need, "7XB21":

# Collect every <dd> in the result container, then pick the part code by position
modelcode_container = container.find_all("dd")
modelcode = modelcode_container[5].text

However, some of the scraped HTML, while structured the same way, does not contain the same set of fields as the example above. Here is an example of the troublesome HTML:

<dl class="last">
    <dt>Stock id:</dt>
    <dd>c12
        <span class="separator">,</span>
    </dd>
    <dt>Part Code:</dt>
    <dd>
        "8B727"
        <span class="separator">,</span>
    </dd>
    <dt>Weight:</dt>
    <dd>102</dd>
</dl>

Do you see the difference? The "Stock id" entry shifts every <dd> after it by one position, so I would need to specify a different index number to capture the correct data, which is "8B727" in this case.

I am not sure how to set that up. Any help would be appreciated. Thank you!


Answer 1:

If you are certain that <dt>Part Code:</dt> always comes right before the value you want, you could find that dt by its text and use find_next_sibling() to get the dd tag next to it, whatever its index is.

soup.find('dt',text="Part Code:").find_next_sibling('dd')
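
To make the idea concrete, here is a minimal sketch that looks a value up by its <dt> label instead of by position. The helper name extract_field and the two sample snippets are illustrative assumptions, not part of the original code. It also assumes the label text matches exactly (including the colon) and that the quotation marks around the value may or may not be present in the real markup. It uses string=, which is the current name for the text= argument in recent BeautifulSoup versions.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the two cases in the question (assumed, trimmed).
HTML_PLAIN = """
<dl class="last">
    <dt>Part Code:</dt>
    <dd>
        "7XB21"
        <span class="separator">,</span>
    </dd>
    <dt>Weight:</dt>
    <dd>97</dd>
</dl>
"""

HTML_WITH_STOCK_ID = """
<dl class="last">
    <dt>Stock id:</dt>
    <dd>c12
        <span class="separator">,</span>
    </dd>
    <dt>Part Code:</dt>
    <dd>
        "8B727"
        <span class="separator">,</span>
    </dd>
    <dt>Weight:</dt>
    <dd>102</dd>
</dl>
"""


def extract_field(container, label):
    """Return the value in the <dd> that follows the <dt> whose text is `label`.

    Hypothetical helper: returns None when the label or its <dd> is missing.
    """
    dt = container.find("dt", string=label)
    if dt is None:
        return None
    dd = dt.find_next_sibling("dd")
    if dd is None:
        return None
    # The first non-blank text node is the value; the separator span comes later.
    first_text = next(dd.stripped_strings, None)
    if first_text is None:
        return None
    # Drop surrounding quotation marks if the markup really contains them.
    return first_text.strip('"')


for html in (HTML_PLAIN, HTML_WITH_STOCK_ID):
    container = BeautifulSoup(html, "html.parser")
    print(extract_field(container, "Part Code:"))
# prints 7XB21, then 8B727
```

Because the lookup is keyed on the label, the same call works whether or not a "Stock id" entry precedes the part code.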
  • Posted on 2019-02-21 07:33