How to extract data from HTML using Beautiful Soup

Question:

I am trying to scrape a web page and store the results in a CSV/Excel file. I am using Beautiful Soup for this.

I am trying to extract the data from the soup using the find_all function, but I am not sure how to capture the data in the name field or the title attribute.

The HTML has the following format:

<h3 class="font20">
 <span itemprop="position">36.</span> 
 <a class="font20 c_name_head weight700 detail_page" 
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank" 
 title="Nimblechapps Pvt. Ltd."> 
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>

This is my code so far. I am not sure how to proceed from here:

from bs4 import BeautifulSoup as BS
import requests
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all(class_ = 'font20 c_name_head weight700 detail_page')
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})

I have tried the following:

Input: cont.h3.a.span
Output: <span itemprop="name">Nimblechapps Pvt. Ltd.</span>

I want to extract the name of the company - "Nimblechapps Pvt. Ltd."


Answer 1:

You can use a list comprehension for that:

from bs4 import BeautifulSoup as BS
import requests

page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})
print([n.text for n in names])

You will get:

['Nimblechapps Pvt. Ltd.', (..) , 'InnoApps Technologies Pvt. Ltd', 'Umbrella IT', 'iQlance Solutions', 'getyoteam', 'JetRuby Agency LTD.', 'ONLINICO', 'Dedicated Developers', 'Appingine', 'webnexs']
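
The question also asks about the title attribute. Here is a minimal sketch, assuming the live markup matches the snippet in the question (the HTML string below is copied from it, so no network request is made). `get_text(strip=True)` trims the surrounding whitespace that `.text` keeps, and indexing the tag with `["title"]` reads the attribute value.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
 title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches a single class even when the tag carries several.
link = soup.find("a", class_="detail_page")

name_from_text = link.get_text(strip=True)  # tag text with whitespace trimmed
name_from_title = link["title"]             # value of the title attribute

print(name_from_text)
print(name_from_title)
```

Either value should work as the company name here; the title attribute avoids depending on the nested span at all.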

Answer 2:

The same result, using the descendant combinator (a space) to combine the type selector a with the attribute selector [itemprop="name"]:

names = [item.text for item in cont.select('a [itemprop="name"]')]
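
To see what the selector matches without hitting the site, it can be run against the snippet from the question. This is only a sketch under that assumption; the second selector, `a.detail_page[title]`, is an illustrative addition that reads the title attribute instead of the nested span.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
 title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Any element with itemprop="name" nested anywhere inside an <a>.
# The itemprop="position" span is skipped: wrong value, and not inside an <a>.
names = [tag.get_text(strip=True) for tag in soup.select('a [itemprop="name"]')]

# <a> tags that have the detail_page class and carry a title attribute.
titles = [a["title"] for a in soup.select("a.detail_page[title]")]

print(names)
print(titles)
```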

Answer 3:

Avoid matching on compound class strings in the script, as they are prone to break. The following script should fetch the required content as well.

import requests
from bs4 import BeautifulSoup

link = "https://www.goodfirms.co/directory/platform/app-development/iphone?page=2"

res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.find_all(class_="commoncompanydetail"):
    names = items.find(class_='detail_page').text
    print(names)
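
Since the goal in the question is a CSV file, the extracted values can be written out with the standard `csv` module. This is a rough sketch, assuming the markup from the question; the column names `name` and `url` and the filename mentioned in the comment are hypothetical choices, not something the question specifies.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
html = """
<h3 class="font20">
 <span itemprop="position">36.</span>
 <a class="font20 c_name_head weight700 detail_page"
 href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
 title="Nimblechapps Pvt. Ltd.">
     <span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
"""

soup = BeautifulSoup(html, "html.parser")

# One dict per company link: trimmed link text plus the relative href.
rows = []
for link in soup.find_all("a", class_="detail_page"):
    rows.append({"name": link.get_text(strip=True), "url": link.get("href", "")})

# Written to an in-memory buffer so the sketch has no side effects.
# For a real file, open("companies.csv", "w", newline="", encoding="utf-8")
# could be used instead (hypothetical filename).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "url"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

With the live page, the same loop would run over `soup` built from `res.text`, producing one row per company link found.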
  • Posted on 2018-12-29 08:53