Question:
I am trying to scrape a web page and store the results in a CSV/Excel file. I am using Beautiful Soup for this.
I am trying to extract the data from the soup using the find_all function, but I am not sure how to capture the data in the name or title field.
The HTML has the following format:
<h3 class="font20">
<span itemprop="position">36.</span>
<a class="font20 c_name_head weight700 detail_page"
href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
title="Nimblechapps Pvt. Ltd.">
<span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
This is my code so far. I am not sure how to proceed from here:
from bs4 import BeautifulSoup as BS
import requests
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all(class_ = 'font20 c_name_head weight700 detail_page')
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})
I have tried the following:
Input: cont.h3.a.span
Output: <span itemprop="name">Nimblechapps Pvt. Ltd.</span>
I want to extract the name of the company - "Nimblechapps Pvt. Ltd."
Answer 1:
You can use a list comprehension for that:
from bs4 import BeautifulSoup as BS
import requests
page = 'https://www.goodfirms.co/directory/platform/app-development/iphone?page=2'
res = requests.get(page)
cont = BS(res.content, "html.parser")
names = cont.find_all('a' , attrs = {'class':'font20 c_name_head weight700 detail_page'})
print([n.text for n in names])
You will get:
['Nimblechapps Pvt. Ltd.', (..) , 'InnoApps Technologies Pvt. Ltd', 'Umbrella IT', 'iQlance Solutions', 'getyoteam', 'JetRuby Agency LTD.', 'ONLINICO', 'Dedicated Developers', 'Appingine', 'webnexs']
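The question also asks about the title attribute and about saving the results to CSV. Here is a minimal offline sketch of both, run against the sample markup from the question instead of the live page (which may have changed). The helper names extract_companies and to_csv are hypothetical, and matching on the single class detail_page is an assumption borrowed from Answer 3.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
SAMPLE_HTML = """
<h3 class="font20">
<span itemprop="position">36.</span>
<a class="font20 c_name_head weight700 detail_page"
href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
title="Nimblechapps Pvt. Ltd.">
<span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
"""


def extract_companies(html):
    """Return one dict per company link found in the markup (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a", class_="detail_page"):
        rows.append({
            # get_text(strip=True) drops the surrounding newlines and spaces.
            "name": link.get_text(strip=True),
            # Attributes are read like dictionary keys; .get avoids a KeyError.
            "title": link.get("title", ""),
            "href": link.get("href", ""),
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text in memory (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "title", "href"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


companies = extract_companies(SAMPLE_HTML)
print(companies[0]["name"])
print(to_csv(companies))
```

To write a file instead of a string, the same DictWriter can be pointed at open("companies.csv", "w", newline="", encoding="utf-8"); the file name here is only an example.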
Answer 2:
The same thing, but using the descendant combinator " " (a space) to combine the type selector a with the attribute = value selector [itemprop="name"]:
names = [item.text for item in cont.select('a [itemprop="name"]')]
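To see what the space in that selector does, here is a small sketch against the sample markup from the question (not the live page). It compares the descendant form with two nearby selectors; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the live page may differ.
SAMPLE_HTML = """
<h3 class="font20">
<span itemprop="position">36.</span>
<a class="font20 c_name_head weight700 detail_page"
href="/companies/view/1033/nimblechapps-pvt-ltd" target="_blank"
title="Nimblechapps Pvt. Ltd.">
<span itemprop="name">Nimblechapps Pvt. Ltd. </span>
</a> </h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# With the space: any element carrying itemprop="name" nested inside an <a>.
inside_links = [el.get_text(strip=True) for el in soup.select('a [itemprop="name"]')]

# Attribute selector alone: also picks up the position span outside the <a>.
all_itemprops = [el["itemprop"] for el in soup.select("[itemprop]")]

# Without the space: the <a> itself would need the attribute, so nothing matches here.
no_space = soup.select('a[itemprop="name"]')

print(inside_links)
print(all_itemprops)
print(len(no_space))
```

The `a` prefix is what keeps the position span out of the result, and get_text(strip=True) is used here instead of .text to trim the trailing space in the sample.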
Answer 3:
Try not to match on compound class strings (several class names in one value) in the script, as they tend to break when the page markup changes. The following script should fetch the required content as well.
import requests
from bs4 import BeautifulSoup
link = "https://www.goodfirms.co/directory/platform/app-development/iphone?page=2"
res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for items in soup.find_all(class_="commoncompanydetail"):
    names = items.find(class_='detail_page').text
    print(names)
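As a rough illustration of why compound classes are fragile: Beautiful Soup compares a class_ value that contains spaces against the exact string of the class attribute, so reordering the classes is enough to lose the match. The REORDERED markup below is made up for the demonstration, and count_matches is a hypothetical helper.

```python
from bs4 import BeautifulSoup

COMPOUND = "font20 c_name_head weight700 detail_page"

# Markup as in the question (shortened to the relevant attributes).
ORIGINAL = '<a class="font20 c_name_head weight700 detail_page">Nimblechapps Pvt. Ltd.</a>'

# Hypothetical variant: same classes, different order.
REORDERED = '<a class="detail_page font20 c_name_head weight700">Nimblechapps Pvt. Ltd.</a>'


def count_matches(html, class_value):
    """Count <a> tags matched by the given class_ filter (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("a", class_=class_value))


# The compound string only matches when the attribute value is identical.
print(count_matches(ORIGINAL, COMPOUND))
print(count_matches(REORDERED, COMPOUND))

# A single class name matches regardless of order or extra classes.
print(count_matches(ORIGINAL, "detail_page"))
print(count_matches(REORDERED, "detail_page"))
```

Matching on one stable class, as the script above does with detail_page, survives this kind of change.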