How to read an HTML file in Python [Python][Beautifulsoup]

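from bs4 import BeautifulSoup
import re

links = []
html_path = "./hoge.html"

# Read the HTML file into a string first.
with open(html_path) as f:
    html = f.read()

# Parse the string; naming the parser explicitly means bs4 does not have to guess.
soup = BeautifulSoup(html, "html.parser")
parsed_links = soup.find_all("a")

for link in parsed_links:
    target_link = link.get("href")  # None when the <a> tag has no href
    # Keep only hrefs that contain "http://".
    if target_link and re.search("http://", target_link):
        links.append(target_link)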

The links you want accumulate in links.

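The trick is to open the HTML file with open first and then parse its contents with BeautifulSoup. BeautifulSoup takes the markup itself, not a file path, so passing the path string directly would parse the path as if it were HTML.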
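The read-then-parse step can be wrapped in a small helper. This is only a sketch: the name load_soup is hypothetical, and the UTF-8 encoding and the built-in "html.parser" are assumptions that the original snippet does not specify.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(html_path, encoding="utf-8"):
    """Read an HTML file from disk and return the parsed BeautifulSoup tree.

    load_soup is a hypothetical helper; the utf-8 default is an assumption.
    """
    # Step 1: read the file contents into a string.
    html = Path(html_path).read_text(encoding=encoding)
    # Step 2: hand the markup (not the path) to BeautifulSoup.
    return BeautifulSoup(html, "html.parser")
```

With this in place, soup = load_soup("./hoge.html") gives the same kind of object the snippet above builds by hand.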

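The re.search part is there to filter out junk: it keeps only hrefs that contain "http://", which drops relative paths, in-page anchors and the like. Note that this pattern also drops https:// links.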
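If https links should be kept as well, one alternative to the regular expression is to check the URL scheme with the standard library's urllib.parse. This is a different technique from the one in the snippet, shown only as a sketch; the function name keep_web_links and the default scheme list are assumptions.

```python
from urllib.parse import urlparse


def keep_web_links(hrefs, schemes=("http", "https")):
    """Return only the hrefs whose URL scheme is in `schemes`.

    keep_web_links is a hypothetical helper, not part of bs4.
    """
    kept = []
    for href in hrefs:
        # link.get("href") gives None for <a> tags without an href.
        if not href:
            continue
        # Relative paths and "#anchor" links have an empty scheme,
        # and "mailto:" or "javascript:" links have a different one.
        if urlparse(href).scheme in schemes:
            kept.append(href)
    return kept
```

It could be used as links = keep_web_links(link.get("href") for link in parsed_links).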
