How to Use BeautifulSoup4 with Python: A Summary

2021. 7. 18. 19:30

BeautifulSoup handles the parsing step: it converts HTML code into an object structure that Python can work with.
With this library we can extract the information that is actually meaningful to us.

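Parser

A parser is the part of a compiler or interpreter that takes input such as source-program statements, online commands, or the markup tags of an HTML document, and splits it into units whose syntax can be analyzed.
In other words, it is the program that reads the source text and performs parsing, that is, works out the structure of each statement.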
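
To make the idea of "splitting markup into units" concrete, here is a minimal sketch that uses only the standard library's html.parser module (the same parser this post later hands to BeautifulSoup). The TokenCollector class and the sample HTML string are made up for illustration; they are not part of BeautifulSoup.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Hypothetical helper: record each start tag, end tag, and text run."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the token list stays readable.
        if data.strip():
            self.tokens.append(("text", data.strip()))


collector = TokenCollector()
collector.feed("<p>Hello <b>world</b></p>")
collector.close()
tokens = collector.tokens

for kind, value in tokens:
    print(kind, value)
```

BeautifulSoup builds on events like these and assembles them into a tree of tag objects, so you normally never have to write such a class yourself.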

https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/
bs4 doc

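Installation
pip install beautifulsoup4 requests

(The bs4 name on PyPI is only a wrapper that installs beautifulsoup4; requests is used by the usage example that follows.)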

Usage
import requests
from bs4 import BeautifulSoup

# Send an HTTP GET request ('URL' is a placeholder for the target address)
req = requests.get('URL')
# Get the HTML source
html = req.text
# Convert the HTML source into a Python object with BeautifulSoup.
# The first argument is the HTML source; the second names the parser to use.
# This post uses Python's built-in html.parser.
soup = BeautifulSoup(html, 'html.parser')

The soup object now holds the parsed document.
From here, all that is left is to pull the information you want out of soup.
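
If you want to try this without making a network request, BeautifulSoup accepts any HTML string. The following is a small sketch under that assumption; the HTML content is invented for illustration.

```python
from bs4 import BeautifulSoup

# A made-up document standing in for req.text.
html = """
<html>
  <head><title>Sample page</title></head>
  <body><p class="intro">Hello</p></body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text of the <title> tag
intro_text = soup.p.get_text()   # text of the first <p> tag

print(title_text)
print(intro_text)
```

Working from a fixed string like this is also a convenient way to experiment with the lookups described in the rest of this post.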

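Accessing by tag name
soup.title
soup.body
soup.head
...

Each of these returns the first tag with that name.

Getting an attribute value
soup.p['class']
For class the result is a list, because the attribute can hold multiple values.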
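
A short sketch of tag-name and attribute access on a made-up snippet. The HTML and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Demo</title></head>"
    '<body><p class="lead big" id="first">Hi</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title         # first <title> tag in the document
p_classes = soup.p["class"]    # class is multi-valued, so this is a list
p_id = soup.p["id"]            # ordinary attributes come back as strings
missing = soup.p.get("align")  # .get() returns None instead of raising KeyError

print(title_tag.name, p_classes, p_id, missing)
```

Using .get() rather than square brackets is the safer choice when you are not sure the attribute exists on every tag.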

=============================find / find_all======================
Finding specific tags
soup.find('a')      # returns only the first a tag in the document (None if there is none)
soup.find_all('a')  # returns every a tag

Limiting the number of results
soup.find_all("a", limit=2)

soup.find_all(["a","p"])  # finds both a and p tags

soup.find('a').text
soup.find('a').get_text()
Both return the text inside the a tag that was found.

soup.find_all('a','title')  # a tags whose class is title
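
The calls above can be tried together on a small made-up document. This is only a sketch; the HTML and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p>intro</p>
  <a href="/one" class="title">One</a>
  <a href="/two">Two</a>
  <a href="/three" class="title">Three</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                  # first <a> in document order
all_links = soup.find_all("a")               # every <a>
two_links = soup.find_all("a", limit=2)      # stop after two matches
links_and_paras = soup.find_all(["a", "p"])  # <a> and <p>, in document order
title_links = soup.find_all("a", "title")    # <a> tags with class "title"
no_match = soup.find("table")                # None, because there is no <table>

print(first_link.get_text())
print([tag.name for tag in links_and_paras])
print([tag.get_text() for tag in title_links])
```

Because find returns None when nothing matches, it is worth checking the result before calling .text or .get_text() on it.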

Finding by id
soup.find('a', {'id':'bodyContent'})
soup.find_all('a',id='logo')

Finding by class
soup.find('a',class_='logo')
soup.find('p','cssclass')  # a string as the second argument is matched against class
soup.find_all('div',{'class':'modal-body'})
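
Here is a sketch that exercises each of the id and class forms on one invented snippet, to show that they are interchangeable ways of expressing the same kind of filter. The HTML is an assumption made up for this example.

```python
from bs4 import BeautifulSoup

html = """
<div id="bodyContent" class="modal-body">
  <a id="logo" class="logo" href="/">Home</a>
  <p class="cssclass">text</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id_dict = soup.find("div", {"id": "bodyContent"})            # attrs dictionary
by_id_kw = soup.find_all("a", id="logo")                        # keyword argument
by_class_kw = soup.find("a", class_="logo")                     # class_ avoids the reserved word
by_class_pos = soup.find("p", "cssclass")                       # positional string means class
by_class_dict = soup.find_all("div", {"class": "modal-body"})   # attrs dictionary

print(by_id_dict["id"], len(by_id_kw), by_class_kw["href"])
print(by_class_pos.get_text(), len(by_class_dict))
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.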

Finding by attribute and value
soup.find('p',attrs = {'align' : 'center'})

soup.find('tag').attrs['attribute']
Returns the value of the given attribute on the tag that was found.
ex)
...
<a href="/index.php?mid=index">index</a>
<img alt="cafe" src="/images/cafe.png"/>
...
soup.find_all('a')[2].attrs['href']   # '/index.php?mid=index'
soup.find_all('img')[0].attrs['src']  # '/images/cafe.png'
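
A runnable sketch of the same idea, using just the two tags from the example (so the link is at index 0 here rather than 2). The surrounding page is not reproduced; this snippet is an assumption for illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<a href="/index.php?mid=index">index</a>'
    '<img alt="cafe" src="/images/cafe.png"/>'
)
soup = BeautifulSoup(html, "html.parser")

href = soup.find_all("a")[0].attrs["href"]  # via the attrs dictionary
src = soup.find_all("img")[0]["src"]        # shorthand: index the tag directly
all_attrs = soup.find("img").attrs          # every attribute as a dict

print(href)
print(src)
print(all_attrs)
```

Indexing the tag directly (tag["src"]) and going through tag.attrs["src"] read the same underlying dictionary, so use whichever reads better.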

===================select==================================
select uses a CSS selector and returns every matching element as a list.

ex)
body > h3:nth-child(4) > a
h3 > a

With >, only direct children are matched.
selector = soup.select('h3 > a')[0]  # select returns a list, so take one element first
selector.text
selector.get('href')

With a space, descendants are matched;
they do not have to be direct children.
selector = soup.select('html body div')[0]
selector.text
selector.get_text()
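
The difference between the child combinator (>) and the descendant combinator (a space) is easiest to see side by side. This is a minimal sketch on made-up HTML; select_one is BeautifulSoup's companion method that returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <h3><a href="/direct">direct</a></h3>
  <h3><span><a href="/nested">nested</a></span></h3>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

children = soup.select("h3 > a")   # only <a> tags directly inside an <h3>
descendants = soup.select("h3 a")  # <a> tags at any depth inside an <h3>
first = soup.select_one("h3 > a")  # first match, or None if nothing matches

print([a.get_text() for a in children])
print([a.get_text() for a in descendants])
print(first.get("href"))
```

When only one element is expected, select_one avoids the [0] indexing and the IndexError that comes with an empty result.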

Searching by class
selector = soup.select('div.class_name')[0]

Searching by id
selector = soup.select('#id_name')  # still a list, even for a single match

Searching the descendants of a tag whose id is myid
selector = soup.select('div#myid div#Child tag')[0]
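
The three selector forms can be checked against a small invented document. In this sketch the placeholder tag is replaced by span, which is an assumption made only so the example has something to match.

```python
from bs4 import BeautifulSoup

html = """
<div id="myid">
  <div id="Child"><span>inner</span></div>
</div>
<div class="class_name">by class</div>
<p id="id_name">by id</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("div.class_name")[0]         # .name selects by class
by_id = soup.select("#id_name")                     # #name selects by id; result is a list
nested = soup.select("div#myid div#Child span")[0]  # descendants of nested ids

print(by_class.get_text())
print(len(by_id), by_id[0].get_text())
print(nested.get_text())
```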

=====================Finding by string=======================

soup.find_all(string="text to find")  # matches strings that equal the text exactly
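
Since a plain string only matches text that is exactly equal, a compiled regular expression is one way to get partial matches. The following is a sketch on made-up HTML; it assumes a BeautifulSoup version recent enough to accept the string argument.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>apple</li><li>apple pie</li><li>banana</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="apple")                # only text equal to "apple"
partial = soup.find_all(string=re.compile("apple"))  # any text containing "apple"
parent_tag = exact[0].parent                         # the tag that holds the matched text

print(exact)
print(partial)
print(parent_tag.name)
```

The results are text nodes rather than tags, so use .parent when you need the enclosing element.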