Crawling with the requests and Beautiful Soup packages

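requests is a package built on top of urllib3 (a third-party HTTP library, not part of the Python standard library). We'll use it to build a simple scraper, then turn the HTML fetched with GET into a soup so that we can extract only the parts we need.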

The requests package helps you talk to a server, much like axios does in JavaScript, and it covers HTTP methods such as GET, POST, and DELETE. Read the Quickstart guide.
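To get a feel for how requests maps onto HTTP methods, here is a minimal sketch that only builds requests and never sends them. The example.com URLs and the query parameters are made up for illustration; requests.Request(...).prepare() shows exactly what would go over the wire.

```python
import requests

# Hypothetical endpoint, used only for illustration. Nothing is sent over the network.
BASE_URL = "https://example.com/jobs"

# GET: the params dict is encoded into the query string.
get_request = requests.Request("GET", BASE_URL, params={"q": "python", "pg": 2}).prepare()

# DELETE: same API, different HTTP method.
delete_request = requests.Request("DELETE", f"{BASE_URL}/42").prepare()

print(get_request.method, get_request.url)
print(delete_request.method, delete_request.url)
```

In everyday code you would normally call the shortcuts instead, for example requests.get(url, params=...), which build and send the request in one step.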

Beautiful Soup describes itself as a "Python library for pulling data out of HTML and XML files". Read the Quick Start section of its documentation.

psf/requests (github.com): A simple, yet elegant HTTP library.

Beautiful Soup: We called him Tortoise because he taught us. (www.crummy.com)

🚀 Installation

pip install requests
pip install beautifulsoup4

🚀 Usage

Fetch the page with requests, parse the response text with html.parser to build a soup, then use methods such as find and find_all on that soup to pull out the information you want.
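The parse-then-search flow can be tried without any network access. This is a minimal sketch on an inline HTML string; the markup and the class name "job" are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a response body (result.text).
html = """
<html><body>
  <h1 id="title">Job board</h1>
  <ul>
    <li class="job">Python Developer</li>
    <li class="job">Data Engineer</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None if nothing matches).
heading_text = soup.find("h1").get_text(strip=True)

# find_all returns a list of every matching tag.
job_titles = [li.get_text(strip=True) for li in soup.find_all("li", class_="job")]

print(heading_text)
print(job_titles)
```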

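Where possible, find_all the repeating unit first (one component, such as a single card), then chain find calls inside each unit to reach the details, and handle None for anything that is missing.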
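Here is a small sketch of that "unit first, details second" pattern. The card markup, the class names, and the text_or_none helper are all hypothetical; the point is that the second card has no company tag, so find returns None and has to be handled.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second card is missing its company <span>.
html = """
<div class="card"><a class="title" href="/jobs/1">Python Developer</a><span class="company">Acme</span></div>
<div class="card"><a class="title" href="/jobs/2">Data Engineer</a></div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")

cards = []
for card in soup.find_all("div", class_="card"):  # one iteration per unit
    cards.append({
        "title": text_or_none(card.find("a", class_="title")),
        "company": text_or_none(card.find("span", class_="company")),
    })

print(cards)
```

Putting the None check in one helper keeps each field lookup to a single line and avoids calling a method on a missing tag.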

Alternatively, pin down one unique element with find, then use find_all to collect the repeated items inside it.
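The reverse pattern, one unique container and then everything inside it, fits pagination bars well. This is a sketch on invented markup that borrows the s-pagination class names; it keeps only the numeric labels so a "next" link is ignored.

```python
from bs4 import BeautifulSoup

# Made-up pagination bar for illustration.
html = """
<div class="s-pagination">
  <a class="s-pagination--item"><span>1</span></a>
  <a class="s-pagination--item"><span>2</span></a>
  <a class="s-pagination--item"><span>3</span></a>
  <a class="s-pagination--item"><span>next</span></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the single, unique container.
pagination = soup.find("div", class_="s-pagination")

# Step 2: find_all the repeated items inside it (empty list if the container is missing).
links = pagination.find_all("a", class_="s-pagination--item") if pagination else []

# Keep only labels that are numbers.
page_numbers = [int(a.get_text(strip=True)) for a in links if a.get_text(strip=True).isdigit()]
max_page = max(page_numbers, default=1)

print(page_numbers, max_page)
```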

Read the documentation and use whichever approach fits.

As the section linked below shows, there are many ways to search. Don't skip it out of laziness; read it.

www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class

import requests
from bs4 import BeautifulSoup

URL = "https://stackoverflow.com/jobs?q=python"


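def extract_stackoverflow_pages():
    """Return the last page number read from the pagination bar."""
    stackoverflow_result = requests.get(URL)
    soup = BeautifulSoup(stackoverflow_result.text, "html.parser")

    pagination = soup.find("div", class_="s-pagination")
    if pagination is None:
        # No pagination bar found: treat the results as a single page.
        return 1
    page_list = pagination.find_all("a", class_="s-pagination--item")

    pages = []
    # The last two links are skipped and the rest are read as page numbers.
    for item in page_list[:-2]:
        label = item.find("span")
        if label is not None and label.string is not None:
            pages.append(int(label.string))

    # Fall back to 1 when no page number could be read.
    max_page = pages[-1] if pages else 1

    return max_page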


def extract_stackoverflow_jobs(max_page):
    jobs = []

    # Crawl each result page in turn.
    for page in range(1, max_page + 1):
        print(f"Scraping stackoverflow page: {page}")
        result = requests.get(f"{URL}&pg={page}")
        soup = BeautifulSoup(result.text, "html.parser")

        for card in soup.find_all("div", {"class": "-job"}):
            # Reset every field so a missing tag never reuses the previous card's value.
            title = location = company = link = None

            # Title
            pre_title = card.find("a", {"class": ["s-link", "stretched-link"]})
            if pre_title is not None:
                title = pre_title.get_text(strip=True)

            # Location and company both sit inside the same <h3 class="mb4">.
            header = card.find("h3", {"class": "mb4"})
            if header is not None:
                pre_location = header.find("span", {"class": "fc-black-500"})
                if pre_location is not None:
                    location = pre_location.get_text(strip=True)

                pre_company = header.find("span")
                if pre_company is not None:
                    company = pre_company.get_text(strip=True)

            # Link, built from the card's data-jobid attribute (.get avoids a KeyError).
            job_id = card.get("data-jobid")
            if job_id is not None:
                link = "https://stackoverflow.com/jobs/" + job_id

            jobs.append({"title": title, "location": location, "company": company, "link": link})

    return jobs


def get_stackoverflow_jobs():
    jobs = extract_stackoverflow_jobs(extract_stackoverflow_pages())
    return jobs

from stackoverflow import get_stackoverflow_jobs

print(get_stackoverflow_jobs())
# output: [{'title': 'Python Software Engineer', 'location': 'Amsterdam, Netherlands',
# 'company': 'bloomon', 'link': 'https://stackoverflow.com/jobs/286140'},
# {'title': 'Python Software Engineer (40h/wk)', 'location': 'Utrecht, Netherlands',
# 'company': 'Channable', 'link': 'https://stackoverflow.com/jobs/265091'},
# ... (omitted)
# ]

# https://requests.readthedocs.io/en/master/user/quickstart/#quickstart
import requests

# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
from bs4 import BeautifulSoup

BASE_URL = "https://gall.dcinside.com/mgallery/board/lists"
PARAMS = {"id": "nouvellevague", "page": "1"}
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.152 Safari/537.36"
HEADERS = {'user-agent': USER_AGENT}

result = requests.get(BASE_URL, params=PARAMS, headers=HEADERS, timeout=2.000)

print(result.status_code)

soup = BeautifulSoup(result.text, "html.parser")

# print(soup.prettify())

# Selecting DOM nodes by class: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
# Written like this, only tags whose class attribute is exactly "gall_tit ub-word"
# (both classes, in this order) are returned.
td = soup.find_all("td", class_="gall_tit ub-word")
for i in td:
    k = i.find("a").text
    print(k)
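The exact-string behaviour of class_ is easy to trip over, so here is a small offline sketch comparing three ways to search. The table markup is invented; only the class names gall_tit and ub-word come from the script above.

```python
from bs4 import BeautifulSoup

# Made-up rows: same two classes in both orders, plus a row with only one class.
html = """
<table>
  <tr><td class="gall_tit ub-word"><a>first</a></td></tr>
  <tr><td class="ub-word gall_tit"><a>second</a></td></tr>
  <tr><td class="gall_tit"><a>third</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A multi-class string matches the class attribute exactly, order included.
exact = [td.a.get_text() for td in soup.find_all("td", class_="gall_tit ub-word")]

# A single class matches any tag that carries it, whatever else is there.
single = [td.a.get_text() for td in soup.find_all("td", class_="gall_tit")]

# A CSS selector matches tags that have both classes, in any order.
both = [td.a.get_text() for td in soup.select("td.gall_tit.ub-word")]

print(exact)   # ['first']
print(single)  # ['first', 'second', 'third']
print(both)    # ['first', 'second']
```

If the order of classes on the real page is not guaranteed, the select form is probably the safer choice.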

🚀 What if you need to crawl a page whose URL doesn't change?

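A typical example is this site: (http://www.cine21.com/rank/person).

Clicking the pagination links does not change the URL in the address bar.

That is because the page data is requested with POST rather than GET.

If you watch the Network tab in the browser's developer tools, you can confirm that the POST method is being used.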
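Once the Network tab shows which URL and form fields the page posts, the same request can be rebuilt with requests. The sketch below is only an assumption about the shape of such a request: the example.com endpoint and the section and page fields are hypothetical placeholders, not the real site's API. It builds the request without sending it so the encoded body can be inspected.

```python
import requests

# Hypothetical endpoint and form fields; copy the real ones from the Network tab.
POST_URL = "https://example.com/rank/content"


def build_page_request(page):
    """Build (but do not send) the POST request for one page of results."""
    form = {"section": "actor", "page": page}
    return requests.Request("POST", POST_URL, data=form).prepare()


prepared = build_page_request(2)

# The page number travels in the request body, which is why the URL never changes.
print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

To actually fetch a page you would send the same data with requests.post(POST_URL, data=form, timeout=...) and parse the response as usual.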

There is a project that deals with this case, so see the post below.

(https://darrengwon.tistory.com/451)

darren, dev blog