POST 방식의 웹 사이트 python crawling 후 pymongo 저장

http://www.cine21.com/rank/person의 정보를 크롤링해서 정보를 가공한 후 pymongo에 저장해보자.

jupyter notebook에서 진행했다.

홈페이지를 살펴보면, POST로 req를 날리기 때문에 url이 바뀌지 않는 것을 볼 수 있다. 관리자 도구의 Network 부분에서 정보를 POST하는 부분을 살펴보면 알 수 있다.

그렇다면 해당 post를 날리기 위한 추가적인 정보 (예를 들면 form data)가 있을 터이다. 역시나 해당 부분을 살펴보면 Form Data를 살펴볼 수 있다.

그렇다면 requests를 할 때 post 방식으로 정보와 함께 날려주면 된다. 아래는 그 예시이다.

# mongoDB connect
connection = pymongo.MongoClient('mongodb://3.34.9.105', 27017)

# virable

url = "http://www.cine21.com/rank/person/content"
data={"section": "actor", "period_start": "2020-05", "gender": "all", "page": 2}

# request(POST)
res = requests.post(url, data)

# soup
soup = BeautifulSoup(res.content, "html.parser")

# crawling
actor_list = soup.find_all("li", {'class':['people_li']})

actor_url =[]
for el in actor_list:
    url = "http://www.cine21.com/" + el.find("div", {'class': ["name"]}).find("a")['href']
    name_before = el.find("div", {'class': ["name"]}).find("a").text
    name = re.sub("\(\w*\)", "", name_before)

    actor_url.append([name, url])

# db 생성
db = connection.actor_db

# 컬렉션 생성
actor = db.actor


for i in actor_url:
    actor.insert_one({"name": i[0], "url": i[1]})

저작자표시

'Crawler > Crawler' 카테고리의 다른 글

WebDriver, WebElements, Waiting (0)	2020.09.07
selenium 설치 및 간단한 이용법 (0)	2020.09.07
requests와 beautiful soup 패키지를 활용한 crawling (0)	2020.05.08

일	월	화	수	목	금	토
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

darren, dev blog

POST 방식의 웹 사이트 python crawling 후 pymongo 저장

'Crawler > Crawler' 카테고리의 다른 글

티스토리툴바

'Crawler > Crawler' 카테고리의 다른 글

검색

티스토리툴바