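Exercise Description
Write a function, clean_by_stopwords(), that removes stopwords.
● clean_by_stopwords() takes two parameters: a word-tokenized corpus (tokenized_words) and a stopword list (stopwords_set).
● It returns a list of word tokens with the stopwords removed.
● For the stopword list, use the default English stopword list that NLTK provides, loaded as a set.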
main.py
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from text import TEXT

nltk.download('stopwords')
nltk.download('punkt')

corpus = TEXT
tokenized_words = word_tokenize(corpus)

# Load the stopword list provided by NLTK as a set
stopwords_set = set(stopwords.words('english'))

def clean_by_stopwords(tokenized_words, stopwords_set):
    cleaned_words = []
    for word in tokenized_words:
        # Write your code here
        # Keep only the tokens that are not in the stopword set
        if word not in stopwords_set:
            cleaned_words.append(word)
    return cleaned_words

# Test code
clean_by_stopwords(tokenized_words, stopwords_set)
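One thing to keep in mind about the function above: it compares tokens by exact match, and the entries in NLTK's English stopword list are lowercase, so capitalized tokens such as "The" are not removed. The sketch below illustrates the difference. The helper clean_by_stopwords_ignore_case() and the small sample lists are hypothetical and are not part of the original exercise; they stand in for the word_tokenize() output and the NLTK stopword set so the example runs without any downloads.

```python
def clean_by_stopwords(tokenized_words, stopwords_set):
    """Return the tokens that are not in stopwords_set (exact match)."""
    return [word for word in tokenized_words if word not in stopwords_set]


def clean_by_stopwords_ignore_case(tokenized_words, stopwords_set):
    """Hypothetical variant: lowercase each token before the lookup."""
    return [word for word in tokenized_words if word.lower() not in stopwords_set]


# Hand-written stand-ins (assumptions) for the tokenizer output and the stopword set
sample_tokens = ["The", "military", "is", "the", "first", "called",
                 "for", "disaster", "relief", "."]
sample_stopwords = {"the", "is", "for"}

exact = clean_by_stopwords(sample_tokens, sample_stopwords)
ignore_case = clean_by_stopwords_ignore_case(sample_tokens, sample_stopwords)

print(exact)        # ['The', 'military', 'first', 'called', 'disaster', 'relief', '.']
print(ignore_case)  # ['military', 'first', 'called', 'disaster', 'relief', '.']
```

Whether to lowercase first depends on the task. If capitalization carries meaning later in the pipeline, the exact-match version from the exercise may be the better choice.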
text.py
TEXT = """After reading the comments for this movie, I am not sure whether I should be angry, sad or sickened. Seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on CNN reports about Abu-Gharib makes me wonder about the state of intellectual stimulation in the world. At the time I type this the number of people in the US military: 1.4 million on Active Duty with another almost 900,000 in the Guard and Reserves for a total of roughly 2.3 million. The number of people indicted for abuses at at Abu-Gharib: Currently less than 20 That makes the total of people indicted .00083% of the total military. Even if you indict every single military member that ever stepped in to Abu-Gharib, you would not come close to making that a whole number. The flaws in this movie would take YEARS to cover. I understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make commentary about the state of the military without an enemy to fight. In reality, the US military has been at its busiest when there are not conflicts going on. The military is the first called for disaster relief and humanitarian aid missions. When the tsunami hit Indonesia, devestating the region, the US military was the first on the scene. When the chaos of the situation overwhelmed the local governments, it was military leadership who looked at their people, the same people this movie mocks, and said make it happen. Within hours, food aid was reaching isolated villages. Within days, airfields were built, cargo aircraft started landing and a food distribution system was up and running. Hours and days, not weeks and months. Yes there are unscrupulous people in the US military. But then, there are in every walk of life, every occupation. 
But to see people on this website decide that 2.3 million men and women are all criminal, with nothing on their minds but thoughts of destruction or mayhem is an absolute disservice to the things that they do every day. One person on this website even went so far as to say that military members are in it for personal gain. Wow! Entry level personnel make just under $8.00 an hour assuming a 40 hour work week. Of course, many work much more than 40 hours a week and those in harm's way typically put in 16-18 hour days for months on end. That makes the pay well under minimum wage. So much for personal gain. I beg you, please make yourself familiar with the world around you. Go to a nearby base, get a visitor pass and meet some of the men and women you are so quick to disparage. You would be surprised. The military no longer accepts people in lieu of prison time. They require a minimum of a GED and prefer a high school diploma. The middle ranks are expected to get a minimum of undergraduate degrees and the upper ranks are encouraged to get advanced degrees.
"""
Execution result
Source: Codeit