Let's apply the preprocessing methods covered in this chapter to real data.
Preparing for the practice
The practice uses IMDb movie review data. IMDb is short for the Internet Movie Database, a database that holds more than 2 million movie-related entries.
Only 10 of the reviews on IMDb are used here. See the imdb.tsv file below.
imdb.tsv
review
0 "Watching Time Chasers, it obvious that it was made by a bunch of friends. Maybe they were sitting around one day in film school and said, \""Hey, let's pool our money together and make a really bad movie!\"" Or something like that. What ever they said, they still ended up making a really bad movie--dull story, bad script, lame acting, poor cinematography, bottom of the barrel stock music, etc. All corners were cut, except the one that would have prevented this film's release. Life's like that."
1 I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women. It is in black and white but saves the colour for one shocking shot. At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. Avoid.
2 Minor Spoilers In New York, Joan Barnard (Elvire Audrey) is informed that her husband, the archeologist Arthur Barnard (John Saxon), was mysteriously murdered in Italy while searching an Etruscan tomb. Joan decides to travel to Italy, in the company of her colleague, who offers his support. Once in Italy, she starts having visions relative to an ancient people and maggots, many maggots. After shootings and weird events, Joan realizes that her father is an international drug dealer, there are drugs hidden in the tomb and her colleague is a detective of the narcotic department. The story ends back in New York, when Joan and her colleague decide to get married with each other, in a very romantic end. Yesterday I had the displeasure of wasting my time watching this crap. The story is so absurd, mixing thriller, crime, supernatural and horror (and even a romantic end) in a non-sense way. The acting is the worst possible, highlighting the horrible performance of the beautiful Elvire Audrey. John Saxon just gives his name to the credits and works less than five minutes, when his character is killed. The special effects are limited to maggots everywhere. The direction is ridiculous. I lost a couple of hours of my life watching 'Assassinio al Cimitero Etrusco'. If you have the desire or curiosity of seeing this trash, choose another movie, go to a pizzeria, watch TV, go sleep, navigate in Internet, go to the gym, but do not waste your time like I did. My vote is two. Title (Brazil): 'O Mistério Etrusco' ('The Etruscan Mystery')
3 I went to see this film with a great deal of excitement as I was at school with the director, he was even a good friend of mine for a while. But sorry mate, this film stinks. I can only talk about what was wrong with the first half because that's when I walked out and went to the pub for a much needed drink: 1) someone's standing on a balcony about to jump and so you send a helicopter to shine a searchlight on them??? I don't think so - nothing would make them more likely to jump. 2) local radio doesn't send reporters to cover people about to attempt suicide - again for fear of pressuring them into jumping - or for fear of encouraging copy-cat instances. 3) whatever the circumstances, radio reporters don't do live broadcasts from the 10th floor of a tower block. Radio cars don't carry leads long enough to connect the microphone and headphones to the transmitter. 4) the stuck in the lift scene was utterly derivative 5) the acting and direction was almost non existent.I could go on, but I won't.
4 "Yes, I agree with everyone on this site this movie is VERY VERY bad. To even call this a movie is an insult to all movies ever made. It's 40 minutes long. Someone compares this movie to an after school special. B-I-N-G-O! That describes is perfectly. The packaging for this movie intentionally is misleading. For example, the title of this movie should describe the movie. Rubberface??? That should be the first hint. It was retitled with a new package of some goofy face Jim probably made in his stand-up days. I was hoping for more stand-up from Jim. If you like Jim now as an actor. You would love him in his stand up days. Still trying to locate the Rodney Dangerfield Young Comedians Special from HBO that featured Jim in his early career days. It isn't even mentioned on this site. I'd love to find anything Jim did stand-up wise. Also Jim Carrey is a supporting actor in this movie. The main character is VERY VERY annoying. She is some girl lacking self confidence but yet wants to be a stand up comedian. Jim is there to say lines like \""That's Funny Janet\"" and \""You really are talented\"". And honestly she is terrible really terrible. And the movie is terrible. Beware of false advertising and a really bad movie."
5 "Jennifer Ehle was sparkling in \""Pride and Prejudice.\"" Jeremy Northam was simply wonderful in \""The Winslow Boy.\"" With actors of this caliber, this film had to have a lot going for it. Even those who were critical of the movie spoke of the wonderful sequences involving these two. I was eager to see it. It is with bitter disappointment, however, that I must report that this flick is a piece of trash. The scenes between Ehle and Northam had no depth or tenderness or real passion; they consisted of hackneyed and unsubtle latter-day cinematic lust--voracious open-mouthed kissing and soft-porn humping. Lust can be entertaining if it's done with originality; this was tasteless and awful. Ehle and Northam have sullied their craft; they should be ashamed. As for the modern part of the romance, I was unnerved by the effeminate appearance of the male lead. Aren't there any masculine men left in Hollywood? The plot was kind of interesting; with a better script and a more imaginative director, it might have worked. 1/10"
6 Amy Poehler is a terrific comedian on Saturday Night Live, but her role in this movie doesn't give her anything to work with. Her character, a publisher's representative guiding a new author on a book tour, is mean, not funny. Susan Sarandon plays the author's mother who is involved with the sadistic gym teacher (Billy Bob Thorton) the author had when he was a chubby junior high student. Unfortunately her role doesn't require a talented actress. The funniest thing is the way she looks in the awful gown she wears as the queen of the corn festival. There is no explanation of why the corn queen is old enough to have a grown son. The plot is the stale one of an author who wrote a best selling self-help book and then adopts behavior that contradicts his advice. Still, it is not the worst movie I've ever seen, and I didn't erase it before watching it.
7 "A plane carrying employees of a large biotech firm--including the CEO's daughter--goes down in thick forest in the Pacific Northwest. When the search and rescue mission is called off, the CEO, Harlan Knowles (Lance Henriksen), puts together a small ragtag group to execute their own search and rescue mission. But just what is Knowles searching for and trying to rescue, and just what is following and watching them in the woods? Oy, what a mess this film was! It was a shame, because for one, it stars Lance Henriksen, who is one of my favorite modern genre actors, and two, it could have easily been a decent film. It suffers from two major flaws, and they're probably both writer/director Jonas Quastel's fault--this film (which I'll be calling by its aka of Sasquatch) has just about the worst editing I've ever seen next to Alone in the Dark (2005), and Quastel's constant advice for the cast appears to have been, \""Okay, let's try that again, but this time I want everyone to talk on top of each other, improvise non-sequiturs and generally try to be as annoying as possible\"".The potential was there. Despite the rip-off aspects (any material related to the plane crash was obviously trying to crib The Blair Witch Project (1999) and any material related to the titular monster was cribbing Predator (1987)), Ed Wood-like exposition and ridiculous dialogue, the plot had promise and potential for subtler and far less saccharine subtexts. The monster costume, once we actually get to see it, was more than sufficient for my tastes. The mixture of character types trudging through the woods could have been great if Quastel and fellow writer Chris Lanning would have turned down the stereotype notch from 11 to at least 5 and spent more time exploring their relationships. The monster's \""lair\"" had some nice production design, specifically the corpse decorations ala a more primitive Jeepers Creepers (2001). 
If it had been edited well, there were some scenes with decent dialogue that could have easily been effective. But the most frightening thing about Sasquatch is the number of missteps made: For some reason, Quastel thinks it's a good idea to chop up dialogue scenes that occur within minutes of each other in real time so that instead we see a few lines of scene A, then a few lines of scene B, then back to A, back to B, and so on. For some reason, he thinks it's a good idea to use frequently use black screens in between snippets of dialogue, whether we need the idea of an unspecified amount of time passing between irrelevant comments or whether the irrelevant comments seem to be occurring one after the other in time anyway. For some reason, he doesn't care whether scenes were shot during the morning, afternoon, middle of the night, etc. He just cuts to them at random. For that matter, the scenes we're shown appear to be selected at random. Important events either never or barely appear, and we're stuck with far too many pointless scenes. For some reason, he left a scene about cave art in the film when it either needs more exposition to justify getting there, or it needs to just be cut out, because it's not that important (the monster's intelligence and \""humanity\"" could have easily been shown in another way). For some reason, there is a whole character--Mary Mancini--left in the script even though she's superfluous. For some reason we suddenly go to a extremely soft-core porno scene, even though the motif is never repeated again. For some reason, characters keep calling Harlan Knowles \""Mr. H\"", like they're stereotypes of Asian domestics. For some reason, Quastel insists on using the \""Blurry Cam\"" and \""Distorto-Cam\"" for the monster attack scenes, even though the costume doesn't look that bad, and it would have been much more effective to put in some fog, a subtle filter, or anything else other than bad cinematography. I could go on, but you get the idea. 
I really wanted to like this film better than I did--I'm a Henriksen fan, I'm intrigued by the subject, I loved the setting, I love hiking and this is basically a hiking film on one level--but I just couldn't. Every time I thought it was \""going to be better from this point until the end\"", Quastel made some other awful move. In the end, my score was a 3 out of 10."
8 A well made, gritty science fiction movie, it could be lost among hundreds of other similar movies, but it has several strong points to keep it near the top. For one, the writing and directing is very solid, and it manages for the most part to avoid many sci-fi cliches, though not all of them. It does a good job of keeping you in suspense, and the landscape and look of the movie will appeal to sci-fi fans. If you're looking for a masterpiece, this isn't it. But if you're looking for good old fashioned post-apoc, gritty future in space sci-fi, with good suspense and special effects, then this is the movie for you. Thoroughly enjoyable, and a good ending.
9 "Incredibly dumb and utterly predictable story of a rich teen girl who, not given love by her parents, starts a girl gang. They rob gas stations, rape guys (!!!) and kill a policeman. All the \""teenagers\"" in this film are easily in their late 20s/early 30s, the acting is all horrible and the script has every cliche imaginable with hilarious dialogue--it comes as no surprise that it was written by the immortal Ed Wood Jr.! Worth seeing for laughs. Best lines--\""They're shooting back!\"" and \""It ain't supposed to be like this.\"""
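tsv (tab-separated values) is a file format that separates fields with tabs. The csv (comma-separated values) format we have used so far separates fields with commas. Natural language data often contains commas, however, so using the comma as the field separator can break the shape of the corpus unless every field is quoted carefully. That is why natural language data is often stored in tsv format.
A tsv file is loaded with the same read_csv() function. The only difference is that the delimiter parameter must be set to '\t'.
import pandas as pd
df = pd.read_csv('imdb.tsv', delimiter='\t')
The data looks like this.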
df
In practice, natural language data is usually handled as a Pandas data frame. This lesson does the same: each corpus is stored in one row of a Pandas data frame for the rest of the practice.
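To make the point about tabs and commas concrete, here is a minimal sketch. It assumes a tiny two-row sample held in memory (the `sample` string is made up for illustration and is not the real imdb.tsv) and shows that commas inside a review stay intact when the tab is the delimiter.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for imdb.tsv (not the real file).
sample = (
    "review\n"
    "Dull story, bad script, lame acting.\n"
    "A well made, gritty science fiction movie.\n"
)

# With the tab as delimiter, commas are treated as ordinary characters,
# so each line stays a single field in the 'review' column.
df_demo = pd.read_csv(io.StringIO(sample), delimiter="\t")

print(df_demo.shape)
print(df_demo["review"][0])
```

`delimiter` is an alias of `sep` in `read_csv()`, so `sep='\t'` should behave the same way.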
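Case normalization
The first normalization step is to unify the case of the corpus. As mentioned in the earlier normalization lesson, this usually means converting uppercase letters to lowercase. Let's do that now.
df['review'] = df['review'].str.lower()
df['review'] is a Pandas Series. To convert the strings stored in a Series to lowercase, use str.lower().
Let's check that the lowercase conversion worked.
print(df['review'][0])
Every word has been converted to lowercase.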
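As a small illustration of what `str.lower()` does on a Series, the sketch below uses a made-up three-element Series (an assumption, not the IMDb data). It also shows that a missing value is passed through as missing instead of raising an error, which a plain Python `.lower()` call on `None` would do.

```python
import pandas as pd

# Hypothetical sample values; the second entry is deliberately missing.
reviews = pd.Series(["Watching Time Chasers", None, "VERY VERY bad"])

# .str.lower() is applied element by element; missing values stay missing.
lowered = reviews.str.lower()

print(lowered.tolist())
print(lowered.isna().tolist())
```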
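Word tokenization
Next, let's tokenize the whole corpus into words. The word_tokenize() function has to be applied to every row of df['review'], and apply() makes this easy. apply() takes a function as its parameter and calls it on every element of the series. It is used as follows.
from nltk.tokenize import word_tokenize  # assumes the NLTK tokenizer data from the earlier lessons is installed
df['word_tokens'] = df['review'].apply(word_tokenize)
Let's check that the word tokenization worked.
print(df['word_tokens'][0])
The text has been tokenized.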
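The way `apply()` calls a function once per row can be sketched without NLTK. The example below is only an illustration: `simple_tokenize` is a hypothetical stand-in that splits on whitespace, whereas `word_tokenize` also separates punctuation, so its output on real reviews will differ.

```python
import pandas as pd


def simple_tokenize(text):
    # Stand-in for word_tokenize: splits on whitespace only.
    return text.split()


# Hypothetical two-row data frame, not the IMDb data.
demo = pd.DataFrame({"review": ["a really bad movie", "avoid."]})

# apply() passes each review string to the function and collects the results.
demo["word_tokens"] = demo["review"].apply(simple_tokenize)

print(demo["word_tokens"][0])
print(demo["word_tokens"][1])
```

The function is passed by name, without parentheses, because `apply()` is the one that calls it.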
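Data cleaning
We learned three ways to clean data: by frequency, by word length, and with a stopword set. Let's use all of them. For each corpus, we remove words that appear once or less, words whose length is 2 or less, and words in the default stopword list provided by NLTK.
This is easy to do with the functions clean_by_freq(), clean_by_len(), and clean_by_stopwords() that we wrote in the preprocess.py file. Before importing them, be sure to run the code below first.
%load_ext autoreload
%autoreload 2
When a notebook (.ipynb) imports a hand-written Python module (.py), changes made to the module afterwards are not picked up automatically. Without this setting, the Jupyter Notebook kernel would have to be restarted every time preprocess.py is modified. Running the code above first removes that hassle.
Now let's apply the imported functions to df['word_tokens'] with apply().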
# Automatically reload .py modules when they are modified
%load_ext autoreload
%autoreload 2
from nltk.corpus import stopwords
# The module name must match the file name (preprocess.py)
from preprocess import clean_by_freq
from preprocess import clean_by_len
from preprocess import clean_by_stopwords
stopwords_set = set(stopwords.words('english'))
df['cleaned_tokens'] = df['word_tokens'].apply(lambda x: clean_by_freq(x, 1))
df['cleaned_tokens'] = df['cleaned_tokens'].apply(lambda x: clean_by_len(x, 2))
df['cleaned_tokens'] = df['cleaned_tokens'].apply(lambda x: clean_by_stopwords(x, stopwords_set))
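A new notation appeared when the functions were applied to the data frame. The part written in the form lambda parameters: expression is called a lambda function.
For example, suppose there is a function that adds the two numbers it receives as parameters.
def plus(a, b):
    return a + b
This function can be written as a lambda function as follows.
lambda a, b: a + b
A lambda function lets a simple function that would otherwise need a multi-line definition be written on a single line.
Consider the example used above as well. The lambda function below is a function that calls clean_by_freq(). The token list stored in each row is passed in as the parameter x, and the function returns the result of clean_by_freq(x, 1).
lambda x: clean_by_freq(x, 1)
apply() calls the function it is given with each value as the only argument. When a function needs two or more arguments, as clean_by_freq() does, it can be wrapped in a lambda expression that supplies the remaining arguments.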
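The lambda wrapper is one of several ways to pass extra arguments through `apply()`. The sketch below compares it with the `args` parameter of `Series.apply()` and with `functools.partial`. The `clean_by_len` here is a hypothetical stand-in written for this example (the real one lives in preprocess.py and is not shown in this post), and the data is made up.

```python
from functools import partial

import pandas as pd


def clean_by_len(tokens, cut_off_length):
    # Hypothetical stand-in: keep tokens longer than cut_off_length.
    return [t for t in tokens if len(t) > cut_off_length]


# Hypothetical token lists, one per row.
tokens = pd.Series([["a", "bad", "movie"], ["it", "stinks"]])

# 1) Lambda wrapper, as used in this post.
via_lambda = tokens.apply(lambda x: clean_by_len(x, 2))

# 2) Extra positional arguments passed through apply() itself.
via_args = tokens.apply(clean_by_len, args=(2,))

# 3) A partially applied function with cut_off_length fixed in advance.
via_partial = tokens.apply(partial(clean_by_len, cut_off_length=2))

print(via_lambda.tolist())
```

All three calls should give the same result. The lambda form keeps the arguments visible at the call site, which is probably why it is the one used here.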
Let's check the result as well.
print(df['cleaned_tokens'][0])
The tokens have been cleaned by length, frequency, and stopwords. Only the words that actually matter for the analysis remain.
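Since preprocess.py itself is not shown in this post, the following is only a sketch of how the three cleaning functions might look, based on the rules described above (frequency at or below a cutoff, length at or below a cutoff, membership in a stopword set). The parameter names and the tiny stopword set are assumptions made for illustration.

```python
from collections import Counter


def clean_by_freq(tokens, cut_off_count):
    # Drop tokens that appear cut_off_count times or fewer in this token list.
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] > cut_off_count]


def clean_by_len(tokens, cut_off_length):
    # Drop tokens whose length is cut_off_length or shorter.
    return [t for t in tokens if len(t) > cut_off_length]


def clean_by_stopwords(tokens, stop_words_set):
    # Drop tokens that are in the stopword set.
    return [t for t in tokens if t not in stop_words_set]


# Hypothetical token list and stopword set (NLTK's real list is much longer).
tokens = ["it", "was", "a", "really", "bad", "movie",
          "a", "really", "bad", "movie", "it", "was", "dull"]

step1 = clean_by_freq(tokens, 1)            # removes "dull", which appears once
step2 = clean_by_len(step1, 2)              # removes "it" and "a"
step3 = clean_by_stopwords(step2, {"was"})  # removes "was"

print(step3)
```

Note that with a frequency cutoff applied per review, any word that appears only once in a short review is removed, so the cutoff is worth tuning to the length of the texts.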
Stemming
Finally, let's normalize with stemming. We apply the stemming_by_porter() function written in the stemming post to df['cleaned_tokens'].
from preprocess import stemming_by_porter
df['stemmed_tokens'] = df['cleaned_tokens'].apply(stemming_by_porter)
Let's check the result as well.
print(df['stemmed_tokens'][0])
The earlier post noted that stemming can produce words that are not in the dictionary, so it should be used with care. Here too, non-words such as 'realli' and 'movi' appear in the output. Because such non-words can appear, decide carefully whether stemming actually helps the analysis of the corpus at hand before using it. Please keep this in mind.
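For reference, a function like stemming_by_porter() could be sketched with NLTK's `PorterStemmer` as below. This is an assumed implementation for illustration, not necessarily the one from the stemming post, and the token list is made up.

```python
from nltk.stem import PorterStemmer


def stemming_by_porter(tokens):
    # Hypothetical implementation: stem every token with the Porter algorithm.
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


# Hypothetical tokens chosen to show both a clean stem and non-word stems.
stems = stemming_by_porter(["really", "movie", "watching"])
print(stems)
```

The Porter stemmer needs no extra data download, so this runs with a plain NLTK installation.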
Source: Codeit