Waiting for Godot

So I wrote a little script again to scrape content from Samuel Beckett’s webpage.

ESTRAGON: Nothing to be done.
VLADIMIR: I’m beginning to come round to that opinion. All my life I’ve tried to put it from me, saying Vladimir, be reasonable, you haven’t yet tried everything. And I resumed the struggle. So there you are again.
ESTRAGON: Am I?
VLADIMIR: I’m glad to see you back. I thought you were gone forever.
ESTRAGON: Me too.
VLADIMIR: Together again at last! We’ll have to celebrate this. But how? Get up till I embrace you.
ESTRAGON: Not now, not now.
VLADIMIR: May one inquire where His Highness spent the night?
ESTRAGON: In a ditch.
VLADIMIR: A ditch! Where?
ESTRAGON: Over there.

https://github.com/thetobysiu/Waiting-for-Godot-Processed-Text

There are two acts in total, and therefore two pages.

http://www.samuel-beckett.net/Waiting_for_Godot_Part1.html

http://www.samuel-beckett.net/Waiting_for_Godot_Part2.html

The original HTML has some missing tags and the wrong encoding. Chrome seems to fix both when parsing it, so I saved the versions fixed by Chrome and changed the encoding to UTF-8. They are saved as wfg1.html and wfg2.html.

The two saved HTML pages are then parsed with BeautifulSoup, and I wrote some code to eliminate the non-dialogue parts. Luckily, they are either italic or wrapped in a blockquote tag.
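
As a rough sketch of that filtering idea, the snippet below parses a made-up fragment (not the real page markup) with Python's built-in `html.parser` and removes every `<i>` and `<blockquote>` element before reading the text. The sample HTML and the function name `strip_stage_directions` are illustrative assumptions. The actual script further down filters tag by tag rather than deleting nodes.

```python
from bs4 import BeautifulSoup

def strip_stage_directions(html):
    """Return the text of `html` with <i> and <blockquote> content removed."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all accepts a list of tag names; extract() detaches a tag from the tree
    for tag in soup.find_all(['i', 'blockquote']):
        tag.extract()
    # collapse the whitespace left behind by the removed tags
    return ' '.join(soup.get_text().split())

# made-up fragment, only for illustration
sample = ('<dd>Nothing to be done. <i>(He gives up.)</i></dd>'
          '<blockquote>Enter Vladimir.</blockquote>')
print(strip_stage_directions(sample))
```

Deleting nodes like this is simpler, but it would also hide cases where a whole line should be skipped, which is probably why checking each tag individually is the safer route for the real pages.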

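import re

from bs4 import NavigableString

# `main` is the parsed element whose direct children are the <dt>/<dd> tags
# (how it is selected from the BeautifulSoup object is not shown here)
with open('wfg2.txt', 'w', encoding='utf-8') as f: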
    current_text = ''
    current_speaker = ''
    unwanted_tags = ['i', 'blockquote', 'ul']
    for tag in main:
        if tag.name == 'dt':
            current_text = ''
            current_speaker = tag.b.text

            # gather the text of the <dd> tags that follow this speaker
            temp_tag = tag.find_next_sibling('dd')
            while temp_tag and temp_tag.name == 'dd':
                if not temp_tag.find_all('div') and not temp_tag.find_all('font', {'face': 'arial'}):
                    for inner_tag in temp_tag:
                        if type(inner_tag) is NavigableString:
                            current_text += inner_tag
                        # italic, blockquote and list content is not conversation
                        elif inner_tag.name not in unwanted_tags and not inner_tag.find_all(unwanted_tags):
                            current_text += inner_tag.text
                temp_tag = temp_tag.next_sibling

            if tag.find('img'):
                # special case: the spoken text sits inside the <dt> itself, so create the entry directly
                current_text = ''.join([inner_tag if type(inner_tag) is NavigableString else inner_tag.text for inner_tag in tag if inner_tag.name != 'b'])

            current_text = current_text.replace('\n', '').replace('. . .', '…').replace('#', '')
            # clear anything inside bracket
            current_text = re.sub(r'\((.*?)\)', '', current_text)
            # clear remaining brackets
            current_text = re.sub(r'[\(\)]', '', current_text)
            current_text = re.sub(r' +', ' ', current_text)
            current_text = current_text.replace('… …', '…').strip()
            if current_text:
                f.write(current_speaker + ' ' + current_text + '\n')
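
The clean-up steps at the end of the loop can be tried on their own. The sketch below wraps the same replacements and regular expressions in a helper, here called `clean_line` (a name made up for this example), so they can be checked against a few strings without parsing any HTML.

```python
import re

def clean_line(text):
    """Apply the same normalisation steps as the script above to one line."""
    text = text.replace('\n', '').replace('. . .', '…').replace('#', '')
    text = re.sub(r'\((.*?)\)', '', text)  # drop "(...)" stage directions
    text = re.sub(r'[\(\)]', '', text)     # drop any unmatched bracket left over
    text = re.sub(r' +', ' ', text)        # collapse runs of spaces
    return text.replace('… …', '…').strip()

print(clean_line('Not now, (wearily) not now.'))
print(clean_line('Well . . . . . . yes'))
```

The order matters: spaces are collapsed only after the brackets are removed, because deleting a bracketed remark in the middle of a sentence leaves two spaces behind.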

So the logic is this:

When the loop hits a <dt> tag, it must be the speaker. The <dd> tags that follow belong to the most recently found <dt>. Therefore, a while loop appends the text of every <dd> tag found after the <dt> to that speaker’s line.
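
Here is a stripped-down sketch of that pairing on a toy fragment. The markup and the helper name `group_lines` are assumptions for illustration, and it leaves out all the filtering of stage directions. It also steps from tag to tag with `find_next_sibling(True)`, which skips text nodes, whereas the script above uses `next_sibling`.

```python
from bs4 import BeautifulSoup

def group_lines(html):
    """Pair each <dt> speaker with the text of the <dd> tags right after it."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for dt in soup.find_all('dt'):
        parts = []
        # True matches any tag, so this steps to the next sibling tag
        sibling = dt.find_next_sibling(True)
        while sibling is not None and sibling.name == 'dd':
            parts.append(sibling.get_text(strip=True))
            sibling = sibling.find_next_sibling(True)
        lines.append((dt.get_text(strip=True), ' '.join(parts)))
    return lines

# toy fragment, not the real page structure
sample = (
    '<dl>'
    '<dt><b>ESTRAGON:</b></dt><dd>In a ditch.</dd>'
    '<dt><b>VLADIMIR:</b></dt><dd>A ditch!</dd><dd>Where?</dd>'
    '</dl>'
)
for speaker, text in group_lines(sample):
    print(speaker, text)
```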

Of course there is an exception. One particular line contains an image, and its text sits inside the <dt> tag as well. That’s why, if an <img> tag is found inside a <dt> tag, it must be that strange sentence.
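
To make the exception concrete, here is a small sketch with an invented `<dt>` (the image file name and the sentence are placeholders, not the real line). The helper `text_from_dt`, also a made-up name, keeps everything inside the `<dt>` except the bold speaker label.

```python
from bs4 import BeautifulSoup, NavigableString

def text_from_dt(dt):
    """Collect the text inside a <dt>, skipping the <b> speaker label."""
    parts = []
    for child in dt.children:
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif child.name != 'b':
            parts.append(child.get_text())
    return ''.join(parts).strip()

# invented markup standing in for the one odd line
html = '<dt><b>VLADIMIR:</b> <img src="x.gif"> Placeholder line kept inside the dt.</dt>'
dt = BeautifulSoup(html, 'html.parser').dt
if dt.find('img'):
    print(dt.b.get_text(), text_from_dt(dt))
```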

I just tested and tested until it was correct. And I successfully cleaned this text. Better than vacuuming my home.