So I wrote a little script again to scrape content from Samuel Beckett’s webpage.
ESTRAGON: Nothing to be done.
VLADIMIR: I’m beginning to come round to that opinion. All my life I’ve tried to put it from me, saying Vladimir, be reasonable, you haven’t yet tried everything. And I resumed the struggle. So there you are again.
ESTRAGON: Am I?
VLADIMIR: I’m glad to see you back. I thought you were gone forever.
ESTRAGON: Me too.
VLADIMIR: Together again at last! We’ll have to celebrate this. But how? Get up till I embrace you.
ESTRAGON: Not now, not now.
VLADIMIR: May one inquire where His Highness spent the night?
ESTRAGON: In a ditch.
VLADIMIR: A ditch! Where?
ESTRAGON: Over there.
https://github.com/thetobysiu/Waiting-for-Godot-Processed-Text
There are a total of 2 acts and therefore 2 pages.
http://www.samuel-beckett.net/Waiting_for_Godot_Part1.html
http://www.samuel-beckett.net/Waiting_for_Godot_Part2.html
The original HTML has some missing tags and the wrong encoding. Chrome seems to fix this when parsing the pages, so I saved the versions fixed by Chrome and changed the encoding to UTF-8. They are saved as wfg1.html and wfg2.html.
The two saved HTML pages are then parsed with BeautifulSoup, and I wrote some code to eliminate the non-dialogue parts. Luckily they are either italic or wrapped in a blockquote tag.
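If you would rather do the re-encoding step in Python than in an editor, a minimal sketch could look like the one below. The helper name `reencode_to_utf8` is made up for illustration, and `cp1252` is only an assumed source encoding, since the post does not say what the original pages were served as.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode('utf-8')


# Made-up sample: a curly apostrophe as cp1252 stores it (a single 0x92 byte).
# cp1252 is an assumption here, not something the original pages confirm.
sample = 'I’m glad'.encode('cp1252')
fixed = reencode_to_utf8(sample, 'cp1252')
print(fixed.decode('utf-8'))
```

For a whole file, the same idea applies: read the bytes, pass them through the helper, and write the result back out.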
import re

from bs4 import NavigableString

# `main` is the parsed element whose children are the <dt>/<dd> entries
# (it comes from the BeautifulSoup parse of wfg2.html, not shown here).
with open('wfg2.txt', 'w', encoding='utf-8') as f:
    current_text = ''
    current_speaker = ''
    unwanted_tags = ['i', 'blockquote', 'ul']
    for tag in main:
        if tag.name == 'dt':
            current_text = ''
            current_speaker = tag.b.text
            # italic is not conversation
            temp_tag = tag.find_next_sibling('dd')
            # collect every <dd> that directly follows this <dt>
            while temp_tag and temp_tag.name == 'dd':
                if not temp_tag.find_all('div') and not temp_tag.find_all('font', {'face': 'arial'}):
                    for inner_tag in temp_tag:
                        if type(inner_tag) is NavigableString:
                            current_text += inner_tag
                        else:
                            # skip children that are, or contain, an unwanted tag
                            if (all([inner_tag.name != tag_name for tag_name in unwanted_tags])
                                    and all([not inner_tag.find_all(tag_name) for tag_name in unwanted_tags])):
                                current_text += inner_tag.text
                temp_tag = temp_tag.next_sibling
            if tag.find('img'):
                # directly create an entry from the <dt> itself, minus the speaker name
                current_text = ''.join([inner_tag if type(inner_tag) is NavigableString else inner_tag.text for inner_tag in tag if inner_tag.name != 'b'])
            current_text = current_text.replace('\n', '').replace('. . .', '…').replace('#', '')
            # clear anything inside brackets
            current_text = re.sub(r'\((.*?)\)', '', current_text)
            # clear remaining brackets
            current_text = re.sub(r'[\(\)]', '', current_text)
            # collapse repeated spaces
            current_text = re.sub(r' +', ' ', current_text)
            current_text = current_text.replace('… …', '…').strip()
            if current_text:
                f.write(current_speaker + ' ' + current_text + '\n')
So the logic is this:
When it’s a <dt> tag, it must be the speaker. The <dd> tags that follow belong to the previously found <dt>. Therefore, there is a while loop that appends the text of every <dd> tag found after the <dt> to that speaker’s line.
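To see the basic rule on something tiny, here is a rough sketch that groups <dd> text under the preceding <dt>. It is not the script above: it swaps in the standard library's html.parser for BeautifulSoup so it runs without installing anything, it simply ignores text inside <i> and <blockquote> rather than dropping whole child elements, and the class name, function name and sample markup are all made up for illustration.

```python
from html.parser import HTMLParser

SKIPPED = {'i', 'blockquote'}  # stage directions, per the post


class DialogueParser(HTMLParser):
    """Group <dd> text under the speaker named in the preceding <dt>."""

    def __init__(self):
        super().__init__()
        self.entries = []    # one [speaker, text] pair per <dt>
        self.open_tags = []  # tags currently open, outermost first

    def handle_starttag(self, tag, attrs):
        if tag == 'dt':
            self.entries.append(['', ''])
        elif tag == 'dd' and self.entries:
            self.entries[-1][1] += ' '  # keep consecutive <dd> texts apart
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the tag and anything left open inside it.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if not self.entries or SKIPPED & set(self.open_tags):
            return
        if 'dt' in self.open_tags:
            self.entries[-1][0] += data
        elif 'dd' in self.open_tags:
            self.entries[-1][1] += data


def parse_dialogue(html):
    """Return a list of (speaker, text) tuples with whitespace normalised."""
    parser = DialogueParser()
    parser.feed(html)
    parser.close()
    return [(speaker.strip(), ' '.join(text.split()))
            for speaker, text in parser.entries]


# Made-up markup in the same <dt>/<dd> shape the post describes.
sample = (
    '<dl>'
    '<dt><b>ESTRAGON:</b></dt>'
    '<dd><i>(made-up direction)</i> In a ditch.</dd>'
    '<dt><b>VLADIMIR:</b></dt>'
    '<dd>A ditch!</dd><dd>Where?</dd>'
    '</dl>'
)
for speaker, text in parse_dialogue(sample):
    print(speaker, text)
```

Each <dt> starts a new entry, and every <dd> until the next <dt> is added to it, which is the same idea as the while loop in the script.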
Of course there is an exception. One particular line contains an image inside it and the text is actually inside the <dt> tag as well. That’s why if an <img> tag is found inside a <dt> tag, it must be that strange sentence.
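The cleanup lines at the end of the loop are easy to try on their own. Below is a small sketch that wraps the same replacements, in the same order, in a helper. The name `clean_line` and the sample strings are made up for illustration.

```python
import re


def clean_line(text: str) -> str:
    """Apply the script's cleanup steps to one line of dialogue."""
    text = text.replace('\n', '').replace('. . .', '…').replace('#', '')
    text = re.sub(r'\((.*?)\)', '', text)  # drop anything inside parentheses
    text = re.sub(r'[()]', '', text)       # drop stray unmatched parentheses
    text = re.sub(r' +', ' ', text)        # collapse runs of spaces
    return text.replace('… …', '…').strip()


# Made-up inputs that exercise each step.
print(clean_line('A ditch! (He points.)  Where?\n'))
print(clean_line('Well . . . (pause) . . . yes'))
```

The order matters a little: the spaces left behind by a removed stage direction have to be collapsed before the doubled ellipsis can be spotted and merged.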
I just tested and tested until it was correct, and I successfully cleaned this text. Better than vacuuming my home.