Got a file with some HTML you want to remove?
All you need to do is parse the file with a Python script, using the Python library BeautifulSoup.
If that sounds unfamiliar to you, you should watch Corey Schafer's beginner video “Python Tutorial: Web Scraping with BeautifulSoup and Requests”.
If you are already familiar with Beautiful Soup, it is basically just a matter of calling soup.get_text().
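To make that concrete, here is a minimal sketch of what get_text() does to a small piece of HTML. The sample strings are made up for illustration; the separator and strip arguments are optional and only matter when you want control over whitespace between elements.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, just for illustration
html = "<p>Hello, <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")

# get_text() returns only the text, without any tags
text = soup.get_text()
print(text)

# Optional: put a space between elements and trim surrounding whitespace
spaced = BeautifulSoup("<p>One</p><p>Two</p>", "html.parser").get_text(
    separator=" ", strip=True
)
print(spaced)
```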
Requirements
- Install Python 3 or newer…
- Install Beautiful Soup 4
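If you are not sure whether both requirements are in place, a small check like the one below can tell you. It is only a sketch: check_requirements is a hypothetical helper name, not part of Python or Beautiful Soup, and it only looks for the bs4 package without importing it.

```python
import importlib.util
import sys


def check_requirements():
    """Return a list of problems found; an empty list means everything is in place."""
    problems = []
    # The script needs Python 3 or newer
    if sys.version_info < (3,):
        problems.append("Python 3 or newer is required")
    # find_spec returns None when the top-level package cannot be found
    if importlib.util.find_spec("bs4") is None:
        problems.append("Beautiful Soup 4 is missing (the package is imported as bs4)")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```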
Python script
Here is the whole script we are using. You will need to edit some of it.
- Replace C:/some_folder with the folder containing the files you want to remove HTML from.
- Replace ass with the file extension of the files you want to remove HTML from.
- Run the script from the same folder as the files.
# Import os so we can list the folder and build file paths
import os
# Import Beautiful Soup 4 so we can parse HTML
from bs4 import BeautifulSoup

# Set the path where the target files are located
path = r'C:/some_folder'
# Set the file extension to look for
ext = 'ass'

# Start a loop - (for each file in the path do...)
for filename in os.listdir(path):
    # Only process files with the chosen extension
    if filename.endswith('.' + ext):
        # Get the file path including the file name and the file extension
        fullpath = os.path.join(path, filename)
        # Get the file name without the extension
        basename = os.path.splitext(filename)[0]
        # Parse the file with Beautiful Soup
        with open(fullpath) as source:
            soup = BeautifulSoup(source, 'html.parser')
        # Extract the text, leaving the HTML tags behind
        text = soup.get_text()
        # Make a new file where the content can be saved
        # ("x" mode raises FileExistsError if the file already exists)
        with open(basename + '-new.' + ext, 'x') as f:
            # Write the content to the file
            f.write(text)
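If you want to run this more than once, or on several folders, the same idea can be wrapped in a function. The following is a sketch under a few assumptions that are not part of the script above: strip_html_in_folder and its suffix parameter are hypothetical names, the files are assumed to be UTF-8 encoded, and the new files are written next to the originals instead of in the current folder. Files that already have a "-new" copy are skipped rather than causing an error.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_html_in_folder(folder, ext, suffix="-new"):
    """Write an HTML-free copy of every *.ext file in folder; return the new paths."""
    written = []
    for source in sorted(Path(folder).glob(f"*.{ext}")):
        # Skip copies produced by an earlier run
        if source.stem.endswith(suffix):
            continue
        target = source.with_name(f"{source.stem}{suffix}.{ext}")
        # Do not overwrite an existing copy
        if target.exists():
            continue
        # Assumption: the files are UTF-8 encoded
        soup = BeautifulSoup(source.read_text(encoding="utf-8"), "html.parser")
        target.write_text(soup.get_text(), encoding="utf-8")
        written.append(target)
    return written
```

Calling strip_html_in_folder(r'C:/some_folder', 'ass') would then process the same files as the script, and calling it a second time does nothing because the copies already exist.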
That’s it! 👍