Remove HTML from files with Python

Got a file with some HTML you want to remove?

All you need to do is parse the file with a Python script, using the Python library BeautifulSoup.

If that sounds unfamiliar, watch Corey Schafer's beginner video “Python Tutorial: Web Scraping with BeautifulSoup and Requests”.

If you are familiar with Beautiful Soup, it is basically just a matter of calling soup.get_text().
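
As a minimal sketch of what get_text() does, here it is applied to a small made-up HTML string (the snippet is only an illustration, not one of your files):

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, for illustration only
html = '<p>Hello <b>world</b>!</p>'

# Parse the snippet with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

# get_text() returns only the text, with all tags removed
text = soup.get_text()
print(text)  # Hello world!
```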

Requirements

Install Python 3 or newer…
Install Beautiful Soup 4

Python script

Here is the whole script we are using. You will need to edit some of it.

Replace C:/some_folder with the folder containing the files you want to remove HTML from.
Replace ass with the file type extension for the files you want to remove HTML from.
Run the script from the same folder as the files. The new files are saved in the folder you run the script from.

# Import OS so we can list the files in a folder and build file paths
import os

# Import Beautiful Soup 4 so we can parse HTML
from bs4 import BeautifulSoup

# Set the path where the target files are located
path = r'C:/some_folder'

# Set the file extension to look for
ext = 'ass'

# Start a loop - (for each file in the path do...)
for filename in os.listdir(path):

	# Only handle files with the chosen extension (the dot keeps 'ass' from matching e.g. 'class')
	if filename.endswith('.' + ext):

		# Get the file path including the file and the file extension
		fullpath = os.path.join(path, filename)

		# Get the file name without the extension
		filename = os.path.splitext(os.path.basename(filename))[0]
		
		# Parse the file with Beautiful Soup (the file is closed when the block ends)
		with open(fullpath) as source:
			soup = BeautifulSoup(source, 'html.parser')
		text = soup.get_text()

		# Make a new file where the content can be saved
		# ("x" raises an error if the file already exists)
		with open(filename + '-new.' + ext, "x") as f:

			# Write the content to the file; it is closed automatically
			f.write(text)
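
If you would rather not depend on which folder the script is run from, the same steps can be wrapped in a function. This is only a sketch under a few assumptions: the function name strip_html_from_folder is made up for this example, the files are assumed to be UTF-8 text, and the new files are saved next to the originals and overwritten on a second run.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_html_from_folder(folder, ext, encoding='utf-8'):
    """Write an HTML-free '-new' copy of every matching file in folder.

    Hypothetical helper, not part of the script above.
    Assumes the files are text files in the given encoding.
    """
    written = []
    for source in sorted(Path(folder).glob('*.' + ext)):
        # Skip copies produced by an earlier run
        if source.stem.endswith('-new'):
            continue

        # Parse the file and keep only the text
        html = source.read_text(encoding=encoding)
        text = BeautifulSoup(html, 'html.parser').get_text()

        # Save next to the source file, e.g. subs.ass -> subs-new.ass
        target = source.with_name(source.stem + '-new.' + ext)
        target.write_text(text, encoding=encoding)
        written.append(target)

    # Return the paths of the files that were written
    return written
```

You would call it with the same two values you edited above, for example strip_html_from_folder(r'C:/some_folder', 'ass').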

That’s it! 👍
