Remove HTML from files with Python

Got a file with some HTML you want to remove?

All you need to do is parse the file with a Python script that uses the Beautiful Soup library.

If that sounds unfamiliar, watch Corey Schafer's beginner video “Python Tutorial: Web Scraping with BeautifulSoup and Requests”.

If you are already familiar with Beautiful Soup, it is just a matter of calling soup.get_text().
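To see what that call does on its own, here is a minimal sketch. The sample string is made up for illustration and does not come from any real file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, made up for this example
html = '<p>Hello, <b>world</b>!</p>'

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')

# get_text() returns only the text between the tags
text = soup.get_text()
print(text)  # prints: Hello, world!
```

The tags are dropped and the text pieces are joined in document order.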

Requirements

  • Python 3
  • Beautiful Soup 4, which can be installed with pip install beautifulsoup4

Python script

Here is the whole script we are using. You will need to edit some of it.

  • Replace C:/some_folder with the folder containing the files you want to remove HTML from.
  • Replace ass with the file extension (without the leading dot) of the files you want to remove HTML from.
  • Run the script from the same folder as the files. The new files are written to the folder you run the script from.

# Import OS so we can list the folder and build file paths
import os

# Import Beautiful Soup 4 so we can parse HTML
from bs4 import BeautifulSoup

# Set the path where the target files are located
path = r'C:/some_folder'

# Set the file extension to look for
ext = 'ass'

# Start a loop - (for each file in the path do...)
for filename in os.listdir(path):

	# Only process files with the chosen extension
	if filename.endswith('.' + ext):

		# Get the file path including the file name and the file extension
		fullpath = os.path.join(path, filename)

		# Get the file name without the extension
		filename = os.path.splitext(os.path.basename(filename))[0]

		# Parse the file with Beautiful Soup (the file is closed automatically)
		with open(fullpath) as source:
			soup = BeautifulSoup(source, 'html.parser')
		text = soup.get_text()

		# Make a new file where the content can be saved
		# ("x" raises FileExistsError if the file already exists)
		with open(filename + '-new.' + ext, "x") as f:

			# Write the content to the file
			f.write(text)
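
If you would rather not edit the script for every folder, the same steps can be wrapped in a function. This is only a sketch of one possible variant, not a replacement for the script above: the function name strip_html_in_folder is made up for this example, and it assumes the files are UTF-8 encoded. Unlike the script above, it saves each new file next to the original and overwrites an existing -new file instead of stopping with an error.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def strip_html_in_folder(folder, ext, encoding='utf-8'):
    """Write an HTML-free '<name>-new.<ext>' copy of each matching file.

    Hypothetical helper; the UTF-8 default is an assumption about the files.
    """
    written = []

    # sorted() takes a snapshot, so files created below are not visited again
    for source in sorted(Path(folder).glob('*.' + ext)):

        # Skip output left over from an earlier run
        if source.stem.endswith('-new'):
            continue

        # Parse the file content and keep only the text
        soup = BeautifulSoup(source.read_text(encoding=encoding), 'html.parser')

        # Save the result next to the original (overwrites an existing file)
        target = source.with_name(source.stem + '-new.' + ext)
        target.write_text(soup.get_text(), encoding=encoding)
        written.append(target)

    return written
```

Calling strip_html_in_folder(r'C:/some_folder', 'ass') would then process every .ass file in that folder and return the paths of the new files.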

That’s it! 👍
