Remove page numbers script

I extracted text from a PDF file and I'd like to remove the page numbers from the text. Help me write a Python script to filter out lines that contain page numbers. (test.txt) 1 2 3 ... 92 93 ... 189 190 191

To remove lines that are likely page numbers (i.e. lines that contain only a number, possibly spanning from 1 to 191 in your case), you can use a simple Python script that filters out such lines.

Here’s a basic script to do that:

python
def remove_page_numbers(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()

    with open(output_file, 'w', encoding='utf-8') as outfile:
        for line in lines:
            stripped = line.strip()
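            # A page number is an ASCII digits-only line in the range 1 to 9999 (adjust as needed).
            # isascii() keeps int() from failing on characters like "²", which pass isdigit().
            if not (stripped.isascii() and stripped.isdigit() and 1 <= int(stripped) <= 9999):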
                outfile.write(line)

if __name__ == "__main__":
    remove_page_numbers("test.txt", "cleaned.txt")

How it works:

Reads the lines from test.txt.
Strips leading and trailing whitespace from each line.
If the line is only a number within a set range (like 1 to 9999), it's considered a page number and skipped.
All other lines are written to cleaned.txt.
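
The steps above can also be tried on an in-memory list of lines, without creating any files. This is a minimal sketch: the function name drop_page_number_lines, the max_page parameter, and the sample lines are illustrative assumptions, not part of the script above.

```python
def drop_page_number_lines(lines, max_page=9999):
    """Return the lines that are not bare page numbers (hypothetical helper)."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # A line counts as a page number if it is ASCII digits only and in range.
        is_page_number = (
            stripped.isascii()
            and stripped.isdigit()
            and 1 <= int(stripped) <= max_page
        )
        if not is_page_number:
            kept.append(line)
    return kept


# Made-up sample: two bare page numbers, one sentence containing a digit,
# and one number above the upper bound.
sample = ["Chapter 1\n", "1\n", "Some text with 3 numbers\n", "  92  \n", "12345\n"]
print(drop_page_number_lines(sample))
```

Lines that merely contain a number, and numbers above the bound, are kept; only bare in-range numbers are dropped.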

You can adjust:

The upper bound 9999, depending on the size of your document.
The matching logic, if page numbers appear with prefixes like "Page 1" or in Roman numerals.
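
As a rough sketch of the second adjustment, the check below also accepts a "Page " prefix and Roman numerals. The helper name looks_like_page_number and the exact formats are assumptions about what your PDF produces. The Roman pattern only accepts well-formed numerals up to 3999, but it will still match real one-line words that happen to be valid numerals, such as "I" or "mix", so check the output before relying on it.

```python
import re

# Assumed formats: "12", "Page 12" (any letter case), and Roman numerals such as "xiv".
ARABIC = re.compile(r'^\s*(?:page\s+)?\d+\s*$', re.IGNORECASE)

# Well-formed Roman numerals from 1 to 3999; the lookahead rejects blank lines.
ROMAN = re.compile(
    r'^\s*(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})\s*$',
    re.IGNORECASE,
)


def looks_like_page_number(line):
    """Return True if the whole line is a page number in one of the assumed formats."""
    return bool(ARABIC.match(line) or ROMAN.match(line))


for text in ["Page 12\n", "  xiv \n", "191\n", "Chapter 3 begins\n", "\n"]:
    print(repr(text), looks_like_page_number(text))
```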

Would you like help adjusting for different page number formats or edge cases?

How do I instead use regular expression to filter the page number line by line?

You can use Python's re module (regular expressions) to filter out lines that match a page number pattern, for example lines that contain only digits and optional whitespace.

Here’s how you can do that:

python
import re

def remove_page_numbers(input_file, output_file):
    # Regular expression to match lines that contain only digits (and optional whitespace)
    page_number_pattern = re.compile(r'^\s*\d+\s*$')

    with open(input_file, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()

    with open(output_file, 'w', encoding='utf-8') as outfile:
        for line in lines:
            if not page_number_pattern.match(line):
                outfile.write(line)

if __name__ == "__main__":
    remove_page_numbers("test.txt", "cleaned.txt")

Explanation of the regex:

^ and $ anchor the pattern to the start and end of the line; $ also matches just before a trailing newline.
\s* matches zero or more whitespace characters.
\d+ matches one or more digits.
So ^\s*\d+\s*$ matches lines like " 23 ", "5", or "\t189\n", i.e. lines that consist only of digits and optional whitespace.
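
A quick way to check the pattern is to run it against a few strings. The sample strings below are made up for illustration: the first three should match, and the last three (text with a digit, a decimal number, an empty string) should not.

```python
import re

page_number_pattern = re.compile(r'^\s*\d+\s*$')

# Made-up samples: digits-only lines with assorted whitespace, then non-matches.
samples = [" 23 ", "5", "\t189\n", "Section 2", "3.14", ""]

results = {text: bool(page_number_pattern.match(text)) for text in samples}
for text, matched in results.items():
    print(repr(text), matched)
```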

Benefits:

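Whitespace handling lives in the pattern itself, so no separate strip() and .isdigit() check is needed. Note that this version has no range check: any digits-only line is removed.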
Easier to customize if you want to exclude other patterns (e.g. "Page 1", Roman numerals, etc.).
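
One customization worth considering: the regex version, as written, removes every digits-only line regardless of its value. The sketch below captures the digits and reintroduces an upper bound. The function name is_page_number is hypothetical, and the default last_page=191 is an assumption taken from the page range mentioned in the question.

```python
import re

# Capture the digits so their value can be checked after the match.
page_number_pattern = re.compile(r'^\s*(\d+)\s*$')


def is_page_number(line, last_page=191):
    """Return True for a digits-only line whose value is between 1 and last_page."""
    m = page_number_pattern.match(line)
    return bool(m) and 1 <= int(m.group(1)) <= last_page


for text in [" 92 \n", "191\n", "2024\n", "0\n", "Page 3\n"]:
    print(repr(text), is_page_number(text))
```

With this bound, a standalone line such as "2024" (a year, say) is kept, while "92" is still dropped.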

Would you like to include patterns like "Page 12" or "12 of 200" as well?
