You said:
I extracted text from a PDF file and I'd like to remove the page numbers from it. Help me write a Python script to filter out lines that contain page numbers.
(test.txt)
1
2
3
...
92
93
...
189
190
191
To remove lines that are likely page numbers (i.e. lines that contain only a number, possibly spanning from 1 to 191 in your case), you can use a simple Python script that filters out such lines.
Here’s a basic script to do that:
```python
def remove_page_numbers(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()

    with open(output_file, 'w', encoding='utf-8') as outfile:
        for line in lines:
            stripped = line.strip()
            # Check if the line is just a number between 1 and 9999 (or another reasonable range).
            # isascii() guards int() against digit-like characters such as "²".
            if not (stripped.isascii() and stripped.isdigit() and 1 <= int(stripped) <= 9999):
                outfile.write(line)


if __name__ == "__main__":
    remove_page_numbers("test.txt", "cleaned.txt")
```
- Reads the lines from `test.txt`.
- Strips leading and trailing whitespace from each line.
- If the line is only a number within a set range (like 1 to 9999), it is treated as a page number and skipped.
- All other lines are written to `cleaned.txt`.

- The upper bound `9999` can be adjusted depending on the size of your document.
- Add more logic if page numbers appear with prefixes like "Page 1" or in Roman numerals.
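As a rough illustration of those two adjustments, the check could be pulled into a helper with a configurable upper bound and an optional prefix. The name `is_page_number` and its `max_page` and `prefix` parameters are hypothetical and not part of the script above; this is only a sketch that assumes the prefix is a single word such as "Page".

```python
def is_page_number(line, max_page=9999, prefix="Page"):
    """Return True if the line looks like a bare or prefixed page number.

    Hypothetical helper: max_page and prefix are illustrative parameters.
    """
    text = line.strip()
    # Drop an optional prefix (case-insensitive), so "Page 12" becomes "12".
    if text.lower().startswith(prefix.lower()):
        text = text[len(prefix):].strip()
    # ASCII digits only, so int() cannot fail on characters such as a superscript 2.
    return text.isascii() and text.isdigit() and 1 <= int(text) <= max_page


sample = ["Chapter 1", "1", "Page 12", "  93  ", "2024 was a good year", "250"]
# With max_page=191, "250" is out of range and is kept as ordinary text.
kept = [s for s in sample if not is_page_number(s, max_page=191)]
print(kept)
```

Setting `max_page` to the real page count (191 in your sample) lowers the chance of dropping a line of body text that happens to be a bare number.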
Would you like help adjusting for different page number formats or edge cases?
You can use Python's `re` module (regular expressions) to filter out lines that match a page number pattern, for example lines that contain only digits and possibly whitespace.
Here’s how you can do that:
```python
import re


def remove_page_numbers(input_file, output_file):
    # Regular expression to match lines that contain only digits (and optional whitespace)
    page_number_pattern = re.compile(r'^\s*\d+\s*$')

    with open(input_file, 'r', encoding='utf-8') as infile:
        lines = infile.readlines()

    with open(output_file, 'w', encoding='utf-8') as outfile:
        for line in lines:
            if not page_number_pattern.match(line):
                outfile.write(line)


if __name__ == "__main__":
    remove_page_numbers("test.txt", "cleaned.txt")
```
- `^` and `$` anchor the pattern to the start and end of the line.
- `\s*` matches any whitespace (including none).
- `\d+` matches one or more digits.
- So `^\s*\d+\s*$` matches lines like `" 23 "`, `"5"`, or `"\t189\n"`, i.e. lines that consist only of digits and optional whitespace.
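To see the pattern in action, here is a quick check against a few sample strings. The `samples` list and the `results` name are made up for illustration.

```python
import re

page_number_pattern = re.compile(r'^\s*\d+\s*$')

# Illustrative inputs: the first three should match, the rest should not.
samples = ["  23  ", "5", "\t189\n", "Chapter 5", "3.14", ""]
results = {s: bool(page_number_pattern.match(s)) for s in samples}

for text, matched in results.items():
    print(repr(text), matched)
```

Lines with any non-digit, non-whitespace character (such as `"3.14"`) and empty lines do not match, so they are kept.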
- More flexible than a bare `.isdigit()` check: it tolerates surrounding spaces or tabs without a separate `.strip()` call.
- Easier to customize if you want to exclude other patterns (e.g. "Page 1", Roman numerals, etc.).
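As one possible customization, the pattern could be widened to cover prefixed numbers, "N of M" forms, and Roman numerals. The `PAGE_LINE` pattern and the `is_page_line` helper below are hypothetical, and the formats they cover are assumptions about what your PDF might contain.

```python
import re

# Hypothetical extended pattern; adjust the alternatives to your document.
PAGE_LINE = re.compile(
    r"""
    ^\s*
    (?:
        (?:page\s+)?\d+(?:\s+of\s+\d+)?    # 12, Page 12, 12 of 200, Page 12 of 200
      |
        (?=[ivxlcdm])                      # require at least one Roman numeral letter
        m{0,3}(?:cm|cd|d?c{0,3})           # thousands and hundreds
        (?:xc|xl|l?x{0,3})                 # tens
        (?:ix|iv|v?i{0,3})                 # ones
    )
    \s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_page_line(line):
    """Return True if the whole line looks like a page marker."""
    return bool(PAGE_LINE.match(line))


for text in ["Page 12", "12 of 200", "xiv", "XII", "Chapter 12", "12 apples", "mild"]:
    print(repr(text), is_page_line(text))
```

One caveat: a line that contains only a word that is also a valid Roman numeral, such as "I" or "mix", would be removed too, so the Roman numeral branch is best enabled only if your document actually uses it.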
Would you like to include patterns like "Page 12" or "12 of 200" as well?