Python PyPDF2

2022-02-13

Ran into an issue with PyPDF2 where merging 10k documents threw a bad file descriptor error, which made sense. Linux limits a process to around 1000 open file handles by default, and digging into the Python code showed that the merger keeps each input file's handle open for as long as the merge is in progress. I wonder if the library could be modified to close the handles as it appends each file, but I'm guessing there is a reason it works this way.
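A quick way to confirm the limit from Python itself is the standard library's `resource` module (Unix only). This is a minimal sketch; the helper name `fd_limits` is made up for illustration. On many Linux systems the default soft limit is 1024, which would match the "around 1000" figure.

```python
import resource  # Unix-only standard library module


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = fd_limits()
# The soft limit is the one the process actually hits; the hard limit is the ceiling
# an unprivileged process may raise the soft limit to.
print(f"soft limit: {soft}, hard limit: {hard}")
```

Running `ulimit -n` in the shell reports the same soft limit.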

Ultimately we decided to do the merge in two passes: the first pass ran 10 merges of 1,000 files each, and the second pass merged those 10 intermediate files into the document we wanted.
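The two-pass approach could look roughly like the sketch below. It is not the code we actually ran: the function names (`chunked`, `merge_pdfs`, `merge_in_batches`) are hypothetical, and `merge_pdfs` assumes the PyPDF2 1.x `PdfFileMerger` class (later releases call it `PdfMerger`). The PyPDF2 import sits inside the function so the batching logic can be exercised without the library installed.

```python
import os
import tempfile


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_pdfs(paths, out_path):
    """Merge `paths` into `out_path` with PyPDF2 (assumes the 1.x API)."""
    from PyPDF2 import PdfFileMerger  # assumption: named PdfMerger in later releases

    merger = PdfFileMerger()
    try:
        for path in paths:
            merger.append(path)  # each input stays open until close()
        merger.write(out_path)
    finally:
        merger.close()  # release the handles held for this batch


def merge_in_batches(paths, out_path, batch_size=1000, merge_fn=merge_pdfs):
    """Two-pass merge: one intermediate file per batch, then merge the intermediates."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        intermediates = []
        for index, batch in enumerate(chunked(paths, batch_size)):
            part = os.path.join(tmp_dir, f"part_{index:04d}.pdf")
            merge_fn(batch, part)
            intermediates.append(part)
        # Second pass: only len(intermediates) files are open at once.
        merge_fn(intermediates, out_path)
```

With 10,000 inputs and `batch_size=1000`, this makes 10 intermediate files and then one final merge. Picking a batch size a bit below the soft limit leaves room for the output file and whatever else the process already has open.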

This was a fun little side adventure into the Python library to see how it handles files. There was no real need to do it, since I already suspected what was happening, but being able to verify it was nice.