- I leave the PDF extraction bit to “pdftohtml -xml”.
- My Perl script tells you at which “physical columns” text is found within the PDF file.
- You choose which of those “physical columns” really make sense to you as logical column starters.
- Now you run my Perl script with those few relevant physical columns specified,
and it creates a CSV file for you.
- For each logical row, a few physical rows get created.
- If you want, you can merge cells from neighboring rows into logical cells;
LibreOffice Calc, OpenOffice Calc, or Excel will do for this step.
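
The steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl script from the repository. It assumes the XML consists of `<page>` elements holding `<text>` fragments with `top` and `left` attributes, which is the shape “pdftohtml -xml” emits. The sample data, the function names `physical_columns` and `to_rows`, and the chosen column starts are all made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET
from bisect import bisect_right
from collections import Counter

# Made-up sample in the shape that "pdftohtml -xml" emits:
# <page> elements containing <text top=... left=...> fragments.
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="40" height="12" font="0">Date</text>
<text top="100" left="200" width="80" height="12" font="0">Description</text>
<text top="100" left="500" width="50" height="12" font="0">Amount</text>
<text top="120" left="50" width="60" height="12" font="0">2015-05-01</text>
<text top="120" left="200" width="90" height="12" font="0"><b>Rent</b> May</text>
<text top="120" left="520" width="30" height="12" font="0">700.00</text>
<text top="135" left="200" width="90" height="12" font="0">flat 3</text>
</page>
</pdf2xml>"""


def physical_columns(root):
    """Count how many text fragments start at each `left` offset."""
    return Counter(int(t.get("left")) for t in root.iter("text"))


def to_rows(root, column_starts):
    """Put each fragment into the logical column whose start is the
    closest one at or to the left of the fragment's `left` offset.
    Fragments sharing the same page and `top` form one physical row."""
    starts = sorted(column_starts)
    rows = {}  # (page number, top) -> list of cell strings
    for page in root.iter("page"):
        for t in page.iter("text"):
            left, top = int(t.get("left")), int(t.get("top"))
            # Fragments left of the first start fall into column 0.
            col = max(bisect_right(starts, left) - 1, 0)
            key = (int(page.get("number")), top)
            cells = rows.setdefault(key, [""] * len(starts))
            # itertext() also collects text nested in <b>, <i>, <a>.
            text = "".join(t.itertext()).strip()
            cells[col] = (cells[col] + " " + text).strip()
    return [rows[key] for key in sorted(rows)]


root = ET.fromstring(SAMPLE_XML)

# Step 1: list the physical columns, so you can pick the serious ones.
print(sorted(physical_columns(root).items()))

# Step 2: build the CSV from the column starts you chose (assumed here).
table = to_rows(root, column_starts=[50, 200, 500])
out = io.StringIO()
csv.writer(out).writerows(table)
print(out.getvalue())
```

In this sample the fragment “flat 3” ends up in a physical row of its own, below the row it logically belongs to. That is the kind of cell you would merge by hand in the spreadsheet afterwards. The sketch groups rows by exact `top` value, so fragments that sit a pixel apart vertically would land in separate rows; real documents may need some tolerance there.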
Does this sound interesting to you?
Update 2015-05-25: Uploaded the Perl script and its Shell script wrapper to https://github.com/JochenHayek/misc.