wp.jochen.hayek.name/blog-en

table_pdf2csv.pl : extracting tables from PDF, saving them as CSV


I leave the PDF extraction itself to “pdftohtml -xml”.
My Perl script tells you at which “physical columns” text is found within the PDF file.
You choose which of those “physical columns” make sense to you as the starts of logical columns.
Then you run my Perl script again with just those few physical columns specified,
and it creates a CSV file for you.
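
The two passes described above can be sketched roughly as follows. This is not the author's Perl script but a minimal Python illustration, assuming that a "physical column" corresponds to the `left` attribute that `pdftohtml -xml` writes on each `<text>` element, and that a physical row corresponds to the `top` attribute. The function names (`left_histogram`, `to_rows`, `rows_to_csv`) and the sample XML are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from bisect import bisect_right
from collections import Counter

# Hypothetical sample in the shape of "pdftohtml -xml" output.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1263" width="892">
<text top="100" left="50" width="60" height="12" font="0">2015-04-01</text>
<text top="100" left="200" width="80" height="12" font="0">Invoice 17</text>
<text top="100" left="400" width="40" height="12" font="0">12.50</text>
<text top="115" left="200" width="80" height="12" font="0">second line</text>
</page>
</pdf2xml>"""


def left_histogram(xml_text):
    """Pass 1: count how often text starts at each horizontal position."""
    root = ET.fromstring(xml_text)
    return Counter(int(t.get("left")) for t in root.iter("text"))


def to_rows(xml_text, column_starts):
    """Pass 2: place each text fragment into the logical column whose
    start position is the closest one at or to the left of the fragment."""
    starts = sorted(column_starts)
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        by_top = {}  # one physical row per distinct "top" value
        for t in page.iter("text"):
            top = int(t.get("top"))
            left = int(t.get("left"))
            # Fragments left of the first start fall into column 0.
            col = max(bisect_right(starts, left) - 1, 0)
            cells = by_top.setdefault(top, [""] * len(starts))
            content = "".join(t.itertext()).strip()
            cells[col] = (cells[col] + " " + content).strip()
        rows.extend(by_top[top] for top in sorted(by_top))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(left_histogram(SAMPLE_XML))  # inspect, then choose column starts
    print(rows_to_csv(to_rows(SAMPLE_XML, [50, 200, 400])))
```

You would look at the histogram first, pick the positions that occur often enough to be real column starts, and pass only those to the second step.
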
Each logical row may come out as several physical rows.
If you want, you can merge cells from neighboring rows into logical cells;
LibreOffice Calc, OpenOffice Calc, or Excel will do for this step.
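
If you would rather script the merge than do it in a spreadsheet, here is a minimal Python sketch. It rests on an assumption that will not hold for every table: a physical row starts a new logical row only when a chosen key column (by default the first one) is non-empty, and every other physical row is a continuation of the logical row before it. The function name `merge_physical_rows` and its parameters are hypothetical and not part of the author's scripts.

```python
def merge_physical_rows(rows, key_column=0, joiner=" "):
    """Fold continuation rows into the preceding logical row.

    Assumption: a row with an empty key column continues the previous row.
    """
    logical = []
    for row in rows:
        if row[key_column].strip() or not logical:
            # Non-empty key (or very first row): start a new logical row.
            logical.append(list(row))
        else:
            # Continuation: append each non-empty cell to the cell above it.
            last = logical[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    last[i] = (last[i] + joiner + cell).strip()
    return logical


if __name__ == "__main__":
    physical = [
        ["2015-04-01", "Invoice 17", "12.50"],
        ["", "second line", ""],
        ["2015-04-02", "Invoice 18", "3.00"],
    ]
    for row in merge_physical_rows(physical):
        print(row)
```

For tables where no single column reliably marks the start of a logical row, merging by hand in the spreadsheet remains the safer route.
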

Does this sound interesting to you?

Update 2015-05-25: Uploaded the Perl script and its Shell script wrapper to https://github.com/JochenHayek/misc.

Comments

2 responses to “table_pdf2csv.pl : extracting tables from PDF, saving them as CSV”

Admir Monteiro

2015-04-08

Hello, I would like to use your script for a project. Would you be so kind as to share the script? I would greatly appreciate it.

Thank you

1. Jochen Hayek
  
  2015-04-09
  
  Of course I would like to share the script(s).
  I just updated the article itself to tell you where you can find the sources. Maybe you will find them at least a little useful.
  