table_pdf2csv.pl : extracting tables from PDF, saving them as CSV

  • I leave the PDF extraction itself to “pdftohtml -xml”.
  • My Perl script tells you at which “physical columns” text is found within the PDF file.
  • You choose which “physical columns” really make sense to you as logical column starters.
  • Now you run my Perl script with those few meaningful physical columns specified,
    and it creates a CSV file for you.
  • For each logical row, a few physical rows get created.
  • If you want, you can merge cells from neighboring rows into logical cells;
    LibreOffice Calc, OpenOffice Calc, or Excel will do for this step.
Does this sound interesting to you?
Update 2015-05-25: Uploaded the Perl script and its Shell script wrapper to https://github.com/JochenHayek/misc.
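
The workflow above can be illustrated with a small sketch. This is not the Perl script from the repository; it is a minimal Python approximation, assuming only that “pdftohtml -xml” emits <text> elements with integer “top” and “left” attributes inside <page> elements. The sample XML, the function names (read_fragments, physical_columns, to_csv) and the chosen column starts are all hypothetical and exist only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical sample in the shape that "pdftohtml -xml" emits:
# <text> elements carrying "top" and "left" pixel positions.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<text top="100" left="50" width="40" height="12" font="0">Item</text>
<text top="100" left="200" width="40" height="12" font="0">Price</text>
<text top="120" left="50" width="40" height="12" font="0">Apple</text>
<text top="120" left="205" width="40" height="12" font="0">1.20</text>
<text top="140" left="50" width="40" height="12" font="0">Pear,</text>
<text top="140" left="95" width="40" height="12" font="0">large</text>
<text top="140" left="205" width="40" height="12" font="0">2.50</text>
</page>
</pdf2xml>"""


def read_fragments(xml_text):
    """Return a (page, top, left, text) tuple for every <text> element."""
    fragments = []
    for page in ET.fromstring(xml_text).iter("page"):
        number = int(page.get("number"))
        for el in page.iter("text"):
            # itertext() also flattens inline markup such as <b> or <i>.
            text = "".join(el.itertext()).strip()
            fragments.append((number, int(el.get("top")), int(el.get("left")), text))
    return fragments


def physical_columns(fragments):
    """Count how many fragments start at each "left" offset."""
    return sorted(Counter(left for _, _, left, _ in fragments).items())


def to_csv(fragments, column_starts):
    """Put each fragment into the last chosen column start at or before it."""
    starts = sorted(column_starts)
    rows = defaultdict(lambda: [[] for _ in starts])
    for page, top, left, text in sorted(fragments):
        # Fragments left of the first start are clamped into column 0.
        col = max(bisect_right(starts, left) - 1, 0)
        rows[(page, top)][col].append(text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for key in sorted(rows):  # one CSV line per physical row
        writer.writerow([" ".join(cell) for cell in rows[key]])
    return out.getvalue()


fragments = read_fragments(SAMPLE_XML)
print(physical_columns(fragments))    # step 1: inspect the physical columns
print(to_csv(fragments, [50, 200]))   # step 2: pick the starts, write CSV
```

In this sample, text starts at the offsets 50, 95, 200 and 205, but only 50 and 200 are worth keeping as logical column starters; the fragments at 95 and 205 then fall into the column to their left. Each distinct “top” value becomes one physical row, so merging physical rows into logical rows remains a separate step, as described above.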

Comments

2 responses to “table_pdf2csv.pl : extracting tables from PDF, saving them as CSV”

  1. Admir Monteiro

    Hello, I would like to use your script for a project. Would you be so kind as to share the script? I would greatly appreciate it.

    Thank you

    1. Jochen Hayek

      Of course, I am happy to share the script(s).
      I just updated the article itself to tell you where you can find the sources. Maybe you will find them at least a little useful.
