Blog

  • on PDF

    Nowadays, on the Web or through e-mail, you receive more and more documents as PDF files instead of on paper.

    Roughly speaking, PDF documents are expected to display the same way on every computer
    platform (as opposed to documents created by the usual word processing
    software). This is regarded as a major advantage of PDF.

    PDF vs. fonts vs. platform (in)dependence vs. resizability/scalability

    Whenever a PDF document makes use of outline fonts or stroke fonts as opposed to bitmap fonts (see the Wikipedia article on computer fonts!), you are able to rescale your document to different sizes without any loss of quality in the fonts used. This is generally considered another major advantage.
    But computer fonts are not in the public domain, so each computer platform displays PDF documents with whichever fonts it has available.

    So what can we do about the platform dependency stemming from fonts?

    • Include the fonts: that’s the approach used by PDF/A.
      PDF/A is employed especially where documents need to remain available even after many years, as in document archives.
      The major downside of this approach: PDF/A documents are much bigger than usual PDF documents, because storing the fonts within them takes a lot of space.
    • Another approach is to render text and fonts into ready-made bitmaps.
      Of course, documents of this kind display best with a 1:1 relationship between the pixels in your document and the pixels on your screen or in your printer output.
      Any rescaling results in poor quality.
      And I think you understand this very well: there is no text (as text) left at all in your PDF document, and you will not be able to extract any text from such a document.
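
    The difference between rescaling a ready-made bitmap and re-rendering an outline can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the “glyph” is a plain disk, and the function names rasterize_disk, upscale_bitmap and show are made up for this example.

```python
def rasterize_disk(size):
    """Render a disk (centre 0.5/0.5, radius 0.4, unit coordinates) onto a size x size grid."""
    grid = []
    for y in range(size):
        row = []
        for x in range(size):
            # Sample the outline at the centre of each pixel.
            dx = (x + 0.5) / size - 0.5
            dy = (y + 0.5) / size - 0.5
            row.append(1 if dx * dx + dy * dy <= 0.4 * 0.4 else 0)
        grid.append(row)
    return grid


def upscale_bitmap(bitmap, factor):
    """Nearest-neighbour upscaling: every pixel becomes a factor x factor block."""
    scaled = []
    for row in bitmap:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        for _ in range(factor):
            scaled.append(list(wide_row))
    return scaled


def show(bitmap):
    """Turn a grid of 0/1 values into printable text."""
    return "\n".join("".join("#" if p else "." for p in row) for row in bitmap)


if __name__ == "__main__":
    small = rasterize_disk(4)
    print(show(upscale_bitmap(small, 2)))  # blocky: only 4 x 4 samples exist
    print()
    print(show(rasterize_disk(8)))         # re-rendered from the outline
```

    The upscaled bitmap can only repeat the pixels it already has, while the outline is sampled again at the new resolution, so the two 8 x 8 results differ.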

    Now you know: different kinds of PDF documents come with different advantages and also disadvantages.

    I am interested here in PDF documents that are not rendered into “one bitmap per page”, but which rather contain the source document’s text. Extracting that text simply as text is more or less a piece of cake, and software already exists for
    this purpose.

    PDF basics

    Before I dive with you into what information we want to extract from PDF files, I want to explain PDF a little.

    I am honestly not too deep into PDF, but I
    understand it as an advanced and optimized version of PostScript. What little
    I know of PostScript is this (please find a slightly lengthier version in
    the Wikipedia article!):

    • It’s a stack-based programming language like Forth using reverse
      Polish notation.
    • It has data structures like arrays and dictionaries, but nothing
      more abstract than that.
    • Subprograms are called like, and regarded as, operators of the stack
      machine.
    • Some relevant details may be coded into operator names.
    • Some other relevant details (like page numbers) are coded into
      comment lines, see the article on PostScript Document
      Structuring Conventions. I have no clue what corresponds to that
      in PDF. Maybe there are language elements for that.
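
    To make the stack machine and the reverse Polish notation a bit more tangible, here is a tiny interpreter sketch in Python. It only borrows a handful of PostScript operator names (add, sub, mul, dup, exch, pop) and handles integers only. The helpers execute and define are made up for this example, and define merely stands in for PostScript’s own way of defining procedures.

```python
def op_add(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)


def op_sub(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a - b)


def op_mul(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)


def op_dup(stack):
    stack.append(stack[-1])


def op_exch(stack):
    b, a = stack.pop(), stack.pop()
    stack.extend([b, a])


def op_pop(stack):
    stack.pop()


# Operator table: built-ins and user-defined subprograms live side by side.
OPERATORS = {
    "add": op_add, "sub": op_sub, "mul": op_mul,
    "dup": op_dup, "exch": op_exch, "pop": op_pop,
}


def execute(program, stack):
    """Run a whitespace-separated program; operands come first, operators last."""
    for token in program.split():
        if token in OPERATORS:
            OPERATORS[token](stack)
        else:
            stack.append(int(token))  # anything else is pushed as a number
    return stack


def define(name, body):
    """Register a subprogram; afterwards it is called like any built-in operator."""
    OPERATORS[name] = lambda stack: execute(body, stack)


if __name__ == "__main__":
    print(execute("3 4 add 2 mul", []))  # (3 + 4) * 2
    define("square", "dup mul")
    print(execute("5 square", []))
```

    Once “square” is registered, nothing in the program text distinguishes it from a built-in operator, which is the point of the third bullet above.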

    Now you have an idea of what PDF looks like, and you may have a vague idea of what is possible with PDF and what isn’t.
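
    As a hedged illustration of why text stays extractable in such an operator-based page description: in PDF page content, a string in parentheses followed by the Tj operator shows that string as text. The sketch below pulls such strings out of a small, hand-written and uncompressed content fragment. The sample data and the function name extract_shown_text are made up, and real PDF files usually compress their content and use font encodings that this sketch ignores, which is what the existing extraction software takes care of.

```python
import re

# Hand-written sample of uncompressed page content (made-up data).
SAMPLE_CONTENT = r"""
BT
/F1 12 Tf
72 720 Td
(Hello, PDF!) Tj
0 -14 Td
(Second line \(with parentheses\)) Tj
ET
"""

# A string in parentheses followed by the Tj operator; backslash escapes
# are allowed inside, nested unescaped parentheses are not handled.
SHOW_TEXT = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_shown_text(content):
    """Return the strings handed to the Tj operator, in order of appearance."""
    strings = []
    for match in SHOW_TEXT.finditer(content):
        # Only undo escaped parentheses and backslashes; other escapes stay as-is.
        strings.append(re.sub(r"\\([()\\])", r"\1", match.group(1)))
    return strings


if __name__ == "__main__":
    for line in extract_shown_text(SAMPLE_CONTENT):
        print(line)
```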

  • Eat Pray Love (2010) – IMDb

    Eat Pray Love (2010) – IMDb:

    A married woman realizes how unhappy her marriage really is, and that
    her life needs to go in a different direction. After a painful divorce,
    she takes off on a round-the-world journey to “find herself”.

    The married woman is portrayed by Julia Roberts, so even if there was far too much of that self-finding thing in the movie for me, I always enjoy looking at her – apart from when she looks sad, because I find her ugly then – but I really like her smile.

    Javier Bardem played her Brazilian lover (although he is actually a Spaniard). He even spoke some Portuguese, and he did a good and serious job.

    The nicest music in the movie (IMHO) is actually also Brazilian, and I loved it (you can of course also find it on YouTube, but not a nice version with Bebel Gilberto performing).

    There were a few scenes that really got me crying, e.g. the farewell scene between the Brazilian father and his son.

    This was my Saturday night movie at the CineStar Original movie theatre at the Sony Center in Berlin. I really enjoyed it, but only for the pictures and the music.
    The story and the main character are truly sick, and this is how one of the reviewers on IMDb ended his text:

    Do not see this movie and encourage others to avoid it like the plague!

    He titled his review “American Films Continue to Glorify Female Borderline Personality Disorder”, and I think I agree with him.

  • Them (2006) – IMDb

    Them (2006) – IMDb

    Horror | Mystery | Thriller

    Watched this French-Romanian scary movie on Saturday / Sunday night. It really took hold of me.

  • e-mail addresses and “sub-addressing”, also called “plus addressing” or “plussing”

    E-mail messages addressed to John.Doe+MailingListName@gmail.com are meant to actually go to johndoe@gmail.com, in other words:

    • “.” characters get removed when computing the real mailbox (a Gmail convention rather than part of plussing itself)
    • everything from the “+” character up to the “@” character (not including the latter) gets removed entirely
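
    The two rules can be sketched as a small Python function. The name split_plus_address is made up for this example, and the lower-casing is an assumption read off the johndoe@gmail.com example above, not a general mail rule.

```python
def split_plus_address(address):
    """Return (real_mailbox_address, tag) for a Gmail-style plus address."""
    local, _, domain = address.rpartition("@")
    # Everything from the "+" up to the "@" is the tag and gets cut off.
    mailbox, _, tag = local.partition("+")
    # Dots are dropped and case is ignored (assumption based on the example).
    mailbox = mailbox.replace(".", "").lower()
    return f"{mailbox}@{domain.lower()}", tag


if __name__ == "__main__":
    print(split_plus_address("John.Doe+MailingListName@gmail.com"))
```

    An address without a “+” simply yields an empty tag.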

    On the recipient side, software can check for plussing and make decisions based on the string between the “+” and the “@”.
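
    Such a decision could look like the following Python sketch, which picks a mail folder from the tag. The folder names, the FOLDER_BY_TAG table and the function folder_for are hypothetical and only serve as an illustration; this is not how any particular mail software does it.

```python
# Hypothetical mapping from tag (lower-cased) to mail folder.
FOLDER_BY_TAG = {
    "mailinglistname": "lists/MailingListName",
    "shop": "shopping",
}


def folder_for(recipient, default="INBOX"):
    """Choose a folder based on the string between the "+" and the "@"."""
    local = recipient.rpartition("@")[0]
    tag = local.partition("+")[2]  # empty if there is no "+"
    return FOLDER_BY_TAG.get(tag.lower(), default)


if __name__ == "__main__":
    print(folder_for("John.Doe+MailingListName@gmail.com"))
    print(folder_for("johndoe@gmail.com"))
```

    Unknown or missing tags fall back to the default folder, so mail is never dropped just because a tag is new.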

    Gmail, Hotmail, and Posteo support plussing. GMX does not.

    On my domains I have a catch-all rule for e-mail forwarding aliases, and procmail rules help me with the checks.

    When I get around to it, I will write here under “e-mail” about how I make use of IMAP, procmail, and fetchmail.

    Update 2023-04-04: There is another variant of sub-addressing: if you own the full right side of the “@”, you can also use the full left side of the “@” as a “catch-all”, i.e. mail_jh@John.Doe.name and mail_aw@John.Doe.name can be John Doe’s dedicated mail addresses for me (Jochen Hayek) and for “aw” (like Alex Winner), respectively.