Private Homepage of Hartmut Henkel

Experimental JBIG2 Driver for pdfTeX

News (2006-05-04)

Now an open-source JBIG2 encoder is available :-) Without one, the JBIG2 driver for pdfTeX didn't make much sense. I have therefore updated the experimental driver so that it basically works with pdftex-1.40-beta-20060213. Here are the sources:

Introduction

This is an informal write-up of my private, spare-time fiddling with JBIG2 file inclusion in pdfTeX (it's for fun). I am slowly writing an experimental driver that should allow native JBIG2 images from JBIG2 files to be included in a PDF file created by pdfTeX.

PdfTeX can already include JBIG2 files that are shrink-wrapped in PDF (as PDF files), but this approach most likely loses some compression efficiency, since partial inclusion of such JBIG2/PDF files does not reduce the page0 segments to the minimum set required (see below). It may therefore still be worthwhile to experiment with native JBIG2 file inclusion.

Status Overview

The driver, crude and experimental as it is, allows JBIG2 images from single-page JBIG2 files, and multiple images from multi-page JBIG2 files, to be included through the standard \pdfximage method. A given page is selected with the page option of the \pdfximage command (with single-page JBIG2 files, this should be page 1). The JBIG2 files may or may not contain page0 segments; the driver autonomously puts only the required page0 information from the JBIG2 files into the PDF file. The number of included JBIG2 image files is limited only by the computer's memory.

Here is an example PDF file generated by pdfTeX with the JBIG2 driver; it includes all three images from the JBIG2 datastream example (Annex H of the JBIG2 draft standard) in one PDF file.

The driver has no installation procedure yet; it's fun in progress. See below for more.

The JBIG2 Standard

Adobe Systems has defined a new filter, /JBIG2Decode, in its newest PDF format, version 1.4, which allows image data to be decoded according to the JBIG2 standard. This feature seems to have been first supported by Adobe Acroread version 5.0.

JBIG2 encoding is for bi-level images only, e.g. scanned text, where it is said to give very high compression ratios, lossy or lossless. It is especially well suited to compressing multi-page documents, because it uses a global page holding information shared by all pages. This rather new standard is being worked out by the JBIG Committee. The latest JBIG2 draft standard is available from here as a PDF file.

JBIG2 Test Data Streams

I don't yet have any program that produces JBIG2 files. But some sample data streams are available from here. There is also a small but working ASCII-JBIG2 example in section 3.3.6 of the PDF reference, which can be typed in and converted to binary, e.g. with some awk tool. It produces two letters 'C', stacked one above the other.
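As an alternative to an awk tool, the conversion of a typed-in hex listing to binary can be sketched in C. This is only a minimal sketch: it assumes the example was typed in as plain hexadecimal digits separated by whitespace, and the function name hex_to_bytes() is made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode ASCII hex text into bytes, skipping whitespace.
 * Returns the number of bytes written, or (size_t)-1 if the text
 * contains a non-hex character, has an odd number of digits, or
 * does not fit into the output buffer.
 * (Hypothetical helper, not part of the driver.)
 */
size_t hex_to_bytes(const char *text, unsigned char *out, size_t out_size)
{
    size_t n = 0;
    int high = -1;              /* pending high nibble, -1 if none */

    assert(text != NULL && out != NULL);
    for (; *text != '\0'; text++) {
        int v;
        if (isspace((unsigned char) *text))
            continue;
        v = hex_value(*text);
        if (v < 0)
            return (size_t) -1;
        if (high < 0) {
            high = v;
        } else {
            if (n >= out_size)
                return (size_t) -1;
            out[n++] = (unsigned char) ((high << 4) | v);
            high = -1;
        }
    }
    return high < 0 ? n : (size_t) -1;
}
```

The resulting byte buffer could then be written to a file with fwrite() and fed to the driver or a viewer.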

PdfTeX Issues

JBIG2 files contain one or several pages (images), and optionally one global page (page0). This global page contains decoding information that can be referenced by one or more pages; this reuse of the same information for several images is one reason for the high compression ratio achievable with the JBIG2 standard.

Pages are made up of segments; each segment has a page association. Segments reference other segments by their segment number. When pdfTeX requests an image from a JBIG2 file through the JBIG2 driver, and this image references some page0 segments, these page0 segments must also be included in the PDF file. This is different from the way e.g. JPEG files are handled:

This multi-page approach is similar to PDF file inclusion, in fact the same page option can be used.

The Experimental Driver

The approach I currently use is to have all bookkeeping about required page0 segments only within the JBIG2 driver. The main reasons are:

The current driver registers and parses a new JBIG2 file when the first image is requested from it in the read_jbig2_info() phase; per-file as well as per-image information (width, height, etc.) is stored in structures. At the end of the file scan, search trees for pages and page0 segments are built, so that pages can be found quickly in the later write phase and page0 segments can then be referenced quickly. Further requests through read_jbig2_info() for the same file (a file with the same name) do not re-read the file, but use the stored information instead.
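The "parse once, then reuse" idea can be sketched roughly as follows. This is not the code from writejbig2.c: the struct jb2_file, the function lookup_or_register() and the parse_count field are invented for illustration, and a plain linked list stands in for the search trees mentioned above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-file record; the real driver stores much more. */
typedef struct jb2_file {
    char *name;                 /* file name used as lookup key */
    int parse_count;            /* how often the file was scanned */
    struct jb2_file *next;
} jb2_file;

static jb2_file *file_list = NULL;

/*
 * Return the record for the given file name. The (placeholder) file
 * scan happens only on the first request; later requests for the
 * same name get the stored record back. Returns NULL if out of memory.
 */
jb2_file *lookup_or_register(const char *name)
{
    jb2_file *f;

    assert(name != NULL);
    for (f = file_list; f != NULL; f = f->next)
        if (strcmp(f->name, name) == 0)
            return f;           /* already known: no re-read */

    f = malloc(sizeof *f);
    if (f == NULL)
        return NULL;
    f->name = malloc(strlen(name) + 1);
    if (f->name == NULL) {
        free(f);
        return NULL;
    }
    strcpy(f->name, name);
    f->parse_count = 1;         /* stands in for the one-time scan */
    f->next = file_list;
    file_list = f;
    return f;
}
```

Because the key is the name alone, the same file reached through two different path spellings would be registered twice in this sketch.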

In the write_jbig2() phase of an image, the segments of the requested page are written out, and referenced page0 segments are recursively marked as required (but the page0 segments are not yet written). This keeps the page0 segments in the PDF file to the minimum needed when only a subset of pages is extracted from a given JBIG2 file.
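The recursive marking might look roughly like the sketch below. The segment struct, the fixed-size reference array and the linear search are simplifying assumptions made for this example (the driver uses search trees), and the names mark_page() and mark_refs() are made up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REFS 4              /* arbitrary limit for this sketch */

typedef struct {
    unsigned long number;       /* segment number */
    unsigned long page;         /* page association; 0 = global page */
    size_t nrefs;               /* number of referred-to segments */
    unsigned long refs[MAX_REFS];
    int required;               /* set for page0 segments that are needed */
} segment;

/* Linear search by segment number; NULL if not found. */
static segment *find_segment(segment *segs, size_t n, unsigned long number)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (segs[i].number == number)
            return &segs[i];
    return NULL;
}

/* Mark every page0 segment reachable from s, following references. */
static void mark_refs(segment *segs, size_t n, const segment *s)
{
    size_t i;
    for (i = 0; i < s->nrefs; i++) {
        segment *t = find_segment(segs, n, s->refs[i]);
        if (t == NULL || t->page != 0 || t->required)
            continue;           /* not global, or already handled */
        t->required = 1;
        mark_refs(segs, n, t);  /* page0 segments may refer to others */
    }
}

/* Mark the page0 segments needed by all segments of one page. */
void mark_page(segment *segs, size_t n, unsigned long page)
{
    size_t i;
    assert(page != 0);
    for (i = 0; i < n; i++)
        if (segs[i].page == page)
            mark_refs(segs, n, &segs[i]);
}
```

Checking the required flag before recursing means each page0 segment is visited at most once, however many pages or segments refer to it.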

If a JBIG2 file does not contain page0 segments, no JBIG2Globals PDF object is created. If there are page0 segments in the JBIG2 file, a JBIG2Globals object number is reserved in the write_jbig2() phase. If at the end of image inclusion from that JBIG2 file no page0 segment has been referenced, an empty JBIG2Globals stream will be the result.

No JBIG2Globals stream is written out until the very end; the user might still include more images from the same JBIG2 file, which might also change the number of required page0 segments and therefore the JBIG2Globals object of that JBIG2 file. Only when the PDF file is finalized is there a single call from the pdfTeX core (chunk 765) to the function flushjbig2page0objects(). This function steps through the list of accessed JBIG2 files and writes out any pending JBIG2Globals PDF objects.
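The deferred flush can be pictured with a small sketch like this one. It only models the bookkeeping: the struct jb2_file_state, its fields and the function flush_pending_globals() are hypothetical, and the actual writing of the PDF stream is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file state kept until the PDF is finalized. */
typedef struct {
    const char *name;
    int globals_objnum;         /* reserved object number; 0 = none */
    int flushed;                /* 1 once the globals object is written */
} jb2_file_state;

/*
 * Walk all accessed files and "write" every JBIG2Globals object that
 * was reserved but not yet written. Returns how many were written.
 */
int flush_pending_globals(jb2_file_state *files, size_t n)
{
    size_t i;
    int written = 0;

    assert(files != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (files[i].globals_objnum == 0 || files[i].flushed)
            continue;           /* nothing reserved, or already done */
        /* the real driver would emit the required page0 segments here */
        files[i].flushed = 1;
        written++;
    }
    return written;
}
```

Files without page0 segments never reserve an object number, so they are simply skipped at flush time.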

I have experimented with JBIG2 image inclusion in PDF streams generated by pdfTeX as part of the teTeX bundle, using the latest beta version at the time (teTeX-src-beta-20021225.tar.gz). The experimental driver is writejbig2.c. I put it into the pdftexdir directory of the teTeX tree on my Linux PC (Debian 3.0r1), together with the other drivers. A few other files required patching, just to add jbig2 things alongside the already existing jpeg things. Here is the list of new/patched files, all in the subdirectory pdftexdir:

The JBIG2 pictures must have the file name ending '.jb2' or '.jbig2'.
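A check for these two endings could be sketched as below. The function name is_jbig2_name() is invented here, and the comparison is assumed to be case-sensitive; how pdfTeX actually recognizes the file type is not shown.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if s ends with suffix, 0 otherwise. */
static int ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

/* Return 1 if the file name ends in ".jb2" or ".jbig2" (hypothetical helper). */
int is_jbig2_name(const char *filename)
{
    assert(filename != NULL);
    return ends_with(filename, ".jb2") || ends_with(filename, ".jbig2");
}
```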

Experimenting

I was able to test the driver on the roughly 28 available JBIG2 files, which are of the following types:

The driver could process all three types.

The recent Linux Acroread, Version x86 linux 5.05 Apr 25 2002, still chokes on one file from the above-mentioned set, 042_13.jbig2, with the message 'Bad error code'. No idea why.

Xpdf 2.01 has minor problems with some files (reported). The origin of the problems is unclear.

TODOs

End Remark

Help, advice, and criticism are greatly welcome!

This page was first put online on 8 January 2003. Here is a link to my first try at a JBIG2 driver.