Experimental JBIG2 Driver for pdfTeX

News (2006-05-04)

Now an open-source JBIG2 encoder is available :-) Without this the JBIG2 driver for pdftex didn't make much sense. Therefore I have updated the experimental driver so that it basically works with pdftex-1.40-beta-20060213. Here are the sources:

jbig2-pdftex-060531.tgz The latest JBIG2 driver code for pdftex.

Introduction

This is an informal write-up of my private, spare-time fiddling with JBIG2 file inclusion in pdfTeX (it's for fun). I am slowly writing an experimental driver that will allow native JBIG2 images from JBIG2 files to be included in a PDF file created by pdfTeX.

PdfTeX can already include JBIG2 files that are shrink-wrapped in PDF (as PDF files), but this approach most likely loses some compression efficiency, because partial inclusion of such JBIG2/PDF files does not reduce the page0 segments to the minimum required set (see below). It might therefore still be worthwhile to experiment with native JBIG2 file inclusion.

Status Overview

The driver, crude and experimental as it is, uses the standard \pdfximage method to include JBIG2 images from single-page JBIG2 files, and multiple images from multi-page JBIG2 files. A given page is selected with the page option of the \pdfximage command (with single-page JBIG2 files, this should be page 1). The JBIG2 files may or may not contain page0 segments; the driver autonomously puts only the required page0 information from the JBIG2 files into the PDF file. The number of included JBIG2 image files is limited only by the computer's memory.

You find here an example PDF file generated by pdfTeX with the JBIG2 driver, which includes all three images from the JBIG2 datastream example (Annex H of the JBIG2 draft standard) in one PDF file.

The driver has no installation procedure yet; it's fun in progress. See below for more.

The JBIG2 Standard

Adobe Systems have defined a new filter /JBIG2Decode in their newest PDF format, version 1.4, which allows decoding of image data according to the JBIG2 standard. It seems that this feature is first supported by Adobe Acroread version 5.0.

JBIG2 encoding is for bi-level images only, e.g. scanned text, where it is said to give very high lossy or lossless compression ratios. It is especially well suited to compressing multi-page documents, because a global page can hold information shared by all pages. This rather new standard is being worked out by the JBIG Committee. The latest JBIG2 draft standard is available from here as a PDF file.

JBIG2 Test Data Streams

I don't yet have any program that produces JBIG2 files, but some sample data streams are available from here. There is also a small but working ASCII JBIG2 example in section 3.3.6 of the PDF reference, which can be typed in and converted to binary, e.g. with some awk tool. It produces two letters 'C', stacked one above the other.
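The conversion of a typed-in example to binary can also be done in a few lines of C instead of awk. The following is only a minimal sketch, assuming the example has been typed in as plain hexadecimal digits separated by arbitrary whitespace; the function name hex_to_bytes is made up for illustration and is not part of the driver.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Convert ASCII hex text (whitespace ignored) into raw bytes.
 * Returns the number of bytes written, or -1 on a non-hex character,
 * an odd number of digits, or an output buffer that is too small.
 * (Hypothetical helper, not taken from the driver.) */
long hex_to_bytes(const char *text, unsigned char *out, size_t out_size)
{
    size_t n = 0;
    int high = -1;              /* pending high nibble, -1 if none */

    assert(text != NULL && out != NULL);
    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;
        int v;

        if (isspace(c))
            continue;
        v = hex_digit_value(c);
        if (v < 0)
            return -1;
        if (high < 0) {
            high = v;           /* first digit of a pair */
        } else {
            if (n >= out_size)
                return -1;
            out[n++] = (unsigned char)((high << 4) | v);
            high = -1;
        }
    }
    return (high < 0) ? (long)n : -1;
}
```

The resulting byte buffer can then be written to a file with fwrite() and used as a test data stream.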

PdfTeX Issues

JBIG2 files contain one or several pages (images), and optionally one global page (page0). This global page contains decoding information that can be referenced by one or more pages; this multiple use of the same information for several images is one reason for the high compression ratio achievable with the JBIG2 standard.

Pages are made up of segments; each segment has a page association. Segments reference other segments by their segment number. When an image is requested from a JBIG2 file by pdfTeX through the JBIG2 driver, and this image references some page0 segments, these page0 segments must also be included in the PDF file. This differs from the way e.g. JPEG files are handled:

One JBIG2 file may contain several images (pages).
Requesting an image from a JBIG2 file requires telling which page.
The selection of required page0 segments depends on the set of requested pages.

This multi-page approach is similar to PDF file inclusion; in fact, the same page option can be used.
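To make the segment structure more concrete, here is a minimal sketch of reading one segment header in C. The field layout follows my reading of the segment header syntax in the JBIG2 draft standard (segment number, flags, referred-to segments, page association, data length) and should be checked against the standard before relying on it. The names seg_header, parse_segment_header and the cap MAX_REFS are made up for this illustration and are not taken from writejbig2.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REFS 16             /* arbitrary cap for this sketch */

typedef struct {
    uint32_t number;            /* segment number */
    unsigned type;              /* segment type, bits 0..5 of the flags */
    uint32_t page;              /* page association, 0 = page0 (global) */
    uint32_t data_length;       /* 0xFFFFFFFF would mean "unknown" */
    uint32_t ref_count;         /* number of referred-to segments */
    uint32_t refs[MAX_REFS];    /* referred-to segment numbers */
    size_t header_length;       /* bytes consumed by the header */
} seg_header;

/* Read an nbytes-wide big-endian unsigned integer. */
static uint32_t read_be(const unsigned char *p, size_t nbytes)
{
    uint32_t v = 0;
    while (nbytes-- > 0)
        v = (v << 8) | *p++;
    return v;
}

/* Returns 0 on success, -1 if the buffer is too short or the segment
 * refers to more than MAX_REFS other segments. */
int parse_segment_header(const unsigned char *buf, size_t len, seg_header *h)
{
    size_t pos, ref_size, page_size, i;
    unsigned flags;

    assert(buf != NULL && h != NULL);
    if (len < 6)
        return -1;
    h->number = read_be(buf, 4);
    flags = buf[4];
    h->type = flags & 0x3f;
    page_size = (flags & 0x40) ? 4 : 1;   /* bit 6: 4-byte page number */
    pos = 5;

    if ((buf[pos] >> 5) == 7) {
        /* long form: 29-bit count, then ceil((count + 1) / 8) flag bytes */
        if (len - pos < 4)
            return -1;
        h->ref_count = read_be(buf + pos, 4) & 0x1fffffff;
        if (h->ref_count > MAX_REFS)
            return -1;
        pos += 4 + (h->ref_count + 8) / 8;
    } else {
        /* short form: count in the top three bits of a single byte */
        h->ref_count = buf[pos] >> 5;
        pos += 1;
    }

    /* the width of each reference depends on this segment's own number */
    ref_size = h->number <= 256 ? 1 : (h->number <= 65536 ? 2 : 4);
    if (pos > len || len - pos < h->ref_count * ref_size + page_size + 4)
        return -1;
    for (i = 0; i < h->ref_count; i++, pos += ref_size)
        h->refs[i] = read_be(buf + pos, ref_size);
    h->page = read_be(buf + pos, page_size);
    pos += page_size;
    h->data_length = read_be(buf + pos, 4);
    h->header_length = pos + 4;
    return 0;
}
```

A segment with page == 0 belongs to the global page; the refs array is what ties a page segment to the page0 segments it needs.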

The Experimental Driver

The approach I currently use is to have all bookkeeping about required page0 segments only within the JBIG2 driver. The main reasons are:

Keep the interface to pdfTeX narrow; try not to disturb pdfTeX.
Use C as language.

The current driver registers and parses a new JBIG2 file when the first image is requested from it in the read_jbig2_info() phase, and per-file as well as per-image information (width, height, etc.) is stored in structures. At the end of the file scan, search trees for pages and page0 segments are built, to allow quick finding of pages in the later write phase, and then quick referencing of page0 segments. Further requests through read_jbig2_info() from the same file (a file with the same name) do not re-read the file, but use the stored information instead.
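The bookkeeping idea can be sketched as follows. This is not the code from writejbig2.c: all names are invented, the file list is a plain linked list keyed by file name, and the search tree for pages is replaced by a sorted array with qsort() and bsearch() from the standard C library, simply to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long pagenum;
    unsigned width, height;
} page_info;

typedef struct file_info {
    char *name;
    struct file_info *next;
} file_info;

static file_info *file_list = NULL;     /* all files registered so far */

/* Return the record for a file name, creating it on the first request.
 * A real driver would scan the file at the point marked below. */
file_info *lookup_or_register(const char *name)
{
    file_info *f;

    assert(name != NULL);
    for (f = file_list; f != NULL; f = f->next)
        if (strcmp(f->name, name) == 0)
            return f;                   /* already known: no re-read */
    f = malloc(sizeof *f);
    if (f == NULL)
        return NULL;
    f->name = malloc(strlen(name) + 1);
    if (f->name == NULL) {
        free(f);
        return NULL;
    }
    strcpy(f->name, name);
    /* ... the one and only scan of the file would happen here ... */
    f->next = file_list;
    file_list = f;
    return f;
}

static int cmp_page(const void *a, const void *b)
{
    const page_info *pa = a, *pb = b;
    return (pa->pagenum > pb->pagenum) - (pa->pagenum < pb->pagenum);
}

/* Sort once at the end of the file scan ... */
void sort_pages(page_info *pages, size_t n)
{
    qsort(pages, n, sizeof *pages, cmp_page);
}

/* ... so that the write phase can find a page in O(log n). */
page_info *find_page(page_info *pages, size_t n, unsigned long pagenum)
{
    page_info key;

    key.pagenum = pagenum;
    key.width = key.height = 0;
    return bsearch(&key, pages, n, sizeof *pages, cmp_page);
}
```

A sorted array only works when all pages are known before the first lookup; a balanced tree would also allow inserting while the file is still being scanned.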

In the write_jbig2() phase of an image, the segments of the requested page are written out, and referenced page0 segments are recursively marked as required (but the page0 segments are not yet written). This gives the minimum number of page0 segments in the PDF file if only a subset of pages is extracted from a given JBIG2 file.
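The recursive marking can be illustrated with a small sketch. The structure and function names are hypothetical, segments are kept in a flat array and found by linear search, and each segment refers to at most MAX_REFS others; the real driver uses its own structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REFS 4              /* arbitrary cap for this sketch */

typedef struct {
    unsigned long number;       /* segment number */
    unsigned long page;         /* 0 for page0 (global) segments */
    size_t ref_count;
    unsigned long refs[MAX_REFS];
    int required;               /* set when a requested page needs it */
} segment;

static segment *find_segment(segment *segs, size_t n, unsigned long number)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (segs[i].number == number)
            return &segs[i];
    return NULL;
}

/* Mark every page0 segment that s refers to, directly or indirectly. */
static void mark_refs(segment *segs, size_t n, const segment *s)
{
    size_t i;
    for (i = 0; i < s->ref_count; i++) {
        segment *t = find_segment(segs, n, s->refs[i]);
        if (t == NULL || t->page != 0 || t->required)
            continue;           /* unknown, not global, or already done */
        t->required = 1;
        mark_refs(segs, n, t);  /* a global segment may refer to others */
    }
}

/* Called when the given page is written out. */
void mark_page0_for_page(segment *segs, size_t n, unsigned long page)
{
    size_t i;

    assert(segs != NULL && page != 0);
    for (i = 0; i < n; i++)
        if (segs[i].page == page)
            mark_refs(segs, n, &segs[i]);
}
```

After all requested pages have been processed, exactly the page0 segments with required set would go into the JBIG2Globals stream; page0 segments used only by pages that were never requested stay out.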

If a JBIG2 file does not contain page0 segments, no JBIG2Globals PDF object is created. If there are page0 segments in the JBIG2 file, a JBIG2Globals object number is reserved in the write_jbig2() phase. If at the end of image inclusion from that JBIG2 file no page0 segment has been referenced, an empty JBIG2Globals stream will be the result.

No JBIG2Globals stream is written out until the very end; the user might still include more images from the same JBIG2 file, which might also change the number of required page0 segments and therefore the JBIG2Globals object of that JBIG2 file. Only when the PDF file is finalized is there a single call from the pdfTeX core (chunk 765) to the function flushjbig2page0objects(). This function steps through the list of accessed JBIG2 files and writes out any pending JBIG2Globals PDF objects.
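The deferred writing could look roughly like the sketch below. It only models the control flow: the structure, the function names and the object-number counter are assumptions for illustration, and the actual writing of the stream is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

typedef struct jbig2_file {
    const char *name;
    int has_page0;              /* file contains page0 segments */
    int globals_objnum;         /* reserved PDF object number, 0 = none */
    int globals_written;        /* JBIG2Globals stream already flushed */
    struct jbig2_file *next;
} jbig2_file;

/* Write phase: reserve an object number the first time an image from a
 * file with page0 segments is written; later images reuse it. Returns
 * the object number, or 0 if the file needs no JBIG2Globals object. */
int reserve_globals(jbig2_file *f, int *next_objnum)
{
    assert(f != NULL && next_objnum != NULL);
    if (f->has_page0 && f->globals_objnum == 0)
        f->globals_objnum = (*next_objnum)++;
    return f->globals_objnum;
}

/* Finalization: walk the list of accessed files once and flush every
 * pending JBIG2Globals object. Returns the number of objects flushed. */
int flush_pending_globals(jbig2_file *list)
{
    jbig2_file *f;
    int written = 0;

    for (f = list; f != NULL; f = f->next) {
        if (f->globals_objnum != 0 && !f->globals_written) {
            /* a real driver would emit the required page0 segments
             * as the stream of object globals_objnum here */
            f->globals_written = 1;
            written++;
        }
    }
    return written;
}
```

Reserving the number early lets each image dictionary refer to the JBIG2Globals object before its contents are known.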

I have experimented with JBIG2 image inclusion in PDF streams generated by pdfTeX as part of the teTeX bundle, using the currently freshest beta version (teTeX-src-beta-20021225.tar.gz). The experimental driver is writejbig2.c, which I put into the pdftexdir directory of the teTeX tree on my Linux PC (Debian 3.0r1), together with the other drivers. A few other files required patching, just to add JBIG2 things analogous to the already existing JPEG things. Here is the list of new/patched files, all in the subdirectory pdftexdir:

writejbig2.c The JBIG2 driver code.
writejbig2.h The header file for the JBIG2 driver.
writeimg.c JBIG2 additions to readimage(), writeimage(), and deleteimage().
image.h Added struct JBIG2_IMAGE_INFO, macro IMAGE_TYPE_JBIG2, and macro jbig2_ptr(N).
Makefile Added target writejbig2.o. These changes must be made by hand, after the Makefile has been automatically generated by configure.
File pdftex.ch must include a call to the parameterless function flush_jbig2_page0_objects, preferably in chunk 765.
File ptexlib.h requires a declaration: extern void flushjbig2page0objects(void);

The JBIG2 pictures must have the ending '.jb2' or '.jbig2'.
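A check of this kind can be written in a few lines. The sketch below is an assumption about how such a test might look, not the code used in writeimg.c; in particular it compares case-sensitively, so '.JB2' would not be accepted.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the file name ends in ".jb2" or ".jbig2", else 0.
 * (Hypothetical helper; the comparison is case-sensitive.) */
int has_jbig2_extension(const char *filename)
{
    const char *dot;

    assert(filename != NULL);
    dot = strrchr(filename, '.');       /* last dot in the name */
    if (dot == NULL)
        return 0;
    return strcmp(dot, ".jb2") == 0 || strcmp(dot, ".jbig2") == 0;
}
```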

Experimenting

I could test the driver on the roughly 28 available JBIG2 files, which are of the following types:

non-striped sequential
non-striped random-access
striped random-access

The driver could process all three types.
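Whether a file is sequential or random-access can be read from the file header. The sketch below follows my understanding of the file header in the JBIG2 draft standard (an 8-byte ID string, one flags byte, and a 4-byte page count if the count is known); the structure and function names are invented for illustration. Striping is not visible at this level, since it is a property of the individual pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int sequential;             /* 1 = sequential, 0 = random-access */
    int pages_known;            /* 1 if the header states a page count */
    uint32_t num_pages;         /* valid only if pages_known is set */
    size_t header_length;       /* 9 or 13 bytes */
} file_header;

/* Returns 0 on success, -1 if the ID string is wrong or the buffer is
 * too short. */
int parse_file_header(const unsigned char *buf, size_t len, file_header *h)
{
    static const unsigned char id[8] =
        {0x97, 0x4A, 0x42, 0x32, 0x0D, 0x0A, 0x1A, 0x0A};

    assert(buf != NULL && h != NULL);
    if (len < 9 || memcmp(buf, id, sizeof id) != 0)
        return -1;
    h->sequential = buf[8] & 0x01;              /* flags bit 0 */
    h->pages_known = (buf[8] & 0x02) ? 0 : 1;   /* flags bit 1 */
    h->num_pages = 0;
    h->header_length = 9;
    if (h->pages_known) {
        if (len < 13)
            return -1;
        h->num_pages = ((uint32_t)buf[9] << 24) | ((uint32_t)buf[10] << 16)
                     | ((uint32_t)buf[11] << 8) | (uint32_t)buf[12];
        h->header_length = 13;
    }
    return 0;
}
```

In a sequential file each segment header is followed directly by its data, while in a random-access file all segment headers come first, so this flag decides how the file scan has to step through the segments.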

The fresh Linux Acroread, Version x86 linux 5.05 Apr 25 2002, still chokes on one file, 042_13.jbig2, from the above mentioned set with info 'Bad error code'. No idea why.

Xpdf 2.01 has minor problems with some files (reported). The origin of the problems is unclear.

TODOs

Replace all the hacked list/tree stuff with a professional solution like GNU libavl, or tavl from the C Users Group. AVL trees would allow things to be sorted already during the read_jbig2_info() phase.
Determining the segment data length (section 7.2.7 of the JBIG2 draft standard) by detecting two-byte sequences is not supported.
Improve error checking. There is only marginal checking of .jbig2 file validity. pdfTeX might crash completely on a corrupted file.
Free all structures that are no longer required. Currently nothing is freed until program end.
Any ideas about this? Currently the JBIG2 file is opened several times in various phases, and information is read from it each time. An alternative would be to read the entire JBIG2 file into a set of structures only once and work from there. This would be much faster, but would require much more memory.

End Remark

Help, advice, and criticism are greatly welcome!

This page was first put online on 8 January 2003. Here is a link to my first try on a JBIG2 driver.