Already Seen? or a rant about emerging file types

Already Seen?

Yesterday I encountered a rather interesting problem. I was visiting Alex, and in conversation he complained about a new phenomenon in book and magazine distribution: the LizardTech djvu file format. He had downloaded an issue of “Penthouse” and wanted to extract a particular page out of it and print it onto a sticker that he could apply to the face of a burned CD. However, the file he had was in djvu format, and the stand-alone viewer he was using didn’t let him print. I agreed to take a look at the problem for him, and whipped out my trusty iBook….

In any case, Alex was viewing djvu files using the OpenDjVu viewer for Windows. While that viewer is stand-alone (and one can wonder about the purpose of “easy navigation with one hand from keyboard” considering that magazines like “Penthouse” and “Playboy” are distributed in this format), one of its shortcomings is the lack of print support. And of course one can’t export the djvu file as anything.

After looking at the LizardTech web site, I noticed that the only Mac OS X implementation they support is a Safari plugin. For reference, it installs into /Library/Internet Plug-Ins:

stany@fiona:/Library/Internet Plug-Ins[01:53 PM]$ ls -lad NPD*
-rwxr-xr-x  1 stany  admin  8966  2 Nov 01:51 NPDJVU
drwxr-xr-x  3 stany  admin   102  2 Nov 01:51 NPDjVu.plugin
stany@fiona:/Library/Internet Plug-Ins[01:53 PM]$ 

I subscribe to the original Mac software distribution model, where all the bits and pieces needed by a particular program live inside a single directory tree, making upgrading and pruning the file system simple: getting rid of a program should be rm -rf /opt/program_name or rm -rf "/Applications/Program Name.app". This method was field tested over and over again on literally tens of Solaris systems I’ve administered, allowing me to compile once and copy everywhere, and while somewhat wasteful of disk space, it seems to be rather effective. So I was reluctant to install the plug-in.

No problem, there is a djvulibre implementation that is designed to run under X. Downloaded that, looked through the INSTALL file. It depends on qt-X11 and libtiff. Compiled libtiff. Remembered that there is now such a thing as qt-mac, and downloaded that. Attempted to follow the installation instructions for qt-mac (uncompress into /Developer/, rename qt-mac-3.x.x.x to /Developer/qt/, run configure, followed by make). The build took over an hour and generated tons of object files, with no end in sight. Realized that I don’t really need the fancy graphic utilities that are part of djvulibre, and all I need is djvups. Compiled that.

Ran it on a 5.7 meg .djvu file, redirecting the output into a new file that was supposed to be PostScript. It ran for 15 minutes and wrote over a gigabyte before it ran out of disk space. Cursed a lot. Copied the .djvu file to a Solaris system with much more RAM and disk space, and let it run there for a while. Half an hour later I had a 1.6 gigabyte PS file. Fed the resulting PS file into Adobe Distiller 4.0 that came with Adobe FrameMaker 6.0. It ran for over 2 hours on a 400 MHz USII CPU and used up over 450 megabytes of real RAM. The end result was a 7.5 megabyte PDF file.

I opened the resulting file and attempted to zoom in on some text, only to find that the result was not readable. At this point I suspect that the fault is with Adobe Distiller, and that I should have specified that I wanted print resolution rather than the default 72dpi.

Figure: pixelation in the 7 meg PDF output at 200% magnification.
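To see why the output resolution matters so much, here is a rough Python sketch of the pixel budget for a single page. The page size (US Letter, 8.5 x 11 inches) and the 300dpi print figure are assumptions picked for illustration; the 72dpi figure is the default suspected above, not something verified against Distiller's settings.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rasterized at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: US Letter. 300 dpi is an illustrative print resolution.
screen_w, screen_h = page_pixels(8.5, 11, 72)
print_w, print_h = page_pixels(8.5, 11, 300)

print(f"72 dpi:  {screen_w} x {screen_h} pixels")
print(f"300 dpi: {print_w} x {print_h} pixels")
print(f"pixel ratio: {(print_w * print_h) / (screen_w * screen_h):.1f}x")
```

Under those assumptions a 72dpi page gets 612 x 792 pixels, about a seventeenth of what 300dpi would give it, so small magazine text has very little to work with once you zoom in.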

By this point it’s 2 in the morning, and Alex and I are both cursing a lot.

So I eventually gave in, installed the Safari plugin, and restarted Safari. In the README for it, there is a little note that says that viewing local djvu files is not supported. No problem, copy the file into the Apache directory, access it over HTTP. No workie, Safari is showing a bunch of binary garbage. More cursing. At this point I go into Help -> Installed Plug-ins, and notice that while the djvu plugin is installed and enabled, it is set as a handler for only a number of particular MIME types. (Subject of another RANT: who still uses MIME types that are essentially extension based, instead of using something similar to the file(1) magic database to identify what kind of file one is handling regardless of extension, resource fork, etc.? Get with the program, folks.) So a quick edit of httpd.conf to add AddType image/x-dejavu .djvu, and a quick apachectl restart later, the plugin was displaying djvu files.
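For what it's worth, the content-based identification that the parenthetical asks for is not much code. Here is a minimal sketch in Python, assuming the usual DjVu container layout: the bytes AT&T, then an IFF FORM chunk whose type is DJVU for a single page or DJVM for a multi-page bundle. The function names are made up for illustration and are not part of the plugin, Apache, or file(1).

```python
# Sketch only: identify a DjVu file by its leading bytes, file(1)-style,
# instead of trusting the extension. Function names are hypothetical.

DJVU_MAGIC = b"AT&TFORM"              # "AT&T" prefix followed by the IFF "FORM" tag
DJVU_FORM_TYPES = (b"DJVU", b"DJVM")  # single-page document, multi-page bundle


def looks_like_djvu(header: bytes) -> bool:
    """Return True if the first 16 bytes look like a DjVu container."""
    return (
        len(header) >= 16
        and header[:8] == DJVU_MAGIC
        and header[12:16] in DJVU_FORM_TYPES  # bytes 8..11 hold the chunk length
    )


def file_is_djvu(path: str) -> bool:
    """Read just enough of the file at `path` to run the check above."""
    with open(path, "rb") as handle:
        return looks_like_djvu(handle.read(16))
```

A server that sniffed content this way could hand the plugin the right type no matter what the file happened to be called.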

At this point printing was working. So I just printed the thing to PDF using normal Panther PDF support. Another 10 minutes later, I had a 215 meg file. Opening it and zooming in actually showed readable text, so I started up Windows File Sharing and copied it over to his system.

Figure: the 215 meg PDF output generated from Panther at 200% magnification.

So a good 4 hours later, Alex had his 5.7 meg djvu file as a 215 meg PDF file that he could actually use.

Here are the file sizes:

-rwxr--r--   1 stany    staff    5616972 Nov  2 00:37 penthouse_11.djvu
Original
-rw-r--r--   1 stany    staff    1608552487 Nov  2 00:59 penthouse_11.ps 
Generated with djvups with above as input
-rw-r--r--   1 stany    staff    7686172 Nov  2 03:11 penthouse_11.pdf
Generated by Adobe Distiller
-r-xr-xr-x  1 root      wheel  221127535  1 Nov 21:22 penthouse_11.pdf
Output of Safari djvu viewer printed to PDF
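
To put the listing in perspective, here is a small Python sketch that computes how much each output grew relative to the original. The byte counts are copied from the listing above; the dictionary labels are just names chosen for this example.

```python
# Byte counts copied from the ls output above; the labels are illustrative.
ORIGINAL_DJVU = 5616972

outputs = {
    "PostScript from djvups": 1608552487,
    "PDF from Adobe Distiller": 7686172,
    "PDF printed from Safari": 221127535,
}

# Growth factor of each output relative to the original djvu file.
ratios = {label: size / ORIGINAL_DJVU for label, size in outputs.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1f}x the size of the djvu")
```

That works out to roughly 286x for the PostScript, 1.4x for the Distiller PDF, and 39x for the PDF printed from Safari.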

So here is the rant part:

If you create a new file format, no matter how cool, new, and advanced you think it is, please consider that there might be people who want to do things to the data in the file that you never anticipated. So even if you are 100% sure that your file format is the best thing since chocolate and everyone on the planet should adopt it, please keep in mind that there are specialized applications written to solve particular problems, and that those applications support file formats that are existing standards, not emergent ones. Thus, if you want your file format to be successful, do take care to write and make publicly available filters that convert not only to your file format but from it as well. This way people will not perceive that you are attempting to lock them into your design, but instead that your idea has genuine merit and technological innovation.

It seems that djvu has some benefits, but primarily folks prefer it for distribution of images because it generates smaller files than PDF. However, that is not exactly true. After some research on the internet, I believe that djvu generates smaller files only if the source material is a series of images. The moment you run any sort of OCR on the images and distill the result, you will get a significantly smaller PDF file.

So djvu is a file format for folks who either need to distribute high quality image galleries as single files, or who do not have access to OCR software (or do not want to proofread). Without a text layer, it is also not searchable or indexable.

Obviously neither the folks at LizardTech nor the numerous open, free, and libre implementers of djvu support, or even think about, such a novel and advanced concept as export to JPG page by page. The resolution that is generated by djvu is obviously high enough that the generated images can be successfully OCRed using commercial off-the-shelf software. However, right now, in order to do something like that, one has to first dump djvu to PDF and then convert the PDF to JPG page by page, with noise and compression artifacts introduced at the extra step. The lack of generally available tools to modify the files also seems to hinder djvu adoption as a mainstream file format.

It is often said that VHS won largely due to the efforts of pornographers, who chose the simpler if technologically inferior VHS over Beta. It seems that LizardTech is betting on a similar vector, as currently it is the warez community that seems to use the djvu file format, due to its smaller size and thus lighter load on their servers. While I am not 100% certain that this is an ideal business plan, there of course might be some merit to it.

To quote MasaManiA.com, “You must agree my logical thinking and natural fear” (Note: Potentially NSFW but generally hilarious nevertheless).