… the Kindle is a “roach motel” device: its license terms and DRM ensure that books can check in, but they can't check out.

For my own reference, if nothing else, I want to record all my tricks for using calibre to manage my ebooks.

DRM

Despite the quote above, it is possible to check a DRM protected book out of a Kindle :-). See Apprentice Alf's Blog for the pointer to the scripts and calibre plugins which can do the DRM removal. Once the calibre plugins are installed according to the READMEs and the kindle key configured, it is merely a question of doing an Add Books operation in calibre to decrypt the kindle .azw files into unprotected .mobi files.

The other source of DRM protected books I buy which need DRM removal is Google Play Books. They are a slightly bigger pain in the patoot to fix up; this blog points the way. For this I need to run a Windows XP virtual machine with a copy of Adobe Digital Editions on it. On the Google Play Books web site, I visit the entry for a book I have bought, go to the How To Read tab, and scroll down near the bottom where I can download an epub. That doesn't actually download an epub file; instead it downloads an acsm file that points to the actual book. You can't open that file in Adobe Digital Editions, but you can drag and drop it onto Adobe Digital Editions, and it then downloads the actual epub (still DRM protected). Now you can run some magic python scripts to remove the DRM and produce an unprotected epub file (which can then be added to calibre). Once it is safe in the calibre library, you can remove the epub and the acsm file so as not to clutter up the virtual machine more than necessary.

Those are the two book sources I use, so I'm a happy camper. No doubt there are other DRM removal tools for other book sources as well. A google search will probably turn them up.

Why remove DRM?

First I should note that DRM removal on stuff I bought and paid for so I can use it as intended (like reading it) is not illegal. I'm not advocating pirating books here.

If you use different online bookstores, you wind up being forced to organize your books primarily by where you bought them. This is the single least useful way of organizing books that I can imagine. If you remove the DRM, you can read all your books everywhere, so you can organize them by useful criteria.

With DRM protected books, you are stuck with the readers provided by that bookstore. The Google Play Books reader app, for instance, is particularly mediocre, but with DRM protected books you are forced to use it anyway. With unprotected books, you can use any reader app that supports the book format, and you can try lots of them to find the one that suits you best (for the moment, I've settled on FB Reader for Android).

With unprotected books, you aren't actually stuck on one format since calibre has conversion tools which work quite well most of the time to convert from one format to another.

With unprotected books, you can edit the metadata like the author's name (so, for example, H. Rider Haggard is spelled the same in every book and properly recognized as the same author). I have even bought books with the author name completely missing from the metadata. It is nice to be able to correct this stuff so you can search properly.

Aside from the metadata, you can also edit the book itself if you want to delve into it to fix strange typos, etc.

Finally, if the publisher goes out of business or decides to withdraw the digital edition, you may never be able to download a fresh copy if (for instance) your reader breaks and you need to replace it. With a DRM free copy that can be backed up and read anywhere, you don't have to worry about this.

I doubt these are the only reasons, but they are the ones that convince me it is worth the trouble to remove DRM. (Plus, trying to think of a reason to leave the DRM in place is really hard :-).

OPDS

The calibre2opds tool is very handy for making my collection of books in calibre available on my LAN to any reader that can talk to an opds catalog server (FB Reader can do that, so that is one of the reasons I like it). Once I get a book free from DRM and added to calibre, it is trivial to get it on any reader that can talk opds or just download books from a web browser (both interfaces are constructed by calibre2opds).

It was a bit tricky to get set up, so here's my procedure as an example:

To get started, you need to be positioned in the calibre ebooks directory and run the rungui.sh script to get a big GUI panel where you can fill in lots of options you won't understand at all :-). Turns out that (for me) I needed to check the boxes telling it not to make any resized or thumbnail images from the book covers. For some reason it seems to have problems with some of my covers and I get black boxes instead of scaled images.

On my Fedora desktop system where I run all this stuff, my calibre ebook folder is named /zooty/ebooks/. I run this script to update the opds index after I have added or changed books:

do-update-opds

#!/bin/bash
#
exec > /home/tom/calibre2opds.log 2>&1
cd /zooty/ebooks/
echo 1 | sh /home/tom/Calibre2opds/run.sh

For some reason the run.sh script insists on asking me a question I need to answer with 1, hence the echo piped into it. In any case, that winds up generating a _catalog subdirectory with an index.xml file and an index.html file. Opds clients (like FB Reader) need a URL that points to the index.xml file, and human browser users need to point to the index.html file.

So that means I need to get this directory available under the apache web server I run on my system. I do that with this bit of config file:

/etc/httpd/conf.d/calibre.conf

Alias /ebooks /zooty/ebooks
<Directory /zooty/ebooks>
   Options Indexes
   Order allow,deny
   Allow from all
</Directory>

Within my local LAN, my desktop is named zooty.my.lan, so (after restarting httpd so it sees the new config info), I can point FB Reader at http://zooty.my.lan/ebooks/_catalog/index.xml, or I can browse to http://zooty.my.lan/ebooks/_catalog/index.html from any web browser inside my LAN. I can even get to my books remotely (and securely) by using ssh forwarding of the web server.
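
For the remote case, the ssh forwarding is just a standard local port forward. A minimal sketch, assuming zooty (or a gateway that can reach it) accepts ssh connections from wherever I happen to be:

ssh -L 8080:localhost:80 zooty.my.lan
# then browse http://localhost:8080/ebooks/_catalog/index.html on the local machine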

Image Backgrounds

I'm also slowly figuring out how to tweak the ebooks to make them work better in the readers. One nice tweak is converting grayscale jpg images to png images with a transparent background, so that they aren't quite so annoying to look at in a reader where you have changed the background to some color other than white. For example, here's a bit of the Game of Thrones title page jpg as it renders in a page with a wheat background:

Load that image into gimp, go to Colors > Color to Alpha... and convert white to alpha, and export the result to a png file, and you get:

And you can choose a different background color with the same image and it works just as well:

Or even use the original white background:

I've been trying to find out (in this thread) how to get ImageMagick to do the same thing, but the results have been kind of washed out. It would be nice to figure out how to achieve this from the command line so I could automate it in an ebook image converter tool.

Since ImageMagick isn't looking good, I've been experimenting with gimp batch mode (using code I mostly stole from here), and managed to painfully produce the same sample image from the command line:

Here's my modified script:

~/.gimp-2.8/scripts/batch-set-alpha.scm

(define (batch-set-alpha pattern)
   (let* ((filelist (cadr (file-glob pattern 1))))
      (while (not (null? filelist))
         (let* ((filename (car filelist))
                ;; 1 = RUN-NONINTERACTIVE for all the calls below
                (image (car (gimp-file-load 1 filename filename)))
                (drawable (car (gimp-image-get-active-layer image))))
            ;; turn white into alpha so the page background shows through
            (plug-in-colortoalpha 1 image drawable '(255 255 255))
            ;; save with .png appended to the original file name
            (gimp-file-save 1 image drawable
                            (string-append filename ".png")
                            (string-append filename ".png"))
            (gimp-image-delete image))
         (set! filelist (cdr filelist)))))

Which I can run like so:

gimp -i -b '(batch-set-alpha "sample-gray.jpeg" )' -b '(gimp-quit 0)'

The script just appends .png to the input file name, so in this case the new file will be sample-gray.jpeg.png (not wonderful, but easy to implement). The input name can be a wildcard pattern, so the batch script can do lots of files in a single run of gimp if given an argument like *.jpg.
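
For example, to do every jpg in the current directory in one gimp session (the pattern is quoted so that file-glob inside the script expands it, not the shell):

gimp -i -b '(batch-set-alpha "*.jpg")' -b '(gimp-quit 0)'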

That gets past the hard part of a general tool to fix an ebook; the next thing required is a way to recognize grayscale images and only convert them. This ImageMagick info will hopefully be what I need for that.
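
The test that ended up in the check_gray routine of the jpg-to-png script further down compares the image against a grayscale copy of itself and reports the maximum pixel difference; a standalone shell version looks roughly like this:

# prints a number from 0 to 65535 (on a Q16 ImageMagick build);
# near zero means the image is effectively grayscale even if stored as RGB
convert sample.jpg \( +clone -colorspace gray \) \
        -compose difference -composite -colorspace gray \
        -format '%[maximum]' info: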

Editing Books

The book conversion tools in calibre usually work pretty well, but if you want things perfect (especially the table of contents), you need to edit things manually. I've been learning the ins and outs of this while working on converting the largest book I have: The bundle of the 1st 4 volumes of Game of Thrones. I figure it should throw up just about every challenge I'll ever encounter, so if I can get it right, I can get anything right :-).

Removing the DRM on the kindle edition of the book leaves me with an old style mobi format file (apparently there is some new variation for newer kindle books, but I don't have that). If I use the Mobi-unpack plugin and the tidy tool to format and indent the xml and html files, I can poke around and see what looks like mostly correct stuff, especially all the links in the table of contents in the .ncx file. I can also run the kindlegen program downloaded from amazon to build a new format mobi file. It complains about some busted things, so I fix them. Mostly it doesn't seem to like blockquote tags inside paragraphs or lists, but if I whip out a couple of kludgy perl scripts to swap the tags around so the blockquote is outside, it is happy:

fix-block

#!/usr/bin/perl
#
# Read stdin, write stdout. Look for a line that starts with
# <p followed immediately by a line that starts with <blockquote
# and reverse the order of the lines.
#
# Likewise look for a line that starts with </blockquote followed
# immediately by a line that starts with </p and reverse those lines
# as well.
#
my $saveline;
my $savekind;
while (<>) {
   if (/^\s*\<p\b/) {
      if (defined($saveline)) {
         print $saveline;
      }
      $saveline=$_;
      $savekind=1;
   } elsif (/^\s*\<blockquote\b/) {
      if (defined($saveline) && ($savekind == 1)) {
         print $_;
         print $saveline;
         undef $saveline;
      } else {
         if (defined($saveline)) {
            print $saveline;
            undef $saveline;
         }
         print $_;
      }
   } elsif (/^\s*\<\/blockquote\b/) {
      if (defined($saveline)) {
         print $saveline;
      }
      $saveline=$_;
      $savekind=2;
   } elsif (/^\s*\<\/p\b/) {
      if (defined($saveline) && ($savekind == 2)) {
         print $_;
         print $saveline;
         undef $saveline;
      } else {
         if (defined($saveline)) {
            print $saveline;
            undef $saveline;
         }
         print $_;
      }
   } else {
      if (defined($saveline)) {
         print $saveline;
         undef $saveline;
      }
      print $_;
   }
}
if (defined($saveline)) {
   print $saveline;
   undef $saveline;
}

fix-ul

#!/usr/bin/perl
#
# Read stdin, write stdout. Look for a line that starts with
# <ul followed immediately by a line that starts with <blockquote
# and reverse the order of the lines.
#
# Likewise look for a line that starts with </blockquote followed
# immediately by a line that starts with </ul and reverse those lines
# as well.
#
my $saveline;
my $savekind;
while (<>) {
   if (/^\s*\<ul\b/) {
      if (defined($saveline)) {
         print $saveline;
      }
      $saveline=$_;
      $savekind=1;
   } elsif (/^\s*\<blockquote\b/) {
      if (defined($saveline) && ($savekind == 1)) {
         print $_;
         print $saveline;
         undef $saveline;
      } else {
         if (defined($saveline)) {
            print $saveline;
            undef $saveline;
         }
         print $_;
      }
   } elsif (/^\s*\<\/blockquote\b/) {
      if (defined($saveline)) {
         print $saveline;
      }
      $saveline=$_;
      $savekind=2;
   } elsif (/^\s*\<\/ul\b/) {
      if (defined($saveline) && ($savekind == 2)) {
         print $_;
         print $saveline;
         undef $saveline;
      } else {
         if (defined($saveline)) {
            print $saveline;
            undef $saveline;
         }
         print $_;
      }
   } else {
      if (defined($saveline)) {
         print $saveline;
         undef $saveline;
      }
      print $_;
   }
}
if (defined($saveline)) {
   print $saveline;
   undef $saveline;
}
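
For reference, here is roughly how those pieces fit together once the book has been unpacked. The file names and the exact tidy flags are illustrative, so treat this as a sketch rather than a recipe:

#!/bin/bash
# run in the directory produced by the Mobi-unpack plugin
for f in *.html; do
   tidy -i -q -m "$f"                           # indent so tags land on their own lines
   ./fix-block < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
   ./fix-ul    < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
kindlegen book.opf                              # book.opf is whatever .opf the unpack produced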

After running kindlegen with no errors, I add the resulting new mobi file back into calibre in place of the old file.

If I convert this to epub format, the text is perfectly readable if you just page through the book sequentially, but the page breaks are mostly all gone and replaced with random ones, and the table of contents links do not work very consistently. Some are fine, but others take me to slightly (or sometimes wildly) incorrect places. Quite often all the decorative chapter head images are in the previous chapter or are followed immediately by page breaks, so if the link takes you to the image, you can't see what chapter it is.

What I obviously want is a convert option that says: Hey! The .ncx file is perfect! Please believe it! Make everything it points at the start of a new chapter.

When I read the documentation for calibre conversion, it sounds to me like it is supposed to do that by default, but apparently not.

Anyway, while trying to figure out what was going on, I experimented with the debug option and read about tweaking the contents of the input directory and running the conversion starting with it. That sounded good to me, so I decided that if calibre wasn't going to use the toc.ncx file directly, I could get it to use it indirectly by adding markup to the html file to have the exact same information that is already in the .ncx file. Then I could tell the converter to match that markup to detect chapters. That idea resulted in yet another perl script:

markup-from-toc

#!/usr/bin/perl -w
#
# Since calibre conversion seems to ignore the perfectly valid .ncx file as
# far as splitting chapters properly, this script is designed to add markup
# to the html that mirrors the info in the .ncx file so I can tell the
# converter to recognize the new markup to auto generate the toc. (Sigh...)
#
# One argument is the .ncx file.
#
# This is a total hack - it does no xml parsing, but simply uses text
# matching and assumes the <text> tag always comes before the matching
# <content> tag (cheesy, but easy :-).
#
# When converting, use this XPath to match headers that are chapters,
# and Bob's 'yer Uncle!
#
# //*[(name()='h1' or name()='h2' or name()='h3') and @class = 'chapter']
#
# And to get multi-level chapters right, use these:
#
# //h:h1[re:test(@class, "chapter", "i")]
# //h:h2[re:test(@class, "chapter", "i")]
# //h:h3[re:test(@class, "chapter", "i")]

use strict;
use File::Basename;

my %files;
my %hdrs;

sub read_ncx {
   my $ncxfile = shift;
   my $fh;
   my $indent;
   my $text;
   my $file;
   my $marker;
   local $_;
   open($fh, '<', $ncxfile) || die "Cannot read $ncxfile : $!\n";
   while (<$fh>) {
      last if (/\<navMap\>/);
   }
   while (<$fh>) {
      if (/^(\s*)\<text\>(.*)\<\/text\>/) {
         if (defined($text)) {
            die "Apparently missed a matching content prior to text $2\n";
         }
         $indent = length($1);
         $text = $2;
      } elsif (/\<content\s*src\=\"(.*)\#(.*)\"/) {
         $file = $1;
         if (defined($marker)) {
            die "Apparently missed a matching text prior to marker $2\n";
         }
         $marker = $2;
         my $r;
         my $info;
         if (! exists($files{$file})) {
            $r = {};
            $files{$file} = $r;
         } else {
            $r = $files{$file};
         }
         $hdrs{$indent} = '';
         $info = {};
         $info->{'indent'} = $indent;
         $info->{'text'} = $text;
         $info->{'count'} = 0;
         $r->{$marker} = $info;
         undef($indent);
         undef($marker);
         undef($file);
         undef($text);
      } elsif (/\<\/navMap\>/) {
         last;
      }
   }
   $text = 1;
   foreach $indent (sort { $a <=> $b } (keys(%hdrs))) {
      print "indent $indent assigned header h$text\n";
      $hdrs{$indent} = "h$text";
      ++$text;
   }
}

sub fix_one_file {
   my $file = shift;
   my $newfile = "$file.new";
   my $r = shift;
   my $fh;
   my $oh;
   local $_;
   open($fh, '<', $file) || die "Cannot read $file : $!\n";
   open($oh, '>', $newfile) || die "Cannot write $newfile : $!\n";
   while (<$fh>) {
      if (/^(.*)\<a\s+id\=\"(.+)\"\>\s*\<\/a\>(.*)$/) {
         my $lead = $1;
         my $marker = $2;
         my $trail = $3;
         if (exists($r->{$marker})) {
            my $info = $r->{$marker};
            my $count = $info->{'count'};
            ++$count;
            $info->{'count'} = $count;
            if ($count > 1) {
               print "Hey! $marker is defined more than once!\n";
            } else {
               my $hdr = $hdrs{$info->{'indent'}};
               my $text = $info->{'text'};
               $_ = "${lead}<$hdr class=\"chapter\" title=\"$text\"><a id=\"$marker\"></a></$hdr>$trail\n";
            }
         }
      }
      print $oh $_;
   }
   close($fh);
   close($oh);
   unlink($file);
   rename($newfile, $file);
}

read_ncx($ARGV[0]);
my $file;
foreach $file (sort(keys(%files))) {
   fix_one_file(dirname($ARGV[0]) . "/" . $file, $files{$file});
}
foreach $file (sort(keys(%files))) {
   my $r = $files{$file};
   my $marker;
   foreach $marker (sort(keys(%$r))) {
      my $info = $r->{$marker};
      if ($info->{'count'} == 0) {
         print "Hey! marker $marker inf file $file was never defined!\n";
      }
   }
}
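
Running the script is just a matter of pointing it at the .ncx file sitting in that input directory (the path and file name here are illustrative):

perl markup-from-toc /tmp/debug-output/input/toc.ncx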

I ran that script to fix the html in the input directory of the debug output. I also manually fixed some other things I found like words that were run together that needed a space inserted and TOC anchors that appeared following the chapter header image instead of before it.

Then I cd'ed to the input directory and zipped up * to make a .zip file. In the edit metadata dialog I clicked the little add book icon to add that .zip file as another instance of the book. I can now convert the book to epub again and use the zip format as the input to the conversion, and it will pick up all my manual changes.
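
In other words, something along these lines (the paths and the zip name are illustrative):

cd /tmp/debug-output/input
zip -r ../got-bundle-fixed.zip *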

I run the conversion with these options and settings (a rough ebook-convert equivalent is sketched after the list):

  • Structure detection
    • Detect chapters XPath: //*[(name()='h1' or name()='h2' or name()='h3') and @class = 'chapter']
    • Chapter mark: pagebreak
    • Insert page breaks XPath: /
  • Table of Contents
    • Force auto generated TOC: on
    • Do not add detected chapters: off
    • Allow duplicate links: on
    • Number of links: 0 (disabled)
    • Level 1 TOC: //h:h1[re:test(@class, "chapter", "i")]
    • Level 2 TOC: //h:h2[re:test(@class, "chapter", "i")]
    • Level 3 TOC: //h:h3[re:test(@class, "chapter", "i")]
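
For the record, if I ever want to script this instead of clicking through the GUI, the same settings should map onto ebook-convert switches roughly as follows. I have only actually run this conversion from the GUI, and the file names are illustrative, so treat it as a sketch:

ebook-convert got-bundle.zip got-bundle.epub \
   --chapter "//*[(name()='h1' or name()='h2' or name()='h3') and @class = 'chapter']" \
   --chapter-mark pagebreak \
   --page-breaks-before "/" \
   --use-auto-toc \
   --duplicate-links-in-toc \
   --max-toc-links 0 \
   --level1-toc '//h:h1[re:test(@class, "chapter", "i")]' \
   --level2-toc '//h:h2[re:test(@class, "chapter", "i")]' \
   --level3-toc '//h:h3[re:test(@class, "chapter", "i")]'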

And miraculously I get a pretty much perfect result. All the TOC entries point to the right place. All the page breaks happen on each chapter, etc.

I can now use the Epub Split plugin to turn this bundle into 4 separate books. It works pretty well. The TOC got flattened, but that doesn't seem worth fixing. The thing that does need fixing is the image at the front. It is actually the cover, but the book doesn't know that.

This is where Sigil comes in handy. I open the book in Sigil, double click on the 1st image file to make sure it is indeed the one I want, and then right click on the file name, which brings up a menu I can use to set the image semantics. Clicking on the cover image checkbox and saving the epub file now gets me a file that knows that image is indeed the cover.

Now I can edit the metadata in calibre, downloading all the info from the internet and filling in the series name and number in each book. After that, the Modify Epub plugin can save all that metadata in the epub file itself, and I have a file almost ready to be downloaded and read on my Nexus 7 tablet.

But First! Let's finish up the script to change the .jpg files to .png files with transparent backgrounds instead of solid white. I found I needed to enhance the perl script to convert the jpg images to RGB or the color to alpha conversion would not work, but I have done that, and now the new script seems to be perfect:

jpg-to-png

#!/usr/bin/perl -w
#
# Script to scan a directory (presumed to be an exploded ebook of some kind)
# and replace all grayscale .jpeg and .jpg images with .png images that use
# transparency. The idea is that they will look identical on a white
# background, but actually take on the background color or image if you set
# something other than white in your ebook.
#
# Any text files in the directory are scanned for the names of the image
# files and the references are changed to point to the new .png files.
#
# No arguments, just run it while sitting in the exploded ebook directory.
#
use strict;

use File::Find;
use File::Basename;

my @jpgs;

my @txts;

my @opfs;

my $tmpdir="png$$";

sub check_gray {
   my $jpg = shift;
   my $fh;
   open($fh, '-|',
        'convert', $jpg,
        '(', '+clone', '-colorspace', 'gray', ')',
        '-compose', 'difference', '-composite', '-colorspace', 'gray',
        '-format', '%[maximum]', 'info:');
   my $factor = <$fh>;
   close($fh);
   chomp($factor);
   return ($factor=~/^\d+$/) && (($factor + 0) < 5000)
}

sub scanfile {
   if (-f $_) {
      # want plain files only, no directories also get rid of the leading ./
      my $dotfree = substr($File::Find::name,2);
      if ((/\.jpeg$/) || (/\.jpg$/)) {
         # This should be one of the .jpg files we are looking for.
         if (check_gray($_)) {
            push(@jpgs, $dotfree);
         } else {
            print "$dotfree is not grayscale\n";
         }
      } elsif (-T $_) {
         # This is some other file in the book which might contain
         # the name of one of the files we are going to change.
         push(@txts, $dotfree);
         if ($_=~/\.opf$/) {
            push(@opfs, $dotfree);
         }
      }
   }
}

find(\&scanfile, ".");

if (scalar(@jpgs) == 0) {
   print "No jpg images found.\n";
   exit(0);
}

sub fix_opfs {
   # Any .opf files may well have .png file references which claim
   # to have mime type image/jpeg. This final fix replaces those
   # image/jpeg mime types with image/png.
   my $opf = shift;
   my $temp = "tempopf.$$";
   my $inf;
   my $outf;
   my $update=0;
   local $_;
   if (open($inf, '<', $opf)) {
      if (open($outf, '>', $temp)) {
         while (<$inf>) {
            if (/.*item.*\.png.*image\/jpeg/) {
               s/image\/jpeg/image\/png/;
               $update=1;
            }
            print $outf $_;
         }
         close($outf);
      }
      close($inf);
   }
   if ($update) {
      unlink($opf);
      rename($temp,$opf);
   } else {
      unlink($temp);
   }
}

# Link all the grayscale jpg files into a temp directory

mkdir($tmpdir);
my $f;
foreach $f (@jpgs) {
   my $b = basename($f);

   # The gimp color to alpha filter only works on RGB images so make sure
   # any grayscale images get converted to RGB by using the convert command
   # to copy the RGB versions to the temp directory.

   system("convert", "$f", "-depth", "8", "-type", "TrueColor", "$tmpdir/$b");
}

# Convert all the files in the temp directory

chdir($tmpdir);
system("gimp", '-i', '-b', '(batch-set-alpha "*" )', '-b', '(gimp-quit 0)');
chdir("..");

# Now rename all the temp files

my @replacepats;

foreach $f (@jpgs) {
   my($b,$d,$s) = fileparse($f, qr/\.[^.]*/);
   if (-f "$tmpdir/${b}$s.png") {
      link("$tmpdir/${b}$s.png", "${d}$b.png");
      push(@replacepats, $f, "${d}$b.png");
      unlink($f);
      print "$f replaced with ${d}$b.png\n";
   } else {
      print "Hey! $f was not converted to a .png file!\n";
   }
}
system("rm -rf $tmpdir");

push(@replacepats, '--', @txts);

system("replace", @replacepats);

if (scalar(@opfs) > 0) {
   foreach $f (@opfs) {
      fix_opfs($f);
   }
}

I've been running it on my existing books by manually exploding the epub files with unzip, running the script while sitting in the OEBPS subdirectory, then zipping the modified directory back up into the epub file again. You should be able to use it with the tweak book tool as well (but don't try it in the DEBUG/input directory since calibre seems to eliminate the alpha layer during conversion and you wind up with just black blocks).
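
The unzip and re-zip dance looks roughly like this. The only wrinkle worth remembering is that the epub container wants the mimetype file to be the first entry in the archive and stored uncompressed (the book and script locations are illustrative):

unzip book.epub -d book
(cd book/OEBPS && ~/bin/jpg-to-png)           # run the converter in the exploded book
cd book
zip -X0 ../book-fixed.epub mimetype           # first entry, stored uncompressed
zip -rX9 ../book-fixed.epub META-INF OEBPS    # then everything else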

The effect is to transform this annoying look:

Into this much cleaner and more natural look:

These are reduced size screenshots taken on my Nexus 7 with FB Reader showing the book before and after I downloaded the new version.

I think my work here is (mostly) done. I actually got through the entire massive ebook getting it converted to 4 separate epub files that look good in the reader. One last annoying effect I'll need to tweak by hand shows up in the leading T where the html tries to make it a large letter. The FB Reader app mostly ignores that stuff, but it does wind up looking like a separate word when FB Reader renders the page. I'm sure I'll be finding irritants like that forever, but these automated scripts got things nearly perfect.

 