In the last part, we defined some Haskell foreign functions for the libarchive library. Now we'll get down to the actual business of translating a few Racket routines into Haskell, redesign them a bit, and ultimately show how lazy IO can be a very nice way to interact with this library.
Let's begin by translating the archive-read-current-file
function that we saw previously:
(define (archive-read-current-file archive entry-size) (let ([buffer (malloc 'atomic entry-size)]) (memset buffer 0 entry-size) (archive_read_data archive buffer entry-size) (make-sized-byte-string buffer entry-size)))
Here, the low-level archive_read_data
native method wants to be passed a buffer to write to. In the last section, we did this by making archiveReadData
allow it to write to a ByteString
value (which can be mutated as you can see, if you call the right "unsafe" operations, e.g. unsafeUseAsCStringLen
). So now our problem is simply to provide an appropriately sized ByteString
for the Haskell version of this routine to write to. We can do this with Data.ByteString.replicate
(which like malloc
and memset
). Then based on the actual number of bytes read, we can trim the ByteString
to the exact size with take:
archiveReadCurrentFile :: Archive -> Int -> IO ByteString archiveReadCurrentFile ar sz = do let bs = replicate sz 0 cnt <- archiveReadData ar bs return $ take cnt bs
This probably isn't the most efficient way to do this. A better API might use Data.ByteString.Internal.createAndTrim
which avoids making the final copy that take
makes. But this would require redesigning archiveReadData
to take a Ptr Word8
to write to instead, or simply calling archive_read_data
directly. Something we can revisit later.
Next, let's translate the archive-read-next-file-entry
function:
(define (archive-read-next-file-entry archive) (let-values ([(entry ret) (archive_read_next_header archive)]) (if (= ret ARCHIVE_OK) (archive-entry (archive_entry_pathname entry) (archive_entry_size entry) (rewrite-file-mode (archive_entry_filetype entry))) (if (= ret ARCHIVE_EOF) #f (raise (list "error" ret))))))
This is pretty straightforward. It simply calls the archive_read_next_header
function, checks the result, and builds a Racket archive-entry
structure (note the dash instead of underscore in the name) from 3 parts of the C archive_entry
struct that's returned. Now, I don't know why it's actually necessary to build another structure containing the archive_entry
components (perhaps because Alex wanted to pattern match against this structure?), but in Haskell, I'll just rely on the native methods we designed earlier to access the components of the archive_entry
:
archiveReadNextFileEntry :: Archive -> IO ArchiveEntry archiveReadNextFileEntry ar = do rv <- archiveReadNextHeader ar case rv of Left err -> error (show err) Right entry -> return entry
Earlier we had made archiveReadNextHeader
return a result of type Either Result ArchiveEntry
. And if you recall, an Either
value has Left
and Right
constructors that are used to distinguish either of 2 cases. Usually the Left
case is used to specify an error result, and the Right
case is for the common, non-error result. We could have returned a pair, (Result, Maybe ArchiveEntry)
, rather like the Racket code has done, but Either
is a slightly cleaner way to do the same thing.
Now rather than returning our Either
result, we want this routine to throw an exception for the error case, and simply return the ArchiveEntry
for the common case. The case
statement accomplishes this. Also, if you recall from earlier, we would like to be able to catch these exceptions, so we're going to swap out the error
function with ioError
and userError
. And one more thing: The whole routine can be simplified by using the either
function to say what to do in either of the 2 cases:
archiveReadNextFileEntry :: Archive -> IO ArchiveEntry archiveReadNextFileEntry ar = archiveReadNextHeader ar >>= either (ioError . userError . show) return
This says: Call the archiveReadNextHeader
function (that's in the IO monad), and pass its result to the either
function (rather like a unix pipe). Then either
takes 2 parameters: the first says if you get a Left
value, call the composed function (ioError . userError . show)
on it (remember, this is equivalent to (\left -> ioError (userError (show left)))
-- yes the dots seem to work backwards from most languages we're familiar with). In the Right
case, simply call return on the value.
Folding archives
Now we get to the fun part. Alex thought that a good API for visiting the elements of an archive file would be to fold over them. A fold is like a map that passes a piece of data long for the ride -- an accumulator value that it can change or replace as each file is visited. (When the accumulator is a list, you have a map operation exactly.) Here's the Racket fold routine:
(define (archive-folder archive proc init-val) (match (archive-read-next-file-entry archive) [(archive-entry path size filetype) (archive-folder archive proc (apply proc (list (archive-entry path size filetype) (lambda () (archive-read-current-file archive size)) init-val)))] [#f init-val]))
This calls archive-read-next-file-entry
, and when it returns an archive-entry
the user's proc
is called to handle it's data. The proc
is passed the archive-entry
(actually a copy of it -- tisk, tisk), a function that can be called to read the current file (a thunk -- a nullary function that's used when we want to delay evaluation -- rather like Haskell's laziness, don't you think?), and the accumulator value, init-val
. The result of the user's proc
becomes the new accumulator that's passed to a recursive call to archive-folder
which will go on to read the next entry.
In Haskell, we can write this almost identically:
archiveFolder :: Archive -> (ArchiveEntry -> IO ByteString -> a -> IO a) -> a -> IO a archiveFolder ar f acc = catch (do entry <- archiveReadNextFileEntry ar acc' <- f entry (readCurrentEntry entry) acc archiveFolder ar f acc') (\_ -> return acc) where readCurrentEntry entry = do sz <- archiveEntrySize entry bs <- archiveReadCurrentFile ar sz return bs
There are a few subtle differences here. Instead of testing the result of archiveReadNextFileEntry
, we're going to rely on catching an exception to know when to terminate the fold. And instead of using a thunk (which is rarely needed in a lazy language like Haskell), we're just going to pass an IO operation which is the result of calling the local function readCurrentEntry
.
Let's define a few additional routines in a separate file so that we can test this out:
module Main where import Codec.Archive.Libarchive import Data.ByteString hiding (map) import Data.Int import qualified Data.ByteString.Lazy as L import Prelude hiding (read, take) bufSize :: Int bufSize = 10240 archiveReadOpen :: String -> IO Archive archiveReadOpen path = do ar <- archiveReadNew expectOK $ archiveReadSupportCompressionAll ar expectOK $ archiveReadSupportFormatAll ar expectOK $ archiveReadOpenFile ar path bufSize return ar archiveFoldFile :: String -> (ArchiveEntry -> IO ByteString -> a -> IO a) -> a -> IO a archiveFoldFile path f acc = do ar <- archiveReadOpen path archiveFolder ar f acc readEntry :: ArchiveEntry -> IO ByteString -> [(Int, ByteString)] -> IO [(Int, ByteString)] readEntry entry reader acc = do t <- archiveEntryFiletype entry if t == ae_ifreg then do sz <- archiveEntrySize entry bs <- reader let sub = take (min sz 10) bs -- just keep some of the data for test purposes return $ (sz, sub):acc else return acc path = "/Users/warrenharris/projects/blog/comic/v-vendetta-0.cbz" test1 = archiveFoldFile path readEntry []
Let's call it:
*Main> test1 [(2535,"-=Book=-\r\r"),(956,"V For Vend"),(590199,"\255\216\255\224\NUL\DLEJFIF"),(374813,"\255\216\255\224\NUL\DLEJFIF"),(508622,"\255\216\255\224\NUL\DLEJFIF"),(518735,"\255\216\255\224\NUL\DLEJFIF"),(527499,"\255\216\255\224\NUL\DLEJFIF"),(451816,"\255\216\255\224\NUL\DLEJFIF"),(541239,"\255\216\255\224\NUL\DLEJFIF"),(481156,"\255\216\255\224\NUL\DLEJFIF"),(516998,"\255\216\255\224\NUL\DLEJFIF"),(510813,"\255\216\255\224\NUL\DLEJFIF"),(478150,"\255\216\255\224\NUL\DLEJFIF"),(437432,"\255\216\255\224\NUL\DLEJFIF"),(310424,"\255\216\255\224\NUL\DLEJFIF"),(265642,"\255\216\255\224\NUL\DLEJFIF"),(301822,"\255\216\255\224\NUL\DLEJFIF"),(676613,"\255\216\255\224\NUL\DLEJFIF"),(53109,"\255\216\255\224\NUL\DLEJFIF"),(452430,"\255\216\255\224\NUL\DLEJFIF"),(305523,"\255\216\255\224\NUL\DLEJFIF"),(319714,"\255\216\255\224\NUL\DLEJFIF"),(298048,"\255\216\255\224\NUL\DLEJFIF")]
Once thing you'll notice here is that the data got reversed. The last entry (beginning with "-=Book=-\r\r") got put at the front of the list. This is because the fold routine that we wrote is essentially a left-fold, which means it starts from the left (the beginning of the archive contents) and pushes its results into a list as it goes. So the list has the first element on the right. No big deal. We can simply reverse it if we want.
Folding archives -- take 2
Now the first thing the astute reader will notice about archiveFolder
is that it really isn't necessary to provide readCurrentEntry
at all. This is something the user's function can implement as long as it has the Archive
and ArchiveEntry
values in hand. We might rewrite this fold routine as follows:
archiveFolder2 :: Archive -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a archiveFolder2 ar f acc = catch (do entry <- archiveReadNextFileEntry ar acc' <- f ar entry acc archiveFolder2 ar f acc') (\_ -> return acc)
Now let's provide a few additional routines so that we can test this out:
archiveFoldFile2 :: String -> (Archive -> ArchiveEntry -> a -> IO a) -> a -> IO a archiveFoldFile2 path f acc = do ar <- archiveReadOpen path archiveFolder2 ar f acc readEntry2 :: Archive -> ArchiveEntry -> [(Int, ByteString)] -> IO [(Int, ByteString)] readEntry2 ar entry acc = do t <- archiveEntryFiletype entry if t == ae_ifreg then do sz <- archiveEntrySize entry bs <- archiveReadCurrentFile ar sz let sub = take (min sz 10) bs return $ (sz, sub):acc else return acc test2 = archiveFoldFile2 path readEntry2 []
We get the exact same result if we call this one. All we've done is moved a bit of the work into the readEntry2
function.
Step away from the fold
Now the problem with using a fold operation to traverse an archive file is that it doesn't give you a way to control which parts of the archive get visited. Granted, the user's proc
can decided not to slurp in the contents of an archive_entry
, but it can't decided that it doesn't want to continue visiting entries. The fold operation is going to continue marching to the end of the archive, whether you've got enough data or not.
Now it might be possible to use exceptions, continuations or delimited control to provide an escape hatch for fold, but a nicer abstraction would be one that didn't have to resort to escape hatches at all -- one that just build the data structure you were about to use as you decided you needed it. Just like a file read
operation that streams in chunks of the file each successive time read
is called, wouldn't it be nice if we could provide a structured stream of data that only created the data values from the raw archive data as necessary? That's exactly what lazy IO can do for you.
If you remember what I said earlier, the IO monad isn't lazy at all. It's all about sequencing "real world" operations in an order that's predictable and timely. So before we get into the lazy part of lazy IO, let's just design a plain old IO operation that reads the entire archive.
To read an archive, we need to alternatively read headers and archive data -- so there's essentially 2 states to the reading process. And as we read, we need to accumulate this data somewhere. Perhaps the most natural place to accumulate it is as a list of ByteStrings -- one for each archive_entry
.
However, instead of regular (strict) ByteString
s, we're going to use lazy ByteString
s instead. There are 2 reasons for this: First, a lazy ByteString
provides the same interface abstraction as a regular ByteString
, but underneath is uses a sequence of buffer segments to allow data to be brought in as needed. Second, is because it has laziness built into it from the get-go. Although we're not going to use the lazy aspect of L.ByteString
just yet, we'll need it eventually -- just just hold on.
Let's start from the bottom up, by first providing a function to return a lazy ByteString
for each archive_entry
. It's easy to make a L.ByteString
from a list of strict ones simply by calling L.fromChunks:
archiveReadEntry :: Ptr ARCHIVE -> IO L.ByteString archiveReadEntry a = do chunks <- archiveReadEntryChunks a return $ L.fromChunks chunks archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString] archiveReadEntryChunks a = do bs <- archiveReadEntryChunk a if length bs == 0 then return [] else do rest <- archiveReadEntryChunks a return (bs:rest) archiveReadEntryChunk :: Ptr ARCHIVE -> IO ByteString archiveReadEntryChunk a = do buf <- mallocArray bufSize rv <- archive_read_data a buf (fromIntegral bufSize) unsafePackCStringFinalizer (castPtr buf) (fromIntegral rv) (free buf)
Here I've cheated a bit and made all these functions take a Ptr ARCHIVE
instead of an Archive
. That's so that I can unwrap the Archive
once at the beginning, and also because I don't plan to export these functions from the module. Note that we're malloc
ing fixed-size chunks and calling archive_read_data
to fill them. Then we're using unsafePackCStringFinalizer
to associate each chunk with its free
finalizer. That's a little more efficient than what we had before.
Next we can define a function that alternately reads headers and entry contents, and does this recursively until the end of the archive is reached:
archiveRead :: Archive -> IO [L.ByteString] archiveRead ar = do withForeignPtr (unArchive ar) $ \a -> do rv <- archiveReadNextHeader ar case rv of Right entry -> do t <- archiveEntryFiletype entry if t == ae_ifreg then do bs <- archiveReadEntry a rest <- archiveRead ar return (bs:rest) else archiveRead ar Left _ -> return []
This says to read an archive, first read a header. If there is no error, check the archive entry's file type, and if it's a regular file, read the archive's contents followed by a recursive call to read the rest of the archive. If it's not a regular file, we'll skip that entry and continue on to the next one. We'll continue doing this until archiveReadNextHeader
returns an error.
Now we're almost done. The only thing that remains is this little detail I mentioned earlier -- the fact that the IO monad is strict, and as written this will pull the entire contents of the archive into memory, building the list of (fully read in) L.ByteString
s as it goes. What we need to do is introduce a little delay in a couple of critical places, so that some of these IO operations aren't performed until they're needed. This is a job for unsafeInterleaveIO
.
There are 2 places where unsafeInterleaveIO
is needed. First, we need to delay reading in the tail of the list of L.ByteString
s until more entries are needed. Second, we need to delay reading in the entire contents of any individual entry, so that its L.ByteString
isn't fully materialized until the data is actually consumed. The first one of these goes in archiveRead
itself, at the point where the recursive call is about to be made:
archiveRead :: Archive -> IO [L.ByteString] archiveRead ar = do withForeignPtr (unArchive ar) $ \a -> do rv <- archiveReadNextHeader ar case rv of Right entry -> do t <- archiveEntryFiletype entry if t == ae_ifreg then do bs <- archiveReadEntry a rest <- unsafeInterleaveIO $ archiveRead ar return (bs:rest) else archiveRead ar Left _ -> return []
The second one goes in archiveReadEntryChunks
, again at the point where it is about to be called recursively:
archiveReadEntryChunks :: Ptr ARCHIVE -> IO [ByteString] archiveReadEntryChunks a = do bs <- archiveReadEntryChunk a if length bs == 0 then return [] else do rest <- unsafeInterleaveIO $ archiveReadEntryChunks a return (bs:rest)
Now the cool thing about lazy IO is that nothing gets read in until it is needed, but due to the sequencing savviness of monads, everything happens in exactly the same order as if it had been read in eagerly. So we get the best of both worlds -- an elegant abstraction (a list of ByteStrings
), and the benefits of streaming.
Conclusion
At this point I'm sure your disappointed to find that this article has had very little to do with comics, and in general hasn't been very funny at all. But stay tuned for part 3, where I'll attempt to put it this library to use and actually build that web app for your comic viewing pleasure.
No comments:
Post a Comment