NTFS Documentation: Current Compression Engine

Here's a short summary of the mechanism:
The compression engine operates on blocks of 4096 bytes of plain-text
data. These are compressed using a modified LZ77 algorithm. The basic
idea is that substrings of the block which have been seen before are
compressed by referencing the earlier occurrence rather than repeating
the text.
For example, consider the plain text

#include <ntfs.h>\n
#include <stdio.h>\n

This is compressed to

#include <ntfs.h>\n
(-18,10)stdio(-19,4)

So the algorithm recognizes that, 18 bytes before the current position,
it has already seen the text '#include <'. Then 'stdio' is new, but
'.h>\n' has been seen before, 19 bytes back.

The interesting details are in the questions: how to encode a pair
such as (-18,10), and how to mix such pairs with plain-text strings.
The first thing to understand is that such a pair is recorded in two
bytes. Because a back-reference takes two bytes, there is no point in
back-referencing one- or two-byte substrings; the shortest substring
worth referencing is 3 bytes, so length values of 0, 1, and 2 cannot
occur. This means you can subtract 3 from the length before encoding
it. Also, the references always point backward and are never 0, so you
can store them as positive numbers and subtract one. The first
back-reference is thus stored as (17,7), and the second one as (18,1).

Given that a block is 4096 bytes in size, you may need up to 12 bits
to encode a back-reference. This means that you have only four bits
left to encode the length, allowing for a maximum length of 18. This
is not desirable, as a single two-byte back-reference could then cover
at most 18 bytes, capping the compression ratio at roughly 9:1. OTOH,
if the current offset is, say, 123, a back-reference of -512 is not
possible. Some clever MS engineer decided to dynamically allocate more
bits for the back-reference and fewer for the length. The exact split
can be written as a table, or as
for (i = clear_pos - 1, lmask = 0xFFF, dshift = 12; i >= 0x10; i >>= 1) {
        lmask >>= 1; /* bit mask for length */
        dshift--;    /* shift width for delta */
}
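
For illustration, here is that computation wrapped in a small,
self-contained C function, together with a pack_pair() helper that
applies the minus-one/minus-three adjustments described above. The
function names are mine, not from any NTFS source; this is a sketch of
my reading of the format, not driver code.

#include <stdint.h>

/* Compute how the 16 bits of a back-reference token are split at a
 * given position in the block: the delta occupies the bits above
 * dshift, the length the bits covered by lmask. */
static void split_at(unsigned clear_pos, unsigned *dshift, uint16_t *lmask)
{
        int i;

        *lmask = 0xFFF;  /* bit mask for length */
        *dshift = 12;    /* shift width for delta */
        for (i = (int)clear_pos - 1; i >= 0x10; i >>= 1) {
                *lmask >>= 1;
                (*dshift)--;
        }
}

/* Encode a reference reaching 'delta' bytes back with match length
 * 'len', emitted at offset 'clear_pos' in the block. */
static uint16_t pack_pair(unsigned clear_pos, unsigned delta, unsigned len)
{
        unsigned dshift;
        uint16_t lmask;

        split_at(clear_pos, &dshift, &lmask);
        return (uint16_t)(((delta - 1) << dshift) | ((len - 3) & lmask));
}

For example, pack_pair(18, 18, 10) (the '#include <' reference above)
yields 0x8807, which appears in the byte stream little-endian as 07 88.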

Now that we can encode an (offset,length) pair as two bytes, we still
have to know whether a token is a back-reference or plain text. This
costs one bit per token. Eight tokens are grouped together and
preceded by a tag byte. So the group
>\n(-18,10)stdio
would be encoded as
00000100 > \n 07 88 s t d i o
(the set bit marks the back reference; at offset 18 the split gives
dshift = 11, so the stored pair (17,7) is packed as (17 << 11) | 7 =
0x8807 and written little-endian as 07 88). As an extreme case, a
block of all spaces (' ') is compressed as

00000010 ' ' FC 0F

or ' ' (-1,4095). This works because the copy proceeds byte by byte,
so you always read data you have just stored, even though the
reference overlaps its own output.
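
Putting the pieces together, a minimal decompressor for one block
might look like the sketch below. This is my reading of the format as
described so far (the 2-byte block header comes next), not code from
any driver; the byte-by-byte copy is what makes overlapping references
such as (-1,4095) work.

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Decompress one block. 'src' points at the token stream (after the
 * 2-byte block header), 'src_len' is its length. Returns the number
 * of bytes written to 'dst', or -1 on a malformed stream. */
static int decompress_block(const uint8_t *src, size_t src_len,
                            uint8_t *dst /* BLOCK_SIZE bytes */)
{
        size_t in = 0, out = 0;

        while (in < src_len) {
                uint8_t tags = src[in++];  /* one tag bit per token */
                int bit, i;

                for (bit = 0; bit < 8 && in < src_len; bit++) {
                        uint16_t pair, lmask = 0xFFF;
                        unsigned dshift = 12, delta, len;

                        if (!(tags & (1 << bit))) {
                                /* plain-text token: copy one byte */
                                if (out >= BLOCK_SIZE)
                                        return -1;
                                dst[out++] = src[in++];
                                continue;
                        }
                        /* back-reference token: 16 bits, little-endian */
                        if (in + 2 > src_len)
                                return -1;
                        pair = src[in] | (src[in + 1] << 8);
                        in += 2;

                        /* recompute the delta/length split for the
                         * current output position */
                        for (i = (int)out - 1; i >= 0x10; i >>= 1) {
                                lmask >>= 1;
                                dshift--;
                        }
                        delta = (pair >> dshift) + 1;
                        len = (pair & lmask) + 3;

                        if (delta > out || out + len > BLOCK_SIZE)
                                return -1;
                        /* byte by byte, so the source may overlap the
                         * data being produced */
                        while (len--) {
                                dst[out] = dst[out - delta];
                                out++;
                        }
                }
        }
        return (int)out;
}

Fed the all-spaces example (the four bytes 02 20 FC 0F), this routine
writes 4096 spaces: the literal ' ' first, then 4095 copies of the
byte it has just stored.
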
As a compression unit consists of 16 clusters, it usually contains
more than one of these blocks. If you want to access the second block,
it would be a waste of time to decompress the first one. Instead, each
block is preceded by a 2-byte length. The lower twelve bits are the
length, the higher 4 bits are of unknown purpose.

If you look at the algorithm, you will notice that it will not always
reduce the data size. If there are no back-references, each plain-text
byte remains as-is. However, for every 8 tokens a tag byte is
inserted, all of whose bits are then zero. So, in the worst case, a
block might grow to 4610 bytes (4096 + 512 tag bytes + the 2-byte
block length). If the block grows in size, it will be stored
uncompressed. A length of exactly 4095 is used to indicate this case.
It might still be possible that the following block compresses well,
reducing the total size of the chunk. If it doesn't, the entire chunk
is stored uncompressed, which is indicated in the run list.
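
As a sketch of that fallback decision (a hypothetical helper; the
exact header encoding used here is spelled out in the correction just
below):

#include <string.h>
#include <stdint.h>

/* Emit one 4096-byte block: the compressed form if it is a win,
 * otherwise the plain data with a length field of 0xFFF. Returns the
 * number of bytes written to 'out'. */
static size_t emit_block(uint8_t *out, const uint8_t *plain,
                         const uint8_t *comp, size_t comp_len)
{
        if (comp_len >= 4096) {
                /* store verbatim: length 0xFFF, compressed flag clear */
                out[0] = 0xFF;
                out[1] = 0x0F;
                memcpy(out + 2, plain, 4096);
                return 2 + 4096;
        }
        /* 2-byte header: low 12 bits hold (length - 1), plus flags */
        out[0] = (uint8_t)((comp_len - 1) & 0xFF);
        out[1] = (uint8_t)((((comp_len - 1) >> 8) & 0x0F) | 0xB0);
        memcpy(out + 2, comp, comp_len);
        return 2 + comp_len;
}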

> each block is preceded by a 2-byte length. The lower twelve bits are the
> length, the higher 4 bits are of unknown purpose.

Bit 0x8000 is the flag specifying that the block is compressed. The
compression code ORs in the value 0xB000 (if it's compressed), but the
decompression code only looks at bit 0x8000.

Also, the length is actually stored as (n-3) in the low 12 bits, where
n includes the two bytes used to store the length itself; equivalently,
it is (n-1) if you count only the data that follows. So for an
uncompressed block the length is stored as 0xFFF, meaning 4096 bytes of
data plus the 2 bytes holding the length itself.

A 0x1000 length block compressed to length 0x500 would require 0x502 bytes,
and have an advertised length of 0x4FF.
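
Reading the header back, and using it for the random access mentioned
earlier, might look like this (again a sketch with invented helper
names):

#include <stddef.h>
#include <stdint.h>

/* Low 12 bits: payload length minus one; bit 0x8000: compressed. */
static unsigned header_payload_len(uint16_t header)
{
        return (header & 0x0FFF) + 1;
}

static int header_is_compressed(uint16_t header)
{
        return (header & 0x8000) != 0;
}

/* Hop to the start of the k-th block (0-based) inside a compression
 * unit without decompressing the blocks before it. */
static const uint8_t *seek_block(const uint8_t *unit, size_t unit_len,
                                 unsigned k)
{
        size_t pos = 0;

        while (k--) {
                uint16_t header;

                if (pos + 2 > unit_len)
                        return NULL;
                header = unit[pos] | (unit[pos + 1] << 8);
                pos += 2 + header_payload_len(header);
        }
        return pos + 2 <= unit_len ? unit + pos : NULL;
}

For the example above, a header of 0xB4FF advertises a payload of
header_payload_len(0xB4FF) == 0x500 bytes, and seek_block(unit, len, 1)
lands on the second block's header without touching the first block's
data, which is the point of storing per-block lengths.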

What I don't know is whether a 16-cluster file that doesn't compress
at all requires 17 clusters to store, in order to accommodate the
extra 2 bytes per block.

I believe it will take only 16 clusters. The fact that it is not
compressed will be expressed in the run list. For example, the
compressed file will look like
(1000 A) (0 6)       // (LCN length), length in clusters
whereas the uncompressible file will look like
(1000 10)
or
(1000 A) (1040 6)
IOW, if you don't have any runs with LCN == 0 in the 16 clusters, the
chunk is entirely uncompressed and plain.
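
In code, that check might read as follows, using the (LCN length)
notation from the example above (a hole has no disk location, written
here as LCN 0; the structure is hypothetical):

#include <stddef.h>
#include <stdint.h>

/* One run of a 16-cluster compression unit. */
struct run {
        uint64_t lcn;   /* starting cluster on disk, 0 for a hole */
        uint64_t len;   /* length in clusters */
};

/* A unit is stored compressed iff part of it is unallocated. */
static int unit_is_compressed(const struct run *runs, size_t nruns)
{
        size_t i;

        for (i = 0; i < nruns; i++)
                if (runs[i].lcn == 0)
                        return 1;
        return 0;
}
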
Given the compression algorithm, it is fairly easy to create such a
file:
s=""
for i in range(0,16):   #adjust to clusters >512 if necessary
  for j in range(0,255):
    s=s+chr(i)+chr(j)
open("uncompressable","w").write(s)

Regis Duchesne at VIA, ECP, France
Last modified: Tue Dec 23 14:46:41 CET 1997