mirror of
https://github.com/HackerN64/HackerOoT.git
synced 2026-01-21 10:37:37 -08:00
239 lines
8.5 KiB
Markdown
239 lines
8.5 KiB
Markdown
# BPS File Format Specification
|
|
|
|
_by byuu. Public domain._ https://byuu.org/
|
|
|
|
BPS patches encode the differences between two files.
|
|
|
|
string "BPS1"
|
|
number source-size
|
|
number target-size
|
|
number metadata-size
|
|
string metadata[metadata-size]
|
|
repeat {
|
|
number action | ((length - 1) << 2)
|
|
action 0: SourceRead {
|
|
}
|
|
action 1: TargetRead {
|
|
byte[] length
|
|
}
|
|
action 2: SourceCopy {
|
|
number negative | (abs(offset) << 1)
|
|
}
|
|
action 3: TargetCopy {
|
|
number negative | (abs(offset) << 1)
|
|
}
|
|
}
|
|
uint32 source-checksum
|
|
uint32 target-checksum
|
|
uint32 patch-checksum
|
|
|
|
## Synonyms
|
|
|
|
A few terms are used interchangeably as appropriate.
|
|
|
|
* source = original file (input)
|
|
* target = modified file (output)
|
|
* number = variable-length integer encoding
|
|
|
|
## Linear creation
|
|
|
|
Rather than encoding a list of changes to files (insert data here, delete data
|
|
here, modify data here, ...); beat patches encode the steps needed to create a
|
|
new file from an old file. That is to say, the target file starts off as an
|
|
empty, zero-byte file. The patch commands tell us how to write each and every
|
|
byte of the target file sequentially.
|
|
|
|
## Variable-length number encoding
|
|
|
|
Rather than limit the maximum file size supported to 16MB (24-bit) or 4GB
|
|
(32-bit), beat patches use a variable-length encoding to support any number of
|
|
bits, and thus, any possible file size.
|
|
|
|
The basic idea is that we encode the lowest seven bits of the number, and then
|
|
the eighth bit of each byte is a flag to say whether the full number has been
|
|
represented or not. If set, this is the last byte of the number. If not, then
|
|
we shift out the low seven bits and repeat until the number is fully encoded.
|
|
|
|
One last optimization is to subtract one after each encode. Without this, one
|
|
could encode '1' with 0x81 or 0x01 0x80, producing an ambiguity.
|
|
|
|
Decoding is the inverse of the above process.
|
|
|
|
Below are C++ implementations of this idea. Note that we are using uint64 for
|
|
the data type: this will limit beat patches created with these algorithms to
|
|
64-bit file sizes. If 128-bit integers were available, they could be used
|
|
instead. Of course, it's silly to even imagine patching a file larger than 16
|
|
exabytes, but beat does allow it.
|
|
|
|
**Encoding**
|
|
|
|
void encode(uint64 data) {
|
|
while(true) {
|
|
uint8 x = data & 0x7f;
|
|
data >>= 7;
|
|
if(data == 0) {
|
|
write(0x80 | x);
|
|
break;
|
|
}
|
|
write(x);
|
|
data--;
|
|
}
|
|
}
|
|
|
|
**Decoding**
|
|
|
|
uint64 decode() {
|
|
uint64 data = 0, shift = 1;
|
|
while(true) {
|
|
uint8 x = read();
|
|
data += (x & 0x7f) * shift;
|
|
if(x & 0x80) break;
|
|
shift <<= 7;
|
|
data += shift;
|
|
}
|
|
return data;
|
|
}
|
|
|
|
--------------------------------------------------------------------------------
|
|
|
|
## Header
|
|
|
|
First, we have the file format marker, "BPS1". We then encode the source and
|
|
target file sizes. Next, we encode optional metadata. If no metadata is present,
|
|
store an encoded zero here (0x80 per above.) Otherwise, specify the length of
|
|
the metadata.
|
|
|
|
Note that officially, metadata should be XML version 1.0 encoding UTF-8 data,
|
|
and the metadata-size specifies the actual length. As in, there is no
|
|
null-terminator after the metadata. However, the actual contents here are
|
|
entirely domain-specific, so literally anything can go here and the patch will
|
|
still be considered valid.
|
|
|
|
## Transfer lengths
|
|
|
|
We store lengths as length - 1 to prevent ambiguities. There is no sense in
|
|
encoding a command that ultimately does nothing. This also slightly helps with
|
|
patch size reduction in some cases.
|
|
|
|
## Relative offsets
|
|
|
|
beat patches keep track of the current file offsets in both the source and
|
|
target files separately. Reading from either increments their respective offsets
|
|
automatically.
|
|
|
|
As such, offsets are encoded relatively to the current positions. These offsets
|
|
can move the read cursors forward or backward. To support negative numbers with
|
|
variable-integer encoding requires us to store the negative flag as the lowest
|
|
bit, followed by the absolute value (eg abs(-1) = 1)
|
|
|
|
Note, and this is very important, for obvious reasons you cannot read from
|
|
before the start or after the end of the file. Further, you cannot read beyond
|
|
the current target write output offsets, as that data is not yet available.
|
|
Attempting to do so instantly makes the patch invalid and will abort patching
|
|
entirely.
|
|
|
|
**outputOffset:** this is a value that starts at zero. Every time a byte is
|
|
written to the target file, this offset is incremented by one.
|
|
|
|
**sourceRelativeOffset:** this is a value that starts at zero. SourceCopy will
|
|
adjust this value by a signed amount, and then increment the value by one for
|
|
each read performed by said command. This value can never be less than zero, or
|
|
greater than or equal to the source file size.
|
|
|
|
**targetRelativeOffset:** this is a value that starts at zero. TargetCopy will
|
|
adjust this value by a signed amount, and then increment the value by one for
|
|
each read performed by said command. This value can never be less than zero, or
|
|
greater than or equal to the outputOffset.
|
|
|
|
## Repeat
|
|
|
|
Commands repeat until the end of the patch. This can be detected by testing the
|
|
patch read location, and stopping when offset() >= size() - 12. Where 12 is the
|
|
number of bytes in the patch footer.
|
|
|
|
void action() {
|
|
uint64 data = decode();
|
|
uint64 command = data & 3;
|
|
uint64 length = (data >> 2) + 1;
|
|
}
|
|
|
|
## SourceRead
|
|
|
|
This command copies bytes from the source file to the target file. Since both
|
|
the patch creator and applier will have access to the entire source file, the
|
|
actual bytes to output do not need to be stored here.
|
|
|
|
This command is rarely useful in delta patch creation, and is mainly intended to
|
|
allow for linear-based patchers. However, at times it can be useful even in
|
|
delta patches when data is the same in both source and target files at the same
|
|
location.
|
|
|
|
void sourceRead() {
|
|
while(length--) {
|
|
target[outputOffset] = source[outputOffset];
|
|
outputOffset++;
|
|
}
|
|
}
|
|
|
|
## TargetRead
|
|
|
|
When a file is modified, new data is thus created. This command can store said
|
|
data so that it can be written to the target file. This time, the actual data is
|
|
not available to the patch applier, so it is stored directly inside the patch.
|
|
|
|
void targetRead() {
|
|
while(length--) {
|
|
target[outputOffset++] = read();
|
|
}
|
|
}
|
|
|
|
## SourceCopy
|
|
|
|
This command treats the entire source file as a dictionary, similarly to LZ
|
|
compression. An offset is supplied to seek the sourceRelativeOffset to the
|
|
desired location, and then data is copied from said offset to the target file.
|
|
|
|
void sourceCopy() {
|
|
uint64 data = decode();
|
|
sourceRelativeOffset += (data & 1 ? -1 : +1) * (data >> 1);
|
|
while(length--) {
|
|
target[outputOffset++] = source[sourceRelativeOffset++];
|
|
}
|
|
}
|
|
|
|
## TargetCopy
|
|
|
|
This command treats all of the data that has already been written to the target
|
|
file as a dictionary. By referencing already written data, we can optimize
|
|
repeated data in the target file that does not exist in the source file.
|
|
|
|
This can allow for efficient run-length encoding. For instance, say 16MB of
|
|
0x00s appear in a row in only the target file. We can use TargetRead to write a
|
|
single 0x00. Now we can use TargetCopy to point at this byte, and set the length
|
|
to 16MB-1. The effect will be that the target output size grows as the command
|
|
runs, thus repeating the data.
|
|
|
|
void targetCopy() {
|
|
uint64 data = decode();
|
|
targetRelativeOffset += (data & 1 ? -1 : +1) * (data >> 1);
|
|
while(length--) {
|
|
target[outputOffset++] = target[targetRelativeOffset++];
|
|
}
|
|
}
|
|
|
|
## Footer
|
|
|
|
Checksum information appears at the bottom of the file. The idea is to allow a
|
|
patcher to calculate this information as the patch is being produced. The source
|
|
checksum verifies that the input file is correct, and the target checksum
|
|
verifies that the patch has been applied successfully. Finally, the patch itself
|
|
has a checksum. The patch checksum is the checksum of every byte before it. In
|
|
other words, it does not include the last four bytes for obvious reasons. This
|
|
ensures the patch itself has not been corrupted.
|
|
|
|
Note that checksums are stored in CRC32 format. The intention of checksums is to
|
|
verify against corruption and mistakenly incorrect files. The idea was to keep
|
|
the file format simple, so cryptographically secure hashes were not used here.
|
|
If security is a grave concern, SHA256 or better hashes can be stored in the
|
|
manifest data. Otherwise, beat is not the right file format for such uses.
|