Signed-off-by: pengjingtong <pengjingtong@huawei.com>
17 KiB
third_party_lzma
介绍
LZMA 是著名的LZ77压缩算法的改良版本, 最大化地提高了压缩比率, 保持了高压缩速度和解压缩时较低的内存需要。
LZMA2 基于 LZMA, 在压缩过程中提供了更好的多线程支持, 和其他改进优化。
7z 是一种数据压缩和文件档案的格式, 是7zip软件的主要文件格式 7z官网。 7z 格式支持不同的压缩方式: LZMA, LZMA2 和其他, 同时也支持基于AES-256的对称加密。
XZ 是一种使用LZMA2数据压缩的文件格式, XZ格式带有额外的特性: SHA/CRC数据校验, 用于提升压缩比率的filters, 拆分blocks和streams。
软件架构
软件架构说明
| format/algorithm | C | C++ | C# | Java |
|---|---|---|---|---|
| LZMA 压缩和解压缩 | ✓ | ✓ | ✓ | ✓ |
| LZMA2 压缩和解压缩 | ✓ | ✓ | ||
| XZ 压缩和解压缩 | ✓ | ✓ | ||
| 7Z 解压缩 | ✓ | ✓ | ||
| 7Z 压缩 | ✓ | |||
| small SFXs for installers (7z decompression) | ✓ | |||
| SFXs and SFXs for installers (7z decompression) | ✓ |
/third_party/lzma
├── Asm # asm files (optimized code for CRC calculation and Intel-AES encryption)
│ ├── arm
│ ├── arm64
│ └── x86
├── C # C files (compression / decompression and other)
│ └── Util
│ ├── 7z # 7z decoder program (decoding 7z files)
│ ├── Lzma # LZMA program (file->file LZMA encoder/decoder)
│ ├── LzmaLib # LZMA library (.DLL for Windows)
│ └── SfxSetup # small SFX module for installers
├── CPP
│ ├── Common # common files for C++ projects
│ ├── Windows # common files for Windows related code
│ └── 7zip # files related to 7-Zip
│ ├── Archive # files related to archiving
│ │ ├── Common # common files for archive handling
│ │ └── 7z # 7z C++ Encoder/Decoder
│ ├── Bundles # Modules that are bundles of other modules (files)
│ │ ├── Alone7z # 7zr.exe: Standalone 7-Zip console program (reduced version)
│ │ ├── Format7zExtractR # 7zxr.dll: Reduced version of 7z DLL: extracting from 7z/LZMA/BCJ/BCJ2.
│ │ ├── Format7zR # 7zr.dll: Reduced version of 7z DLL: extracting/compressing to 7z/LZMA/BCJ/BCJ2
│ │ ├── LzmaCon # lzma.exe: LZMA compression/decompression
│ │ ├── LzmaSpec # example code for LZMA Specification
│ │ ├── SFXCon # 7zCon.sfx: Console 7z SFX module
│ │ ├── SFXSetup # 7zS.sfx: 7z SFX module for installers
│ │ └── SFXWin # 7z.sfx: GUI 7z SFX module
│ ├── Common # common files for 7-Zip
│ ├── Compress # files for compression/decompression
│ ├── Crypto # files for encryption / decompression
│ └── UI # User Interface files
│ ├── Client7z # Test application for 7za.dll, 7zr.dll, 7zxr.dll
│ ├── Common # Common UI files
│ ├── Console # Code for console program (7z.exe)
│ ├── Explorer # Some code from 7-Zip Shell extension
│ ├── FileManager # Some GUI code from 7-Zip File Manager
│ └── GUI # Some GUI code from 7-Zip
├── CS
│ └── 7zip
│ ├── Common # some common files for 7-Zip
│ └── Compress # files related to compression/decompression
│ ├── LZ # files related to LZ (Lempel-Ziv) compression algorithm
│ ├── LZMA # LZMA compression/decompression
│ ├── LzmaAlone # file->file LZMA compression/decompression
│ └── RangeCoder # Range Coder (special code of compression/decompression)
├── DOC
│ ├── 7zC.txt # 7z ANSI-C Decoder description
│ ├── 7zFormat.txt # 7z Format description
│ ├── installer.txt # information about 7-Zip for installers
│ ├── lzma-history.txt # history of LZMA SDK
│ ├── lzma-sdk.txt # LZMA SDK description
│ ├── lzma-specification.txt # Specification of LZMA
│ ├── lzma.txt # LZMA compression description
│ └── Methods.txt # Compression method IDs for .7z
└── Java
└── SevenZip
└── Compression # files related to compression/decompression
├── LZ # files related to LZ (Lempel-Ziv) compression algorithm
├── LZMA # LZMA compression/decompression
└── RangeCoder # Range Coder (special code of compression/decompression)
证书
LZMA SDK is written and placed in the public domain by Igor Pavlov.
Some code in LZMA SDK is based on public domain code from another developers:
-
PPMd var.H (2001): Dmitry Shkarin
-
SHA-256: Wei Dai (Crypto++ library)
Anyone is free to copy, modify, publish, use, compile, sell, or distribute the original LZMA SDK code, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.
LZMA SDK code is compatible with open source licenses, for example, you can include it to GNU GPL or GNU LGPL code.
编译构建
UNIX/Linux
使用gcc和clang编译7-zip有多种选项,同时7-zip代码中两部分重要的代码: C和汇编。如果与汇编代码一起编译版本,会得到更快的7-zip二进制。7-zip的汇编代码遵循不同平台的语法。
arm64
gcc和clang arm64版本支持arm64汇编代码语法。
x86 and x86_64(AMD64)
Asmc Macro Assembler 和 JWasm 在Linux 系统上都支持MASM语法,但JWasm 不支持一些7-zip中使用的cpu指令。 如果你想编译更快的7zip,必须在Linux上安装Asmc Macro Assembler https://github.com/nidud/asmc
构建命令
目录中有两个主要文件用于编译 makefile - 使用nmake命令编译Windows版本的7zip makefile.gcc - 使用make命令编译Linux/macOs版本的7zip
首先切换到包含 makefile.gcc的目录下:
cd CPP/7zip/Bundles/Alone7z
make -j -f makefile.gcc
另外在"CPP/7zip/"目录下的"*.mak"文件也可以与优化的代码同时编译,并且带有优化选项。比如:
cd CPP/7zip/Bundles/Alone7z
make -j -f ../../cmpl_gcc.mak
接口使用说明
这部分描述了C语言实现的LZMA编码和解码函数
注意: 你也可以阅读参考 LZMA Specification (lzma-specification.txt from LZMA SDK)
你也可以查看使用LZMA编码和解码的案例: C/Util/Lzma/LzmaUtil.c
LZMA 压缩的文件格式
Offset Size Description
0 1 Special LZMA properties (lc,lp, pb in encoded form)
1 4 Dictionary size (little endian)
5 8 Uncompressed size (little endian). -1 means unknown size
13 Compressed data
ANSI-C(American National Standards Institue) LZMA Decoder 请注意ANSI-C的接口在LZMA SDK 4.58版本发生了变更,如果你想使用旧的接口,你可以从sourceforge.net 网站下载之前的LZMA SDK版本。
使用 ANSI-C LZMA Decoder需要使用到以下文件:
LzmaDec.h
LzmaDec.c
7zTypes.h
Precomp.h
Compiler.h
参考案例: C/Util/Lzma/LzmaUtil.c
LZMA decoding的内存要求
-
LZMA decoding函数局部变量的栈内存不超过200-400字节
-
LZMA Decoder使用字典缓冲区和内部state结构
-
内部state结构size消耗state_size = (4 + (1.5 << (lc + lp))) KB by default (lc=3, lp=0), state_size = 16 KB.
如何解压缩
LZMA Decoder (ANSI-C version) 支持以下两种接口:
1) 单次调用: LzmaDecode
2) 多次调用:LzmaDec_DecodeToBuf(类似于zlib接口)
你必须自己定义内存分配器:
Example:
void *SzAlloc(void *p, size_t size) { p = p; return malloc(size); }
void SzFree(void *p, void *address) { p = p; free(address); }
ISzAlloc alloc = { SzAlloc, SzFree };
You can use p = p; operator to disable compiler warnings.
单次调用
- 使用场景: RAM->RAM decompressing
- 编译文件: LzmaDec.h + LzmaDec.c + 7zTypes.h
- 编译宏: 不需要
- 内存需要:
- Input buffer: compressed size
- Output buffer: uncompressed size
- LZMA Internal Structures: state_size (16 KB for default settings)
Interface:
int LzmaDecode(Byte *dest, SizeT *destLen, const Byte *src, SizeT *srcLen,
const Byte *propData, unsigned propSize, ELzmaFinishMode finishMode,
ELzmaStatus *status, ISzAlloc *alloc);
In:
dest - output data
destLen - output data size
src - input data
srcLen - input data size
propData - LZMA properties (5 bytes)
propSize - size of propData buffer (5 bytes)
finishMode - It has meaning only if the decoding reaches output limit (*destLen).
LZMA_FINISH_ANY - Decode just destLen bytes.
LZMA_FINISH_END - Stream must be finished after (*destLen).
You can use LZMA_FINISH_END, when you know that
current output buffer covers last bytes of stream.
alloc - Memory allocator.
Out:
destLen - processed output size
srcLen - processed input size
Output:
SZ_OK
status:
LZMA_STATUS_FINISHED_WITH_MARK
LZMA_STATUS_NOT_FINISHED
LZMA_STATUS_MAYBE_FINISHED_WITHOUT_MARK
SZ_ERROR_DATA - Data error
SZ_ERROR_MEM - Memory allocation error
SZ_ERROR_UNSUPPORTED - Unsupported properties
SZ_ERROR_INPUT_EOF - It needs more bytes in input buffer (src).
如果LZMA decoder 在输出缓冲区上限前到达并看到了end_marker, 返回OK,同时输出的destLen的值会比输出缓冲区的上限小。
你可以在完全解压缩后使用多重检查数据的完整性:
- 检查返回值和status变量
- 如果你已知未压缩的数据大小,检查 output(destLen) = uncompressedSize
- 如果你已知压缩后的数据大小,检查 output(srcLen) = compressedSize
根据状态多次调用 (类似于zlib接口)
- 使用场景: file->file decompressing
- 编译文件: LzmaDec.h + LzmaDec.c + 7zTypes.h
- 内存要求:
- Buffer for input stream: any size (for example, 16 KB)
- Buffer for output stream: any size (for example, 16 KB)
- LZMA Internal Structures: state_size (16 KB for default settings)
- LZMA dictionary (字典大小编码在LZMA properties header中)
使用流程:
1) 读取 LZMA properties (5 bytes) and uncompressed size (8 bytes, 小端序) 到 header:
unsigned char header[LZMA_PROPS_SIZE + 8];
ReadFile(inFile, header, sizeof(header)
2) 使用"LZMA properties"分配创建 CLzmaDec(state + dictionary)
CLzmaDec state;
LzmaDec_Constr(&state);
res = LzmaDec_Allocate(&state, header, LZMA_PROPS_SIZE, &g_Alloc);
if (res != SZ_OK)
return res;
3) 初始化LzmaDec,在循环中调用LzmaDec_DecodeToBuf
LzmaDec_Init(&state);
for (;;)
{
...
int res = LzmaDec_DecodeToBuf(CLzmaDec *p, Byte *dest, SizeT *destLen,
const Byte *src, SizeT *srcLen, ELzmaFinishMode finishMode);
...
}
4) 释放所有分配的结构
LzmaDec_Free(&state, &g_Alloc);
Look example code: C/Util/Lzma/LzmaUtil.c
如何压缩数据
1 编译文件:
7zTypes.h
Threads.h
LzmaEnc.h
LzmaEnc.c
LzFind.h
LzFind.c
LzFindMt.h
LzFindMt.c
LzHash.h
2 内存需要:
- (dictSize * 11.5 + 6 MB) + state_size
Lzma Encoder 可使用两种内存分配器:
- alloc - for small arrays.
- allocBig - for big arrays.
例如,你可以在allocBig分配器中使用大RAM页(2 MB)来获得更快的压缩速度。需要注意的是Windows对于大RAM页的实现较差。alloc和allocBig也可以使用相同的分配器。
带有回调的单次压缩
Look example code: C/Util/Lzma/LzmaUtil.c
使用场景: file->file compressing
1) 你必须实现接口的回调函数
ISeqInStream
ISeqOutStream
ICompressProgress
ISzAlloc
static void *SzAlloc(void *p, size_t size) { p = p; return MyAlloc(size); }
static void SzFree(void *p, void *address) { p = p; MyFree(address); }
static ISzAlloc g_Alloc = { SzAlloc, SzFree };
CFileSeqInStream inStream;
CFileSeqOutStream outStream;
inStream.funcTable.Read = MyRead;
inStream.file = inFile;
outStream.funcTable.Write = MyWrite;
outStream.file = outFile;
2) 创建CLzmaEncHandle对象
CLzmaEncHandle enc;
enc = LzmaEnc_Create(&g_Alloc);
if (enc == 0)
return SZ_ERROR_MEM;
3) 初始化CLzmaEncProps属性
LzmaEncProps_Init(&props);
之后你可以改变这个结构里的一些属性
4) 把上一个步骤设置的属性设置给LZMA Encoder
res = LzmaEnc_SetProps(enc, &props);
5) 将编码的属性写入header
Byte header[LZMA_PROPS_SIZE + 8];
size_t headerSize = LZMA_PROPS_SIZE;
UInt64 fileSize;
int i;
res = LzmaEnc_WriteProperties(enc, header, &headerSize);
fileSize = MyGetFileLength(inFile);
for (i = 0; i < 8; i++)
header[headerSize++] = (Byte)(fileSize >> (8 * i));
MyWriteFileAndCheck(outFile, header, headerSize)
6) 调用编码函数
res = LzmaEnc_Encode(enc, &outStream.funcTable, &inStream.funcTable,
NULL, &g_Alloc, &g_Alloc);
7) 删除LZMA Encoder对象
LzmaEnc_Destroy(enc, &g_Alloc, &g_Alloc);
如果回调函数返回某些错误码,LzmaEnc_Encode 也会返回该错误码或者返回类似于SZ_ERROR_READ, SZ_ERROR_WRITE or SZ_ERROR_PROGRESS。
单次调用 RAM->RAM 压缩
单次调用,RAM->RAM 压缩与设置回调的方式压缩类似, 但你需要提供指向buffers的指针而不是指向回调函数的指针。
SRes LzmaEncode(Byte *dest, SizeT *destLen, const Byte *src, SizeT srcLen,
const CLzmaEncProps *props, Byte *propsEncoded, SizeT *propsSize, int writeEndMark,
ICompressProgress *progress, ISzAlloc *alloc, ISzAlloc *allocBig);
Return code:
SZ_OK - OK
SZ_ERROR_MEM - Memory allocation error
SZ_ERROR_PARAM - Incorrect paramater
SZ_ERROR_OUTPUT_EOF - output buffer overflow
SZ_ERROR_THREAD - errors in multithreading functions (only for Mt version)
宏
_LZMA_SIZE_OPT - Enable some optimizations in LZMA Decoder to get smaller executable code.
_LZMA_PROB32 - It can increase the speed on some 32-bit CPUs, but memory usage for
- some structures will be doubled in that case.
_LZMA_UINT32_IS_ULONG - Define it if int is 16-bit on your compiler and long is 32-bit.
_LZMA_NO_SYSTEM_SIZE_T - Define it if you don't want to use size_t type.
_7ZIP_PPMD_SUPPPORT - Define it if you don't want to support PPMD method in AMSI-C .7z decoder.
C++版本的 LZMA Encoder/Decoder
C++版本的 LZMA 代码使用COM-LIKE接口。如果你想使用,可以了解下COM(Component Object Model)/OLE(Object Linking and Embedding)/DDE(Dynamic Data Exchange)的基础。
C++版本的 LZMA 代码部门仅仅只是将ANSI-C代码包装了.
注意: 如果你使用7zip目录下的C++代码,你必须检查你正确地使用new 运算符 MSVC 6.0 编译7-zip时,不会抛出 new 运算符的异常。所以7zip在 CPP\Common\NewHandler.cpp 重新定义了new operator
operator new(size_t size)
{
void *p = ::malloc(size);
if (p == 0)
throw CNewException();
return p;
}
如果你使用的MSCV版本支持new运算符的异常抛出,你在编译7zip时可以忽略"NewHandler.cpp"。 所以使用标准的异常。实际上7zip的部分代码捕获的任何异常都会转换为HRESULT码。如果你调用7zip的COM interface 就不需要捕获CNewException.
接口案例:
Look example code : C/Util/Lzma/LzmaUtil.c
cd C/Util/Lzma
make -j -f makefile.gcc
output: ./_o/7lzma
LZMA-C 22.01 (x64) : Igor Pavlov : Public domain : 2022-07-15
Usage: lzma <e|d> inputFile outputFile
e: encode file
d: decode file
参与贡献
https://sourceforge.net/p/sevenzip/_list/tickets