[PDB] Begin adding documentation for the PDB file format.

Differential Revision: https://reviews.llvm.org/D26374 git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@286491 91177308-0d34-0410-b5e6-96231b3b80d8
2024-11-28 14:10:41 +00:00 · 2016-11-10 19:24:21 +00:00 · 2016-11-10 19:24:21 +00:00 · ab792ca2d9
commit ab792ca2d9
parent e13ecb7d13
10 changed files with 306 additions and 0 deletions
--- a/docs/PDB/DbiStream.rst
+++ b/docs/PDB/DbiStream.rst
@ -0,0 +1,3 @@
+=====================================
+The PDB DBI (Debug Info) Stream
+=====================================
--- a/docs/PDB/GlobalStream.rst
+++ b/docs/PDB/GlobalStream.rst
@ -0,0 +1,3 @@
+=====================================
+The PDB Global Symbol Stream
+=====================================
--- a/docs/PDB/HashStream.rst
+++ b/docs/PDB/HashStream.rst
@ -0,0 +1,3 @@
+=====================================
+The TPI & IPI Hash Streams
+=====================================
--- a/docs/PDB/ModiStream.rst
+++ b/docs/PDB/ModiStream.rst
@ -0,0 +1,3 @@
+=====================================
+The Module Information Stream
+=====================================
--- a/docs/PDB/MsfFile.rst
+++ b/docs/PDB/MsfFile.rst
@ -0,0 +1,121 @@
+=====================================
+The MSF File Format
+=====================================
+
+.. contents::
+   :local:
+
+.. _msf_superblock:
+
+The Superblock
+==============
+At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as
+follows:
+
+.. code-block:: c++
+
+  struct SuperBlock {
+    char FileMagic[sizeof(Magic)];
+    ulittle32_t BlockSize;
+    ulittle32_t FreeBlockMapBlock;
+    ulittle32_t NumBlocks;
+    ulittle32_t NumDirectoryBytes;
+    ulittle32_t Unknown;
+    ulittle32_t BlockMapAddr;
+  };
+
+- **FileMagic** - Must be equal to ``"Microsoft C / C++ MSF 7.00\\r\\n"``
+  followed by the bytes ``1A 44 53 00 00 00``.
+- **BlockSize** - The block size of the internal file system.  Valid values are
+  512, 1024, 2048, and 4096 bytes.  Certain aspects of the MSF file layout vary
+  depending on the block sizes.  For the purposes of LLVM, we handle only block
+  sizes of 4KiB, and all further discussion assumes a block size of 4KiB.
+- **FreeBlockMapBlock** - The index of a block within the file, at which begins
+  a bitfield representing the set of all blocks within the file which are "free"
+  (i.e. the data within that block is not used).  This bitfield is spread across
+  the MSF file at ``BlockSize`` intervals.
+  **Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``!  This field
+  is designed to support incremental and atomic updates of the underlying MSF
+  file.  While writing to an MSF file, if the value of this field is `1`, you
+  can write your new modified bitfield to page 2, and vice versa.  Only when
+  you commit the file to disk do you need to swap the value in the SuperBlock
+  to point to the new ``FreeBlockMapBlock``.
+- **NumBlocks** - The total number of blocks in the file.  ``NumBlocks * BlockSize``
+  should equal the size of the file on disk.
+- **NumDirectoryBytes** - The size of the stream directory, in bytes.  The stream
+  directory contains information about each stream's size and the set of blocks
+  that it occupies.  It will be described in more detail later.
+- **BlockMapAddr** - The index of a block within the MSF file.  At this block is
+  an array of ``ulittle32_t``'s listing the blocks that the stream directory
+  resides on.  For large MSF files, the stream directory (which describes the
+  block layout of each stream) may not fit entirely on a single block.  As a
+  result, this extra layer of indirection is introduced, whereby this block
+  contains the list of blocks that the stream directory occupies, and the stream
+  directory itself can be stitched together accordingly.  The number of
+  ``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``.
+  
+The Stream Directory
+====================
+The Stream Directory is the root of all access to the other streams in an MSF
+file.  Beginning at byte 0 of the stream directory is the following structure:
+
+.. code-block:: c++
+
+  struct StreamDirectory {
+    ulittle32_t NumStreams;
+    ulittle32_t StreamSizes[NumStreams];
+    ulittle32_t StreamBlocks[NumStreams][];
+  };
+  
+And this structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes.
+Note that each of the last two arrays is of variable length, and in particular
+that the second array is jagged.  
+
+**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4
+streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}.
+
+Stream 0: ceil(1000 / 4096) = 1 block
+
+Stream 1: ceil(8000 / 4096) = 2 blocks
+
+Stream 2: ceil(16000 / 4096) = 4 blocks
+
+Stream 3: ceil(9000 / 4096) = 3 blocks
+
+In total, 10 blocks are used.  Let's see what the stream directory might look
+like:
+
+.. code-block:: c++
+
+  struct StreamDirectory {
+    ulittle32_t NumStreams = 4;
+    ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000};
+    ulittle32_t StreamBlocks[][] = {
+      {4},
+      {5, 6},
+      {11, 9, 7, 8},
+      {10, 15, 12}
+    };
+  };
+  
+In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes``
+would equal ``60``, and ``SuperBlock->BlockMapAddr`` would be an array of one
+``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``.
+
+Note also that the streams are discontiguous, and that part of stream 3 is in the
+middle of part of stream 2.  You cannot assume anything about the layout of the
+blocks!
+
+Alignment and Block Boundaries
+==============================
+As may be clear by now, it is possible for a single field (whether it be a high
+level record, a long string field, or even a single ``uint16``) to begin and
+end in separate blocks.  For example, if the block size is 4096 bytes, and a
+``uint16`` field begins at the last byte of the current block, then it would
+need to end on the first byte of the next block.  Since blocks are not
+necessarily contiguously laid out in the file, this means that both the consumer
+and the producer of an MSF file must be prepared to split data apart
+accordingly.  In the aforementioned example, the high byte of the ``uint16``
+would be written to the last byte of block N, and the low byte would be written
+to the first byte of block N+1, which could be tens of thousands of bytes later
+(or even earlier!) in the file, depending on what the stream directory says.
--- a/docs/PDB/PdbStream.rst
+++ b/docs/PDB/PdbStream.rst
@ -0,0 +1,3 @@
+========================================
+The PDB Info Stream (aka the PDB Stream)
+========================================
--- a/docs/PDB/PublicStream.rst
+++ b/docs/PDB/PublicStream.rst
@ -0,0 +1,3 @@
+=====================================
+The PDB Public Symbol Stream
+=====================================
--- a/docs/PDB/TpiStream.rst
+++ b/docs/PDB/TpiStream.rst
@ -0,0 +1,3 @@
+=====================================
+The PDB TPI Stream
+=====================================
--- a/docs/PDB/index.rst
+++ b/docs/PDB/index.rst
@ -0,0 +1,160 @@
+=====================================
+The PDB File Format
+=====================================
+
+.. contents::
+   :local:
+
+.. _pdb_intro:
+
+Introduction
+============
+
+PDB (Program Database) is a file format invented by Microsoft and which contains
+debug information that can be consumed by debuggers and other tools.  Since
+officially supported APIs exist on Windows for querying debug information from
+PDBs even without the user understanding the internals of the file format, a
+large ecosystem of tools has been built for Windows to consume this format.  In
+order for Clang to be able to generate programs that can interoperate with these
+tools, it is necessary for us to generate PDB files ourselves.
+
+At the same time, LLVM has a long history of being able to cross-compile from
+any platform to any platform, and we wish for the same to be true here.  So it
+is necessary for us to understand the PDB file format at the byte-level so that
+we can generate PDB files entirely on our own.
+
+This manual describes what we know about the PDB file format today.  The layout
+of the file, the various streams contained within, the format of individual
+records within, and more.
+
+We would like to extend our heartfelt gratitude to Microsoft, without whom we
+would not be where we are today.  Much of the knowledge contained within this
+manual was learned through reading code published by Microsoft on their `GitHub
+repo <https://github.com/Microsoft/microsoft-pdb>`__.
+
+.. _pdb_layout:
+
+File Layout
+===========
+
+.. toctree::
+   :hidden:
+   
+   MsfFile
+   PdbStream
+   TpiStream
+   DbiStream
+   ModiStream
+   PublicStream
+   GlobalStream
+   HashStream
+
+.. _msf:
+
+The MSF Container
+-----------------
+A PDB file is really just a special case of an MSF (Multi-Stream Format) file.
+An MSF file is actually a miniature "file system within a file".  It contains
+multiple streams (aka files) which can represent arbitrary data, and these
+streams are divided into blocks which may not necessarily be contiguously
+laid out within the file (aka fragmented).  Additionally, the MSF contains a
+stream directory (aka MFT) which describes how the streams (files) are laid
+out within the MSF.
+
+For more information about the MSF container format, stream directory, and
+block layout, see :doc:`MsfFile`.
+
+.. _streams:
+
+Streams
+-------
+The PDB format contains a number of streams which describe various information
+such as the types, symbols, source files, and compilands (e.g. object files)
+of a program, as well as some additional streams containing hash tables that are
+used by debuggers and other tools to provide fast lookup of records and types
+by name, and various other information about how the program was compiled such
+as the specific toolchain used, and more.  A summary of streams contained in a
+PDB file is as follows:
+
+--------------------+------------------------------+-------------------------------------------+
+| Name               | Stream Index                 | Contents                                  |
+====================+==============================+===========================================+
+| Old Directory      | - Fixed Stream Index 0       | - Previous MSF Stream Directory           |
+--------------------+------------------------------+-------------------------------------------+
+| PDB Stream         | - Fixed Stream Index 1       | - Basic File Information                  |
+|                    |                              | - Fields to match EXE to this PDB         |
+|                    |                              | - Map of named streams to stream indices  |
+--------------------+------------------------------+-------------------------------------------+
+| TPI Stream         | - Fixed Stream Index 2       | - CodeView Type Records                   |
+|                    |                              | - Index of TPI Hash Stream                |
+--------------------+------------------------------+-------------------------------------------+
+| DBI Stream         | - Fixed Stream Index 3       | - Module/Compiland Information            |
+|                    |                              | - Indices of individual module streams    |
+|                    |                              | - Indices of public / global streams      |
+|                    |                              | - Section Contribution Information        |
+|                    |                              | - Source File Information                 |
+|                    |                              | - FPO / PGO Data                          |
+--------------------+------------------------------+-------------------------------------------+
+| IPI Stream         | - Fixed Stream Index 4       | - CodeView Type Records                   |
+|                    |                              | - Index of IPI Hash Stream                |
+--------------------+------------------------------+-------------------------------------------+
+| /LinkInfo          | - Contained in PDB Stream    | - Unknown                                 |
+|                    |   Named Stream map           |                                           |
+--------------------+------------------------------+-------------------------------------------+
+| /src/headerblock   | - Contained in PDB Stream    | - Unknown                                 |
+|                    |   Named Stream map           |                                           |
+--------------------+------------------------------+-------------------------------------------+
+| /names             | - Contained in PDB Stream    | - PDB-wide global string table used for   |
+|                    |   Named Stream map           |   string de-duplication                   |
+--------------------+------------------------------+-------------------------------------------+
+| Module Info Stream | - Contained in DBI Stream    | - CodeView Symbol Records for this module |
+|                    | - One for each compiland     | - Line Number Information                 |
+--------------------+------------------------------+-------------------------------------------+
+| Public Stream      | - Contained in DBI Stream    | - Public (Exported) Symbol Records        |
+|                    |                              | - Index of Public Hash Stream             |
+--------------------+------------------------------+-------------------------------------------+
+| Global Stream      | - Contained in DBI Stream    | - Global Symbol Records                   |
+|                    |                              | - Index of Global Hash Stream             |
+--------------------+------------------------------+-------------------------------------------+
+| TPI Hash Stream    | - Contained in TPI Stream    | - Hash table for looking up TPI records   |
+|                    |                              |   by name                                 |
+--------------------+------------------------------+-------------------------------------------+
+| IPI Hash Stream    | - Contained in IPI Stream    | - Hash table for looking up IPI records   |
+|                    |                              |   by name                                 |
+--------------------+------------------------------+-------------------------------------------+
+
+More information about the structure of each of these can be found on the
+following pages:
+   
+:doc:`PdbStream`
+   Information about the PDB Info Stream and how it is used to match PDBs to EXEs.
+
+:doc:`TpiStream`
+   Information about the TPI stream and the CodeView records contained within.
+
+:doc:`DbiStream`
+   Information about the DBI stream and relevant substreams including the Module Substreams,
+   source file information, and CodeView symbol records contained within.
+
+:doc:`ModiStream`
+   Information about the Module Information Stream, of which there is one for each compilation
+   unit and the format of symbols contained within.
+
+:doc:`PublicStream`
+   Information about the Public Symbol Stream.
+
+:doc:`GlobalStream`
+   Information about the Global Symbol Stream.
+
+:doc:`HashStream`
+   Information about the Hash Table stream, and how it can be used to quickly look up records
+   by name.
+
+CodeView
+========
+CodeView is another format which comes into the picture.  While MSF defines
+the structure of the overall file, and PDB defines the set of streams that
+appear within the MSF file and the format of those streams, CodeView defines
+the format of **symbol and type records** that appear within specific streams.
+Refer to the pages on `CodeView Symbol Records` and `CodeView Type Records` for
+more information about the CodeView format.
--- a/docs/index.rst
+++ b/docs/index.rst
@ -274,6 +274,7 @@ For API clients and LLVM developers.
   Coroutines
   GlobalISel
   XRay
+   PDB/index

 :doc:`WritingAnLLVMPass`
   Information on how to write LLVM transformations and analyses.
@ -398,6 +399,9 @@ For API clients and LLVM developers.
 :doc:`XRay`
  High-level documentation of how to use XRay in LLVM.

+:doc:`The Microsoft PDB File Format <PDB/index>`
+  A detailed description of the Microsoft PDB (Program Database) file format.
+
 Development Process Documentation
 =================================