PDF file format: Basic structure [updated 2020] | Infosec Resources (2023)

We all know that there are a number of attacks where an attacker includes some shellcode in a PDF document. This shellcode uses some kind of vulnerability in how the PDF document is analyzed and presented to the user to execute malicious code on the targeted system.

The following image presents the number of vulnerabilities discovered in popular PDF reader Adobe Acrobat Reader DC, which was released in 2015 and became the only supported Acrobat Reader version after the end of support of Acrobat XI in October 2017. The number of vulnerabilities is increasing over the years. The most important vulnerabilities are the code execution vulnerabilities, which an attacker can use to execute arbitrary code on the target system (if the Acrobat Reader hasn’t been patched yet).

Figure 1: Adobe Acrobat Reader DC vulnerabilities

This is an important indicator that we should regularly update our PDF Reader, because the number of vulnerabilities discovered recently is quite daunting.

PDF file structure

Whenever we want to discover new vulnerabilities in software, we should first understand the protocol or file format in which we’re trying to discover new vulnerabilities. In our case, we should first understand the PDF file format in detail. In this article, we’ll take a look at the PDF file format and its internals.

PDF is a portable document format that can be used to present documents that include text, images, multimedia elements, web page links and more. It has a wide range of features. The PDF file format specification is publicly available here and can be used by anyone interested in PDF file format. There are almost 800 pages of the documentation for the PDF file format alone, so reading through that is not something to do on a whim.

PDF has more functions than just text: it can include images and other multimedia elements, be password protected, execute JavaScript and so on. The basic structure of a PDF file is presented in the picture below:

Figure 2: PDF structure

Every PDF document has the following elements:

Header

This is the first line of a PDF file and specifies the version of the PDF specification that the document uses. To find it, we can use a hex editor or simply the xxd command, as below:

[plain]
# xxd temp.pdf | head -n 1
0000000: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7 %PDF-1.3.%……
[/plain]

The temp.pdf PDF document uses the PDF specification 1.3. The ‘%’ character starts a comment in PDF, so the example above shows that both the first and the second line are comments, which is true for most PDF documents that contain binary data. The bytes 2550 4446 2d31 2e33 0a25 from the output above correspond to the ASCII text “%PDF-1.3”, a newline (0x0a, shown as ‘.’) and a second ‘%’. What follows are some non-printable bytes (note the ‘.’ dots), which are usually there to tell other software that the file contains binary data and shouldn’t be treated as 7-bit ASCII text. For PDF 1.x the version numbers are of the form 1.N, where N is in the range 0-7; the later ISO 32000-2 specification defines PDF 2.0.
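The header check described above can be sketched in a few lines of Python. This is a minimal sketch rather than a full parser: the function name `parse_pdf_version` is an illustrative assumption, and it only inspects the very first bytes of the file.

```python
import re

def parse_pdf_version(data: bytes) -> str:
    """Return the version string from a PDF header such as b'%PDF-1.3'."""
    # The header is expected at the very start of the file.
    match = re.match(rb"%PDF-(\d\.\d)", data)
    if match is None:
        raise ValueError("not a PDF header")
    return match.group(1).decode("ascii")

# First 16 bytes of temp.pdf, copied from the xxd output above.
header = bytes.fromhex("255044462d312e330a25c4e5f2e5eba7")
print(parse_pdf_version(header))  # 1.3
```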

Body

In the body of the PDF document, there are objects that typically include text streams, images, other multimedia elements, etc. The Body section is used to hold all the document’s data being shown to the user.

xref table

This is the cross-reference table, which contains the references to all the objects in the document. The purpose of a cross-reference table is to allow random access to objects in the file, so we don’t need to read the whole PDF document to locate a particular object. Each object is represented by one entry in the cross-reference table, which is always 20 bytes long. Let’s show an example:

[plain]
xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n
[/plain]

We can display the cross-reference table of the PDF document by simply opening the PDF with a text editor and scrolling to the bottom of the document. In the example above, we can see that we have four subsections (note the four lines that only contain two numbers). The first number in those lines is the object number of the first object in the subsection, while the second number states the number of objects in the subsection. Each object is represented by one entry, which is 20 bytes long (including the two-byte end-of-line sequence).

The first 10 bytes are the object’s offset from the start of the PDF document to the beginning of that object. What follows is a space separator with another number specifying the object’s generation number. After that, there is another space separator, followed by a letter “f” or “n” to indicate whether the object is free or in use.

The first object has an ID 0 and always contains one entry with generation number 65535 that is at the head of the list of free objects (note the letter “f” that means free). The last free entry in that list (object 24 in this example) links back to object number 0.

The second subsection has an object ID 3 and contains one element, the object 3 that starts at an offset 25324 bytes from the beginning of the document. The third subsection has four objects, the first of which has an ID 21 and starts at an offset 25518 from the beginning of the file. Other objects have the subsequent numbers 22, 23 and 24.
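The subsection layout described above can be turned into a small parser. The following is a minimal sketch, assuming the table is available as text and is well-formed; the name `parse_xref_table` is hypothetical, and the entry count in each subsection header is not validated.

```python
def parse_xref_table(text: str) -> dict:
    """Map object number -> (offset or next free object, generation, flag)."""
    entries = {}
    next_obj = None
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts == ["xref"]:
            continue
        if len(parts) == 2:
            # Subsection header: first object number and entry count.
            next_obj = int(parts[0])
        elif len(parts) == 3:
            # Entry: 10-digit number, 5-digit generation, 'f' or 'n'.
            entries[next_obj] = (int(parts[0]), int(parts[1]), parts[2])
            next_obj += 1
    return entries

table = """xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 4
0000025518 00002 n
0000025632 00000 n
0000000024 00001 f
0000000000 00001 f
36 1
0000026900 00000 n
"""
entries = parse_xref_table(table)
print(sorted(entries))  # [0, 3, 21, 22, 23, 24, 36]
print(entries[22])      # (25632, 0, 'n')
```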

All objects are marked with either an “f” or “n” flag. Flag “f” means that the object may still be present in a file, but is marked free, so it shouldn’t be used. These objects contain a reference to the next free object and the generation number to be used if the object becomes valid again. The flag “n” is used to represent valid and used objects that contain the offset from the beginning of the file and the generation number of the object.

Note that the object zero points to the next free object in the table, object 23. Since object 23 is also free, it itself points to the next free object in the table, object 24. But object 24 is the last free object on the file, so it’s pointing back to object zero. If we represent the above cross-reference table with every object number, it would look as follows:

[plain]
xref
0 1
0000000023 65535 f
3 1
0000025324 00000 n
21 1
0000025518 00002 n
22 1
0000025632 00000 n
23 1
0000000024 00001 f
24 1
0000000000 00001 f
36 1
0000026900 00000 n
[/plain]
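The chain of free objects described above (0 → 23 → 24 → 0) can be followed programmatically. This is a small sketch that assumes the entries have already been parsed into a dictionary of `(first field, generation, flag)` tuples; the function name `free_list` and that data layout are illustrative assumptions.

```python
def free_list(entries: dict) -> list:
    """Follow the linked list of free objects, starting at object 0."""
    chain = []
    current = entries[0][0]  # object 0 points to the first free object
    # Stop when the list links back to object 0 (or if it is malformed and loops).
    while current != 0 and current not in chain:
        chain.append(current)
        current = entries[current][0]  # first field of a free entry = next free object
    return chain

free_entries = {0: (23, 65535, "f"), 23: (24, 1, "f"), 24: (0, 1, "f")}
print(free_list(free_entries))  # [23, 24]
```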

The generation number of the object is incremented when the object is freed, so if the object becomes valid again (changes the flag from ‘f’ to ‘n’), the generation number is still valid without having to increment it. The generation number of object 23 is 1, so if it becomes valid again, the generation number will still be 1, but if it’s removed again, the generation number would increase to 2.

Multiple subsections are usually present in PDF documents that have been incrementally updated, otherwise only one subsection starting with the number zero should be present.

Trailer

The PDF trailer specifies how the application reading the PDF document should find the cross-reference table and other special objects. All PDF readers should start reading a PDF from the end of the file. An example trailer is presented below:
[plain]
trailer
<<
/Size 22
/Root 2 0 R
/Info 1 0 R
>>
startxref
24212
%%EOF
[/plain]

The last line of the PDF document contains the end-of-file marker “%%EOF”. Before the end-of-file marker, there is a startxref keyword followed by a line that specifies the offset from the beginning of the file to the cross-reference table. In our case the cross-reference table starts at offset 24212 bytes. Before that is the trailer keyword, which marks the start of the trailer section. The contents of the trailer section are enclosed within the << and >> characters (this is a dictionary that holds key-value pairs).
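Reading the file from the end, as described above, can be sketched as follows. This is a minimal sketch: the function name `find_startxref` and the 1024-byte search window are assumptions chosen for illustration, not values taken from the specification.

```python
import re

def find_startxref(data: bytes) -> int:
    """Return the offset that follows the last 'startxref' keyword."""
    tail = data[-1024:]  # assumed window; the keyword sits near the end of the file
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found near end of file")
    return int(matches[-1])

sample = (b"trailer\n<<\n/Size 22\n/Root 2 0 R\n/Info 1 0 R\n>>\n"
          b"startxref\n24212\n%%EOF\n")
print(find_startxref(sample))  # 24212
```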

We can see that the trailer section defines several keys, each of them for a particular action. The trailer section can specify the following keys:

  • /Size [integer]: Specifies the number of entries in the cross-reference table (counting the objects in updated sections as well). The used number shouldn’t be an indirect reference.
  • /Prev [integer]: Specifies the offset from the beginning of the file to the previous cross-reference section, which is used if there are multiple cross-reference sections. The value is a direct integer, not an indirect reference.
  • /Root [dictionary]: Specifies the reference object for the document catalog object, which is a special object that contains various pointers to different kinds of other special objects (more about that later).
  • /Encrypt [dictionary]: Specifies the document’s encryption dictionary.
  • /Info [dictionary]: Specifies the reference object for the document’s information dictionary.
  • /ID [array]: Specifies an array of two unencrypted byte strings that form a file identifier.
  • /XRefStm [integer]: Specifies the offset from the beginning of the file to a cross-reference stream. This is only present in hybrid-reference files, which remain readable by applications that don’t support cross-reference streams.

We must remember that the initial structure can be modified if we update the PDF document at a later time. The update usually appends additional elements to the end of the file.

Incremental updates

The PDF has been designed with incremental updates in mind, since we can append some objects to the end of the PDF file without rewriting the entire file. Because of this, changes to a PDF document can be saved quickly. The new structure of the PDF document can be seen in the picture below:

Figure 3: PDF structure

We can see that the PDF document still contains the original header, body, cross-reference table and the trailer. Additionally, there are also other body, cross-reference and trailer sections that were added to the PDF document. The additional cross-reference sections will contain only the entries for objects that have been changed, replaced or deleted. Deleted objects will stay in the file, but will be marked with an “f” flag. Each trailer needs to be terminated by the “%%EOF” tag and should contain the /Prev entry, which points to the previous cross-reference section.
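The /Prev chain described above can be followed from the newest cross-reference section back to the original one. The sketch below assumes classic (uncompressed) trailers whose dictionaries contain no nested dictionaries; `xref_chain` and `add_revision` are hypothetical helper names, and `add_revision` builds a toy file with the xref entries left out, purely to exercise the chain.

```python
import re

TRAILER = re.compile(rb"trailer\s*<<(.*?)>>", re.S)

def xref_chain(data: bytes) -> list:
    """Return xref section offsets, newest first, by following /Prev."""
    start = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if start is None:
        raise ValueError("startxref not found")
    offsets = []
    offset = int(start.group(1))
    while offset not in offsets:  # stop if /Prev ever loops
        offsets.append(offset)
        trailer = TRAILER.search(data, offset)
        if trailer is None:
            break
        prev = re.search(rb"/Prev\s+(\d+)", trailer.group(1))
        if prev is None:
            break  # reached the original revision
        offset = int(prev.group(1))
    return offsets

def add_revision(data: bytes, body: bytes, prev=None) -> bytes:
    """Append a body plus a toy xref/trailer section (entries omitted)."""
    data += body
    xref_at = len(data)
    prev_key = b"" if prev is None else b" /Prev %d" % prev
    data += b"xref\ntrailer\n<< /Size 2" + prev_key + b" >>\n"
    return data + b"startxref\n%d\n%%%%EOF\n" % xref_at

rev1 = add_revision(b"%PDF-1.4\n", b"1 0 obj\n(old)\nendobj\n")
first = xref_chain(rev1)[0]
rev2 = add_revision(rev1, b"1 0 obj\n(new)\nendobj\n", prev=first)
print(xref_chain(rev2)[1] == first)  # True
```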

In PDF versions 1.4 and higher, we can specify the version entry in the document’s catalog dictionary to override the default version from the PDF header.

Example

Let’s download a sample PDF document from here and analyze it. Upon opening, this PDF document looks as shown below:

Figure 4: PDF document sample

The cross-reference and trailer sections are presented in the picture below:

Figure 5: Cross-reference and trailer sections

The cross-reference section has been reduced for clarity. The cross-reference section contains one subsection that itself contains 223 objects. The trailer section starts at byte offset 50291, includes 223 objects where the root element points to object 221 and the info element points to object 222.

In the next section, we’ll take a look at the PDF structure’s basic data types.

PDF data types

The PDF document contains eight basic types of objects described below. These types are: booleans, numbers, strings, names, arrays, dictionaries, streams and the null object. Objects may be labeled so that they can be referenced by other objects. A labeled object is also called an indirect object.

Booleans

There are two keywords: true and false that represent the boolean values.

Numbers

There are two types of numbers in a PDF document: integer and real. An integer consists of one or more digits optionally preceded by a plus or minus sign. An example of integer objects may be seen below:

  • 123 +123 -123

The real value can be represented with one or more digits, with an optional sign and a leading, trailing or embedded decimal point (a period). An example of real numbers can be seen below:

  • 123.0 -123.0 +123.0 123. -.123

Names

The names in PDF documents are represented by a sequence of ASCII characters in the range 0x21 – 0x7E. The exceptions are the delimiter characters %, (, ), <, >, [, ], {, }, / and the number sign #, which cannot appear literally in a name and must instead be written as their two-digit hexadecimal code preceded by the character “#” (for example, #23 for “#” itself). Any other character may optionally be written the same way. The length of a name is limited to 127 bytes.

When writing a name, a slash must be used to introduce a name; the slash is not part of the name but is a prefix indicating that what follows is a sequence of characters representing the name. If we want to use whitespace or any other special character as part of the name, it must be encoded with two-digit hexadecimal notation.

The examples of names can be seen in the table below:

Figure 6: PDF names (source)
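The “#” hexadecimal notation for names can be decoded with a short helper. This is a minimal sketch; the function name `decode_name` is an illustrative assumption, and the input is expected to be a single name token including its leading slash.

```python
import re

def decode_name(token: bytes) -> bytes:
    """Decode a PDF name token such as b'/A#20B' into its raw bytes."""
    if not token.startswith(b"/"):
        raise ValueError("a name must start with a slash")
    # Replace each #xx escape with the single byte it encodes.
    return re.sub(rb"#([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  token[1:])

print(decode_name(b"/Lime#20Green"))  # b'Lime Green'
print(decode_name(b"/A#42"))          # b'AB'
```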

Strings

Strings in a PDF document are represented as a series of bytes surrounded by parentheses or angle brackets, and can be a maximum of 65535 bytes long. Any character may be represented by its ASCII representation, or alternatively with octal or hexadecimal representations. Octal representation requires the character to be written in the form \ddd, where ddd is an octal number. Hexadecimal representation requires the character to be written in the form <dd>, where dd is a hexadecimal number.

An example of representing a string embedded in parentheses can be seen below:

  • (mystring)

An example of representing a string embedded in angle brackets can be seen below (the hexadecimal representation below is the same as above and it reads as “mystring”):

  • <6d79737472696e67>

We can also use special well-known escape sequences when representing a string. Those are: \n for new line, \r for carriage return, \t for horizontal tabulator, \b for backspace, \f for form feed, \( for left parenthesis, \) for right parenthesis and \\ for backslash.
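To check that a hexadecimal string really reads as “mystring”, it can be decoded as in the sketch below. The function name `decode_hex_string` is an assumption for illustration. The sketch follows two rules of the hexadecimal string syntax: whitespace between digits is ignored, and a missing final digit is treated as 0.

```python
def decode_hex_string(token: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<6d79>' into bytes."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("expected a string of the form <...>")
    digits = "".join(token[1:-1].split())  # whitespace inside is ignored
    if len(digits) % 2:
        digits += "0"  # an odd final digit is padded with 0
    return bytes.fromhex(digits)

print(decode_hex_string("<6d79737472696e67>"))  # b'mystring'
```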

Arrays

Arrays in PDF documents are represented as a sequence of PDF objects enclosed in square brackets. The objects may be of different types, so an array can hold any mix of numbers, strings, dictionaries and even other arrays. An array may also have zero elements. An example of an array is presented below:

  • [123 123.0 true (mystring) /myname]

Dictionaries

Dictionaries in a PDF document are represented as a table of key/value pairs. The key must be a name object, whereas the value can be any object, including another dictionary. A dictionary can hold a maximum of 4096 entries. A dictionary is written with its entries enclosed in double angle brackets << and >>. An example of a dictionary is presented below:
[plain]
<< /mykey1 123
   /mykey2 0.123
   /mykey3 << /mykey4 true
              /mykey5 (mystring)
           >>
>>
[/plain]

Streams

A stream object is represented by a sequence of bytes and may be unlimited in length, which is why images and other big data blocks are usually represented as streams. A stream object consists of a dictionary object followed by the keyword stream, an end-of-line marker, the stream data and the keyword endstream.

An example of a stream object can be seen below:
[plain]
<<
/Type /Page
/Length 23 0 R
/Filter /LZWDecode
>>
stream
endstream
[/plain]
All stream objects shall be indirect objects and the stream dictionary shall be a direct object. The stream dictionary specifies the exact number of bytes of the stream. After the data there should be a newline and the endstream keyword.

Common keywords used in stream dictionaries are the following (note that the Length entry is mandatory; the last three entries, N, First and Extends, apply only to object streams):

  • Length: How many bytes of the PDF file are used for the stream’s data. If the stream contains a Filter entry, the Length shall specify the number of bytes of encoded data.
  • Type: The type of the PDF object that the dictionary describes.
  • Filter: The name of the filter that will be applied in processing the stream data. Multiple filters can be specified in the order in which they shall be applied.
  • DecodeParms: A dictionary or an array of dictionaries used by the filters specified by Filter. This value specifies the parameters that need to be passed to the filters when they are applied. This isn’t necessary if the filters use the default values.
  • F: Specifies the file containing the stream data.
  • FFilter: The name of the filter to be applied in processing the data found in the stream’s external file.
  • FDecodeParms: A dictionary or an array of dictionaries used by the filters specified by FFilter.
  • DL: Specifies the number of bytes in the decoded stream. This can be used to determine whether enough disk space is available to write a stream to a file.
  • N: The number of indirect objects stored in the stream.
  • First: The offset in the decoded stream of the first compressed object.
  • Extends: Specifies a reference to other object streams, which form an inheritance tree.

The stream data in an object stream starts with N pairs of integers, where the first integer represents the object number and the second integer represents the byte offset of that object in the decoded stream, relative to the first object. The objects in object streams are stored consecutively and don’t need to be stored in increasing order of object number. The First entry in the dictionary gives the byte offset of the first object in the decoded stream.
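The layout of an object stream can be illustrated with a small splitter that works on already-decoded stream data. This is a sketch under stated assumptions: the function name `split_object_stream` is hypothetical, the values of N and First are passed in by the caller, and the offsets in the header are assumed to be in increasing order.

```python
def split_object_stream(decoded: bytes, n: int, first: int) -> dict:
    """Map object number -> raw object bytes from a decoded object stream."""
    # The header holds N pairs: object number, offset relative to First.
    numbers = [int(tok) for tok in decoded[:first].split()[: 2 * n]]
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    body = decoded[first:]
    objects = {}
    for i, (obj_num, offset) in enumerate(pairs):
        # Each object runs until the next object's offset (or the end of the data).
        end = pairs[i + 1][1] if i + 1 < len(pairs) else len(body)
        objects[obj_num] = body[offset:end].strip()
    return objects

# Toy decoded stream: two objects (11 and 12), header is 10 bytes long.
decoded = b"11 0 12 4 123 (hi)"
print(split_object_stream(decoded, n=2, first=10))
# {11: b'123', 12: b'(hi)'}
```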

We shouldn’t store the following information in an object stream:

  • Stream objects
  • Objects with a generation number that is not equal to zero
  • Document’s encryption dictionary
  • Indirect object of the Length entry in object stream dictionary
  • Document catalog, linearization dictionary, page objects

In PDF 1.5, cross-reference information may be stored in a cross-reference stream instead of in a cross-reference table. Each cross-reference stream contains the information equivalent to the cross-reference table and trailer.

Null object

The null object is represented by a keyword “null.”

Indirect objects

First of all, we must know that any object in a PDF document can be labeled as an indirect object. This gives the object a unique object identifier, which other objects can use to reference the indirect object. An indirect object is a numbered object represented with keywords “obj” and “endobj.” The endobj must be present in its own line, but the obj must occur at the end of the object ID line, which is the first line of the indirect object. The object ID line consists of object number, generation number and keyword “obj.” An example of an indirect object is as follows:
[plain]
2 1 obj
12345
endobj
[/plain]

In the example above, we’re creating a new indirect object, which holds the integer object 12345. By declaring an object as an indirect object, we are able to list it in the PDF document’s cross-reference table and reuse it from any page, dictionary and so on in the document. Since every indirect object has its own entry in the cross-reference table, indirect objects can be accessed very quickly.

The object identifier of the indirect object consists of two parts; the first part is the object number of the current indirect object. The indirect objects don’t need to be numbered sequentially in the PDF document. The second part is the generation number, which is set to zero for all objects in a newly-created file. This number is incremented when an object is freed, so that a later object reusing the same object number gets a new generation number.

We can refer to the indirect objects with indirect reference, which consists of the object number, the generation number and the keyword R. To reference the above indirect object, we must write something like below:

  • 2 1 R

If we’re trying to reference an undefined object, we’re actually referring to a null object.
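Indirect object definitions can be located in raw file data with a regular expression. The following is a naive sketch, not a real tokenizer: `find_indirect_objects` is a hypothetical name, and the approach assumes that the text “endobj” never occurs inside an object’s data (which binary streams can violate).

```python
import re

# object number, generation number, 'obj', body, 'endobj'
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)

def find_indirect_objects(data: bytes) -> dict:
    """Map (object number, generation) -> raw body of each indirect object."""
    return {(int(m.group(1)), int(m.group(2))): m.group(3).strip()
            for m in OBJ.finditer(data)}

sample = b"2 1 obj\n12345\nendobj\n"
print(find_indirect_objects(sample))  # {(2, 1): b'12345'}
```

An indirect reference such as 2 1 R would then be resolved by looking up the key (2, 1) in the returned dictionary.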

Document structure

A PDF document consists of objects contained in the body section of a PDF file. Most of the objects in a PDF document are dictionaries. Each page of the document is represented by a page object, which is a dictionary that includes references to the page’s contents. Page objects are connected together and form a page tree, which is declared with an indirect reference in the document catalog.

The whole structure of the PDF document can be represented with the picture below [1]:

Figure 7: Structure of the PDF document (source)

In the picture above, we can see that the document catalog contains references to the page tree, outline hierarchy, article threads, named destinations and interactive form. We won’t go into details what each of those sections do, but we’ll present just the most important section, the Page Tree.

Document catalog

From the picture above, we can see that the Document Catalog is the root of the objects in the PDF document. We’ve already said that it is the /Root element in the Trailer PDF section that specifies the document catalog. The document catalog contains references to other objects that define the document’s contents. It also contains the information that declares how the document will be displayed on the screen. The entries in the document catalog are as follows:

  • /Type: The type of the PDF object the dictionary describes (in our case, this is Catalog, since this is the document catalog object).
  • /Version: The version of the PDF specification the document was built against.
  • /Extensions: Information about the developer extensions in this document.
  • /Pages: An indirect reference to the object that is the root of a document’s page tree.
  • /Dests: an indirect reference to the object that is the root of the named destinations object.
  • /Outlines: an indirect reference to the outline dictionary object that is the root of the document’s outline hierarchy.
  • /Threads: an indirect reference to the array of thread dictionaries that represent the document’s article threads.
  • /Metadata: an indirect reference to the metadata stream that contains metadata for the document.

There are many other entries that we can see being part of the document catalog, but won’t describe them here. The reader can take a look at our sources for details. An example of the document catalog is presented below:
[plain]
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
/PageMode /UseOutlines
/Outlines 3 0 R
>>
endobj
[/plain]

Page tree

The pages of the document are accessed through the page tree, which defines all the pages in the PDF document. The tree contains nodes that represent pages of the PDF document, which can be of two types: intermediate and leaf nodes. Intermediate nodes are also called page tree nodes, while the leaf nodes are called page objects.

The simplest page tree structure can consist of a single page tree node that references all of the page objects directly (so all of the page objects are leafs).

Each node in a page tree has to have the following entries:

  • /Type: The type of the PDF object this object describes (in our case it’s Pages, since we’re talking about page tree nodes).
  • /Parent: Should be present in all page tree nodes except in root, where this entry mustn’t be present. This entry specifies its parent.
  • /Kids: Should be present in all page tree nodes except in leafs and specifies all the child elements directly accessible from the current node.
  • /Count: Specifies the number of leaf nodes that are descendants of this node in the subsequent page tree.

We must remember that the structure of the page tree is unrelated to the logical structure of the PDF document; page tree nodes don’t represent chapters, sections or similar units.

A basic example of a page tree can be seen below:
[plain]
2 0 obj
<< /Type /Pages
/Kids [ 4 0 R
10 0 R
24 0 R
]
/Count 3
>>
endobj
4 0 obj
<< /Type /Page
>>
endobj
10 0 obj
<< /Type /Page
>>
endobj
24 0 obj
<< /Type /Page
>>
endobj
[/plain]
The page tree above defines the root object with the ID of 2, which has three children, objects 4, 10 and 24. We can also see that the leaves of the page tree are dictionaries specifying the attributes of a single page of the document. There are multiple attributes that we can use when defining each document page.
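Walking a page tree to collect its leaf pages is a simple recursion over /Kids. The sketch below models the objects above as plain Python dictionaries keyed by object number; that representation and the name `collect_pages` are assumptions for illustration, not how a PDF library stores them.

```python
def collect_pages(objects: dict, node_id: int) -> list:
    """Return the object numbers of all leaf pages below node_id, in order."""
    node = objects[node_id]
    if node["Type"] == "Page":
        return [node_id]  # leaf node: a page object
    pages = []
    for kid in node["Kids"]:  # intermediate node: recurse into children
        pages.extend(collect_pages(objects, kid))
    return pages

objects = {
    2: {"Type": "Pages", "Kids": [4, 10, 24], "Count": 3},
    4: {"Type": "Page"},
    10: {"Type": "Page"},
    24: {"Type": "Page"},
}
pages = collect_pages(objects, 2)
print(pages)                              # [4, 10, 24]
print(len(pages) == objects[2]["Count"])  # True
```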

We’ve seen the basic structure of the PDF document and its data types. If we want to start finding vulnerabilities in PDF readers, we need to change the PDF document in such a way that the PDF reader can’t handle it and crashes. A crash doesn’t always mean an exploitable flaw, but it often points to a bug that may turn out to be a security vulnerability, which an attacker could use to execute arbitrary code on the target machine.

An example

In this section we’ll take a look at a very simple example of a PDF document. First we need to create the PDF document so that we can then analyze it. To create a PDF document, let’s first create a very simple .tex document that contains what can be seen in the picture below:

Figure 8: Simple document

We can see that the .tex document doesn’t really contain much. First, we’re defining the document to be an article and then including the contents of the article inside the begin and end document. We’re including a new section with a title (Introduction) and including the static text “Hello World!”.

We can compile the .tex document into a PDF document by running the pdflatex command with the name of the .tex file as an argument. The resulting PDF is shown in the picture below:

Figure 9: Result

We can see that the PDF document really doesn’t contain very much, only the text we’ve actually included and no pictures, JavaScript or other elements.

Example 1

Let’s take a look at the PDF document structure, which is presented in the output below:
[plain]
%PDF-1.5
%ÐÔÅØ
3 0 obj <<
/Length 138
/Filter /FlateDecode
>>
stream
...
endstream
endobj
10 0 obj <<
/Length1 1526
/Length2 7193
/Length3 0
/Length 8194
/Filter /FlateDecode
>>
stream
...
endstream
endobj
12 0 obj <<
/Length1 1509
/Length2 9410
/Length3 0
/Length 10422
/Filter /FlateDecode
>>
stream
...
endstream
endobj
15 0 obj <<
/Producer (pdfTeX-1.40.12)
/Creator (TeX)
/CreationDate (D:20121012175007+02'00')
/ModDate (D:20121012175007+02'00')
/Trapped /False
/PTEX.Fullbanner (This is pdfTeX, Version 3.1415926-2.3-1.40.12 (TeX Live 2011) kpathsea version 6.0.1)
>> endobj
6 0 obj <<
/Type /ObjStm
/N 10
/First 65
/Length 761
/Filter /FlateDecode
>>
stream
...
endstream
endobj
16 0 obj <<
/Type /XRef
/Index [0 17]
/Size 17
/W [1 2 1]
/Root 14 0 R
/Info 15 0 R
/ID [<1DC2E3E09458C9B4BEC8B67F56B57B63> <1DC2E3E09458C9B4BEC8B67F56B57B63>]
/Length 60
/Filter /FlateDecode
>>
stream
...
endstream
endobj
startxref
20215
%%EOF
[/plain]
Quite a lot of elements are needed to create even such a simple PDF document, so we can imagine how a really complicated PDF document would look. We also need to remember that all the encoded data streams were removed and replaced with three dots for clarity and brevity.

Let’s present each of the PDF sections. The header can be seen in the picture below:

Figure 10: PDF header

The body can be seen in the picture below:

Figure 11: PDF body

The xref section can be seen in the picture below:

Figure 11: PDF xref

And last, the Trailer section is represented below:

Figure 12: PDF trailer

We presented all of the sections of the PDF document, but we still have to analyze them further. The header of the PDF document is standard and we don’t really need to talk about it, and let’s leave the body section for later.

This is why we must first take a look at the xref section. We can see that the offset from the beginning of the file to the xref table is 20215 bytes, which in hexadecimal form is 0x4ef7. If we take a look at the hexadecimal representation of the file as we can get with the xxd tool, we can see what’s presented in the picture below:

Figure 13: Hexadecimal representation of the file

The highlighted bytes lie exactly at offset 20215 from the beginning of the file. The preceding 0x0a byte is a newline, and the 0x31 byte at that offset is the ASCII digit “1”, the first character of “16 0 obj”, which is exactly where the cross-reference data starts. The cross-reference information is therefore stored in an indirect object with ID 16 and generation number 0. (A generation number of zero is expected for all objects, since we just created the PDF document and none of the objects have been changed yet. If we look at the whole PDF document we can see that this is clearly true; all objects have a generation number zero.)

The /Type of the indirect object classifies it as a cross-reference stream (XRef). The /Index array contains a pair of integers for each subsection in this section. The first integer specifies the first object number in the subsection and the second integer specifies the number of entries in the subsection. In our example, the first object number is zero and there are 17 entries in this subsection. This is also specified by the /Size entry. Note that this number is one larger than the largest object number in the subsection. The /W entry specifies an array of integers giving the size in bytes of each field in a cross-reference entry; [1 2 1] means that the three fields are one byte, two bytes and one byte long.

After that, there is the /Root element that specifies the document catalog dictionary of the PDF document to be object number 14. The /Info entry is the PDF document’s information dictionary, which is contained in object number 15. The /ID array contains two byte strings that constitute a file identifier. It is required only when an /Encrypt entry is present, in which case the identifier is used as input to the encryption key computation; this document has no /Encrypt entry.

The /Length and /Filter entries are ordinary stream dictionary keys and are not related to encryption. The /Length specifies the number of bytes of encoded stream data, which is 60 bytes in our case. The /Filter specifies the filter needed to decode the stream data; in our case, this is FlateDecode, which means the data is compressed with the zlib/deflate method.
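The /W [1 2 1] layout described above can be decoded by hand once the stream data has been inflated. The sketch below makes several assumptions: the function name `parse_xref_stream` is hypothetical, the stream uses plain FlateDecode without a predictor (there is no /DecodeParms in this example), and no field width is zero. In a cross-reference stream the first field is the entry type: 0 for a free object, 1 for an object at a byte offset, and 2 for an object stored inside an object stream.

```python
import zlib

def parse_xref_stream(raw: bytes, widths=(1, 2, 1)) -> list:
    """Inflate a cross-reference stream and split it into (type, f2, f3) tuples."""
    data = zlib.decompress(raw)
    size = sum(widths)  # bytes per entry, 4 for W [1 2 1]
    entries = []
    for pos in range(0, len(data) - size + 1, size):
        fields = []
        offset = pos
        for width in widths:
            # Each field is an unsigned big-endian integer.
            fields.append(int.from_bytes(data[offset:offset + width], "big"))
            offset += width
        entries.append(tuple(fields))
    return entries

# Toy data: a free entry, an object at offset 15, and an object
# stored as item 3 of object stream 6.
raw = zlib.compress(bytes([0, 0, 0, 255,  1, 0, 15, 0,  2, 0, 6, 3]))
print(parse_xref_stream(raw))  # [(0, 0, 255), (1, 15, 0), (2, 6, 3)]
```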

We can see that the other part of the xref table is compressed, so we can’t really read that. We could, of course, apply some zlib decompression algorithm over the compressed data, but there’s a better option. Why would we write a program for that if a tool already exists? With pdftk, we can repair a PDF’s corrupted xref table with the following command:

  • # pdftk in.pdf output out.pdf

After that, the out.pdf file contains the following xref and trailer sections:

Figure 14: xref and trailer

Clearly, the /Root and /Info object numbers have changed and other stuff as well, but we got the trailer and xref keywords that define the xref table. We can see that there are 14 objects in the xref table.

We could go on and try to decode other sections as well, but this is out of the scope of this article. Next, we’ll check the document that isn’t encoded.

Example 2

Let’s take a look at the sample PDF document that is accessible here. Some of the stream objects are encoded, but those aren’t so important now. Since we already know how to handle PDF documents, we won’t spend too many words on simple stuff.

Let’s open that PDF in a text editor like gvim and check out the trailer section. We must know by now that all PDF documents should be read from the end to the start. The trailer is represented in the picture below:

Figure 15: PDF trailer

Let’s also present the Xref with just a few objects (the rest of them were discarded for clarity):

Figure 16: PDF xref

We can see that the /Root of the PDF document is contained in the object with ID 221 and there is additional information in object 222. The object 221 is the most important object in the whole document, so let’s present it:

Figure 17: Object 221

We can see that the object is indeed the Document Catalog. The Page Tree object is 212, the Outlines object is 213, the Names object is 220 and the OpenAction object is 58. We haven’t talked about any other types than the Page Tree object, so we’ll continue with the Page Tree talk only.

The Page Tree object with an ID 212 is represented in the picture below:

Figure 18: Page Tree object

Object 212 is the root of the page tree and holds the actual pages of the PDF document. It contains 10 pages, which is correct (we can verify this by opening the PDF file in any PDF reader and checking the number of pages).

We know that the Kids attribute specifies all the child elements directly accessible from the current node. In our case, there are two direct child nodes with object IDs 66 and 135. Object 66 is presented below:

Figure 19: Object 66

Object 66 in turn contains child elements with IDs 57, 69, 75, 97, 108 and 120.

Figure 20: Object 135

Object 135 further defines objects 129, 138, 133 and 158.

If we count all of these elements, there are exactly 10, which matches the 10 pages of the document. This implies that all of the listed objects are the actual pages of the PDF document and don’t have any further child nodes.

All of these objects are declared similarly, so we won’t look at each of them in turn. Instead, we’ll take a look at just one of them, object 57. Object 57 is declared as follows:
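The counting argument can be checked with a small recursive walk over the page tree. The sketch below models the objects as plain Python dictionaries: the object numbers and the Kids lists are the ones given above, while the dictionary representation and the function name collect_pages are assumptions made for illustration (a real parser would build these from the file).

```python
# Kids lists as described in the text.
KIDS_66 = [57, 69, 75, 97, 108, 120]
KIDS_135 = [129, 138, 133, 158]

# Simplified model: object number -> parsed dictionary.
objects = {
    212: {"Type": "Pages", "Kids": [66, 135], "Count": 10},
    66: {"Type": "Pages", "Kids": KIDS_66},
    135: {"Type": "Pages", "Kids": KIDS_135},
}
for number in KIDS_66 + KIDS_135:
    objects[number] = {"Type": "Page"}  # leaf nodes


def collect_pages(objs: dict, node_id: int) -> list:
    """Return the leaf /Page object numbers below node_id, in document order."""
    node = objs[node_id]
    if node["Type"] == "Page":
        return [node_id]
    pages = []
    for kid in node["Kids"]:
        pages.extend(collect_pages(objs, kid))
    return pages


pages = collect_pages(objects, 212)
print(len(pages))  # 10
```

The number of leaves found by the walk equals the page count stored in the root node, which is the consistency check made by hand above.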

Figure 21: Object 57

We can see that the object’s type is /Page, which means that this is a leaf node representing one of the pages of the PDF document. The contents of that page can be found in object 62:

Figure 22: Object 62

We can see that the actual content of the PDF page is encoded with the FlateDecode filter, which is simply zlib (deflate) compression.
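Since FlateDecode is zlib compression, a stream body can be decoded with Python’s standard zlib module. The sketch below is a round trip on a hypothetical content stream, not the actual bytes of object 62, and the function name decode_flate is an assumption. It also assumes the stream uses no predictor (no /DecodeParms entry), which would require an extra step.

```python
import zlib

# Hypothetical page content stream: draw the text "Hello, PDF".
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET"

# What a PDF writer would store between 'stream' and 'endstream'.
compressed = zlib.compress(content)


def decode_flate(stream_bytes: bytes) -> bytes:
    """Decode the body of a stream whose /Filter is /FlateDecode."""
    return zlib.decompress(stream_bytes)


print(decode_flate(compressed) == content)  # True
```

To decode a real stream, pass in exactly the bytes between the stream and endstream keywords, using the /Length value from the stream dictionary to find where they end.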

Conclusion

We’ve seen two examples of how PDF documents can be constructed. With this knowledge, we can start generating malformed PDF documents and feeding them to various PDF readers. If a PDF reader crashes while reading a particular document, that document contains something the reader couldn’t handle. This suggests a possible vulnerability, which would need to be studied further.

In the end, if the vulnerability proves to be real, an attacker could write a PDF document containing malicious code that is executed when the victim opens the document with the vulnerable PDF reader. In such cases, the whole machine might be compromised, since arbitrary malicious code can be executed just by opening a malicious PDF document.

Sources

Vulnerability Statistics, CVE Details

Adobe Support Policies: Supported Product Versions, Adobe

Document management — Portable document format — Part 1: PDF 1.7, Adobe (Archive.org)

References:

[1]: The PDF File Format, accessible on: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf.

