The third forensics challenge starts with the following text:
see or do not see
Written by marc
Hacking PDFs, what fun!
In general, when reverse-engineering malicious documents, we follow these steps:
1. We search for malicious embedded code (shellcode, JavaScript).
2. We extract any suspicious code segments.
3. If we see shellcode, we disassemble or debug it. If we see JavaScript (or ActionScript or VB macro code), we try to examine it.
However, this problem turned out to be very simple...
Finding the Flag in 10 Seconds
Yep, it's this easy:
1. Download the PDF file.
2. Open it in any PDF viewer.
3. CTRL+A (select all the content).
4. You see the flag!
OK, we were lucky. Keep reading if you think this was too easy.
Analyzing the ID and the Streams in a PDF File
Let's suppose we had no clue that the flag was just text in the file. In this case, we would want to examine the file's structure. For this task, we use the PDF Tools suite, which is written in Python.
pdfid
We start with pdfid.py, which parses the PDF looking for certain keywords. We download and unzip the script, and then we make it executable:
$ unzip pdfid_v0_1_2.zip
$ chmod a+x pdfid.py
Running it on our file gives:
$ ./pdfid.py pdf.pdf
PDFiD 0.1.2 pdf.pdf
PDF Header: %PDF-1.3
obj 20
endobj 19
stream 10
endstream 10
xref 1
trailer 1
startxref 1
/Page 1
/Encrypt 0
/ObjStm 0
/JS 0
/JavaScript 0
/AA 0
/OpenAction 0
/AcroForm 0
/JBIG2Decode 0
/RichMedia 0
/Launch 0
/EmbeddedFile 0
/XFA 0
/Colors > 2^24 0
All right, no funny stuff going on here: the counts for /JS, /JavaScript, /OpenAction, /Launch, and /EmbeddedFile are all zero. We need to look deeper into each of the 10 streams.
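To make the idea behind this keyword count concrete, here is a minimal Python sketch of the same kind of triage. It is not pdfid.py's actual implementation: the function name count_suspicious_names is our own, the name list is simply copied from the output above, and the sketch ignores obfuscated names and does not look inside compressed streams.

```python
import re

# PDF names taken from the pdfid.py output above (assumption: this subset
# is enough for a first look; it is not the tool's full logic).
SUSPICIOUS_NAMES = [
    b"/Encrypt", b"/ObjStm", b"/JS", b"/JavaScript", b"/AA",
    b"/OpenAction", b"/AcroForm", b"/JBIG2Decode", b"/RichMedia",
    b"/Launch", b"/EmbeddedFile", b"/XFA",
]


def count_suspicious_names(data: bytes) -> dict:
    """Count how often each name appears in the raw bytes of a PDF."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # The negative lookahead stops a short name from matching the
        # prefix of a longer one (e.g. /AA inside a hypothetical /AAB).
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

For a file on disk, something like count_suspicious_names(open("pdf.pdf", "rb").read()) would give a rough equivalent of the lower half of the table above.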
pdf-parser
We download pdf-parser.py, which is used to search for all the fundamental elements in a PDF file. Let's take a closer look:
$ unzip pdf-parser_V0_4_3.zip
$ chmod a+x pdf-parser.py
$ ./pdf-parser.py
Usage: pdf-parser.py [options] pdf-file|zip-file|url
pdf-parser, use it to parse a PDF document
Options:
  --version             show program's version number and exit
  -s SEARCH, --search=SEARCH
                        string to search in indirect objects (except streams)
  -f, --filter          pass stream object through filters (FlateDecode,
                        ASCIIHexDecode, ASCII85Decode, LZWDecode and
                        RunLengthDecode only)
  -o OBJECT, --object=OBJECT
                        id of indirect object to select (version independent)
  -r REFERENCE, --reference=REFERENCE
                        id of indirect object being referenced (version
                        independent)
  -e ELEMENTS, --elements=ELEMENTS
                        type of elements to select (cxtsi)
  -w, --raw             raw output for data and filters
  -a, --stats           display stats for pdf document
  -t TYPE, --type=TYPE  type of indirect object to select
  -v, --verbose         display malformed PDF elements
  -x EXTRACT, --extract=EXTRACT
                        filename to extract malformed content to
  -H, --hash            display hash of objects
  -n, --nocanonicalizedoutput
                        do not canonicalize the output
  -d DUMP, --dump=DUMP  filename to dump stream content to
  -D, --debug           display debug info
  -c, --content         display the content for objects without streams or
                        with streams without filters
  --searchstream=SEARCHSTREAM
                        string to search in streams
  --unfiltered          search in unfiltered streams
  --casesensitive       case sensitive search in streams
  --regex               use regex to search in streams
Very interesting! We run it with our file, searching for the string /ProcSet:
$ ./pdf-parser.py pdf.pdf | grep /ProcSet
/ProcSet [ /ImageC /Text /PDF /ImageI /ImageB ]
Awesome! The /Text entry in the procedure set suggests that the page draws text, so even though we didn't see any text when we opened the file in the PDF viewer, there is text somewhere!
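If we wanted to hunt for that text by hand, one option is to inflate each stream and look for text-showing operators. The Python sketch below illustrates the idea under several assumptions that the challenge file may not satisfy: streams use only FlateDecode (or no filter at all), text is shown with the simple Tj operator, and strings contain no escaped parentheses. The function names inflate_streams and shown_text are ours, not part of any of the tools above.

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Literal strings handed to the Tj operator, e.g. "(hello) Tj".
TJ_RE = re.compile(rb"\((.*?)\)\s*Tj")


def inflate_streams(data: bytes) -> list:
    """Return the body of every stream, decompressed when possible."""
    bodies = []
    for raw in STREAM_RE.findall(data):
        try:
            # decompressobj() tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            bodies.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not zlib data (no filter, or a filter we do not handle).
            bodies.append(raw)
    return bodies


def shown_text(data: bytes) -> list:
    """Collect the literal strings that the page content shows with Tj."""
    texts = []
    for body in inflate_streams(data):
        texts.extend(m.decode("latin-1") for m in TJ_RE.findall(body))
    return texts
```

Fonts with custom encodings, the TJ array operator, and hex strings would all need extra handling, which is one reason a dedicated extraction tool is the more practical route.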
Getting Text from PDF
A good way to extract text from a PDF is to use pdftotext:
$ pdftotext pdf.pdf
You should get a pdf.txt file. Reading it with the Linux commands cat or strings gives you the flag:
$ strings pdf.txt
flag{security_through_obscurity}
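When the extracted text is long, scanning it by eye gets tedious. A small Python sketch like the one below can pull out anything that looks like a flag. The flag{...} pattern is an assumption based on the single flag shown above, and find_flags is a hypothetical helper name.

```python
import re

# Assumed flag format: the literal "flag{", anything but "}", then "}".
FLAG_RE = re.compile(r"flag\{[^}]*\}")


def find_flags(text: str) -> list:
    """Return every substring of text that matches the assumed flag format."""
    return FLAG_RE.findall(text)
```

Calling find_flags on the contents of pdf.txt (for example, open("pdf.txt").read()) should return a list with the flag printed above.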
As a note, there are several other PDF forensics tools worth mentioning: Origami (pdfextract extracts JavaScript from PDF files), PDF Stream Dumper (several PDF analysis tools), Peepdf (command-line shell for examining PDF), PDF X-RAY Lite (creates an HTML report with decoded file structure and contents), SWF mastah (extracts SWF objects), and Pyew.
Hack all the things!