You have several options. All these methods work on Linux as well as
on Windows or Mac OS X. However, be aware that most PDFs do not include
the full, complete font face when they have a font embedded. Mostly they
include just the subset of glyphs used in the document.
Using pdftops
One of the most frequently used methods to do this on *nix systems consists of the following steps:
- Convert the PDF to PostScript, for example by using XPDF's pdftops (on Windows: pdftops.exe) helper program.
- The fonts will now be embedded in .pfa (PostScript) format, and you can extract them using a text editor.
- You may need to convert the .pfa (ASCII) to a .pfb (binary) file using the t1utils and pfa2pfb.
- PDFs never have .pfm or .afm files (font metric files) embedded (because PDF viewers have internal knowledge about these). Without these, font files are hardly usable in a visually pleasing way.
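The "extract them using a text editor" step can also be scripted. What follows is only a minimal Python sketch, not a tested recipe. It assumes that the PostScript file wraps every embedded font in the DSC comments %%BeginResource: font NAME and %%EndResource; check your own pdftops output first and adjust the markers if it looks different. The function name split_font_resources and the sample text are made up for this illustration.

```python
import re

# Assumption: each embedded font sits between these two DSC comment lines.
FONT_RESOURCE = re.compile(
    r"^%%BeginResource: font (\S+)[^\n]*\n(.*?)^%%EndResource",
    re.MULTILINE | re.DOTALL,
)

def split_font_resources(ps_text):
    """Map each font resource name to the PostScript text of that font."""
    return {m.group(1): m.group(2) for m in FONT_RESOURCE.finditer(ps_text)}

# Tiny stand-in for a real pdftops output file (hypothetical content).
sample = (
    "%!PS-Adobe-3.0\n"
    "%%BeginResource: font ABCDEF+SomeFont\n"
    "%!PS-AdobeFont-1.0: SomeFont\n"
    "currentfile eexec\n"
    "%%EndResource\n"
    "%%Page: 1 1\n"
)
fonts = split_font_resources(sample)
for name, body in fonts.items():
    print(name, len(body.splitlines()), "lines")
```

For a real file you would read the PostScript with a byte-preserving encoding such as latin-1 and write each body to its own .pfa file.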
Using fontforge
Another method is to use the Free font editor FontForge:
- Use the "Open Font" dialog box used when opening files.
- Then select "Extract from PDF" in the filter section of the dialog.
- Select the PDF file with the font to be extracted.
- A "Pick a font" dialog box opens -- select here which font to open.
Check the FontForge manual. You may need to follow a few specific
steps which are not necessarily straightforward in order to save the
extracted font data as a file which is re-usable.
Using mupdf
Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don't know about MuPDF, which is still relatively unknown and new: "MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.", written by Artifex Software developers, the same company that gave us Ghostscript.)
(Update: Newer versions of MuPDF have moved the former functionality of 'pdfextract' to the command 'mutool extract'. Download it here: mupdf.com/downloads)
Note: pdfextract.exe is a command-line program. To use it, do the following:
c:\> pdfextract.exe c:\path\to\filename.pdf # (on Windows)
$> pdfextract /path/to/filename.pdf # (on Linux, Unix, Mac OS X)
This command will dump all of the extractable files from the referenced
PDF file into the current directory. Generally you will see a variety
of files: images as well as fonts. These include PNG, TTF, CFF, CID,
etc. The image names will be like img-0412.png if the PDF object number
of the image was 412. The font names will be like
FGETYK+LinLibertineI-0966.ttf if the font's PDF object number was 966.
CFF (Compact Font Format) files are a recognized format that
can be converted to other formats via a variety of converters for use
on different operating systems.
Again: be aware that most of these font files may have only a
subset of characters and may not represent the complete typeface.
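You can often spot a subset from the file name alone: in PDF font names, a tag of six uppercase letters followed by '+' marks a subset font. Here is a small Python sketch that takes such a name apart. The pattern is only inferred from the example names in this answer, so treat it as an assumption about how the tool names its files; describe_font_file is a made-up helper name.

```python
import re

# Pattern inferred from names like FGETYK+LinLibertineI-0966.ttf (an assumption).
FONT_FILE_NAME = re.compile(
    r"^(?:(?P<tag>[A-Z]{6})\+)?(?P<name>.+)-(?P<obj>\d+)\.(?P<ext>\w+)$"
)

def describe_font_file(filename):
    """Split an extracted font file name into its parts, or return None."""
    m = FONT_FILE_NAME.match(filename)
    if m is None:
        return None
    return {
        "font": m.group("name"),
        "object": int(m.group("obj")),
        "format": m.group("ext"),
        # A six-letter tag plus '+' marks a subset font in PDF font names.
        "subset": m.group("tag") is not None,
    }

info = describe_font_file("FGETYK+LinLibertineI-0966.ttf")
print(info)
```

A missing tag does not prove that the font is complete; it only means the PDF producer did not label it as a subset.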
Update: (Jul 2013) Recent versions of mupdf have seen an internal
reshuffling and renaming of their binaries, not just once, but several
times. The main utility used to be a 'Swiss Army knife'-like binary
called mubusy (name inspired by busybox?), which more recently was
renamed to mutool. These support the sub-commands info, clean, extract,
poster and show. Unfortunately, the official documentation for these
tools isn't up to date (yet). If you're on a Mac using 'MacPorts', the
utility was renamed in order to avoid name clashes with other utilities
using identical names, and you may need to use mupdfextract.
To achieve (roughly) the same results with mutool as the previous tool
pdfextract did, just run mutool extract ... (or mubusy extract ... on
versions which still ship mubusy).
So to extract fonts and images, you may need to run one of the following command lines:
c:\> mutool.exe extract filename.pdf # (on Windows)
$> mutool extract filename.pdf # (on Linux, Unix, Mac OS X)
Downloads are here: mupdf.com/downloads
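Because the tool dumps everything into the current directory, I find it handy to run it from a scratch directory and sort the results afterwards. The Python sketch below does that under a few assumptions: the binary is on your PATH under the name mutool (pass another name if yours is called mubusy or mupdfextract), and the extension sets only list the formats named above. Both function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Extensions named in this answer; extend the sets for other formats.
FONT_EXTENSIONS = {".ttf", ".cff", ".cid"}
IMAGE_EXTENSIONS = {".png"}

def sort_extracted(filenames):
    """Group extracted file names into fonts, images and everything else."""
    groups = {"fonts": [], "images": [], "other": []}
    for name in sorted(filenames):
        ext = Path(name).suffix.lower()
        if ext in FONT_EXTENSIONS:
            groups["fonts"].append(name)
        elif ext in IMAGE_EXTENSIONS:
            groups["images"].append(name)
        else:
            groups["other"].append(name)
    return groups

def extract_all(pdf_path, out_dir, tool="mutool"):
    """Run '<tool> extract <pdf>' inside out_dir and group what it wrote."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # The tool writes into the current directory, so run it from out_dir.
    subprocess.run(
        [tool, "extract", str(Path(pdf_path).resolve())],
        cwd=str(out),
        check=True,
    )
    return sort_extracted(p.name for p in out.iterdir() if p.is_file())

# No PDF is needed to try the grouping step on its own:
print(sort_extracted(["img-0412.png", "FGETYK+LinLibertineI-0966.ttf"]))
```

With a real file you would call something like extract_all("filename.pdf", "extracted") and look at the "fonts" entry of the result.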
Using gs (Ghostscript)
Then, Ghostscript can also extract fonts directly from PDFs. However, it
needs the help of a special utility program named extractFonts.ps,
written in the PostScript language, which is available from the
Ghostscript source code repository.
To use it, you need to run both this file, extractFonts.ps, and your
PDF file. Ghostscript will then use the instructions from the
PostScript program to extract the fonts from the PDF. It looks like this
on Windows (yes, Ghostscript understands the 'forward slash', /, as a
path separator also on Windows!):
gswin32c.exe ^
-q -dNODISPLAY ^
c:/path/to/extractFonts.ps ^
-c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"
or on Linux, Unix or Mac OS X:
gs \
-q -dNODISPLAY \
/path/to/extractFonts.ps \
-c "(/path/to/your/PDFFile.pdf) extractFonts quit"
I tested the Ghostscript method a few years ago. At the time it
did extract *.ttf (TrueType) just fine. I don't know if other font types
will also be extracted at all, and if so, in a re-usable way. Nor do I
know if the utility blocks the extraction of fonts which are marked as
protected.
Using pdf-parser.py
Finally, Didier Stevens' pdf-parser.py: this one is probably not as easy to use, because you need to have some know-how about internal PDF structures.
pdf-parser.py is a Python script which can do a lot of other things too. It can also
decompress and extract arbitrary streams from objects, and therefore it
can extract embedded font files too.
But you need to know what to look for. Let's see it with an example. I have a file named
big1.pdf. As a first step I use the -s parameter to search the PDF for any occurrence of the keyword
FontFile (pdf-parser.py does not require a case-sensitive search):
pdf-parser.py -s fontfile big1.pdf
In my case, for my big1.pdf, I get this result:
obj 9 0
Type: /FontDescriptor
Referencing: 15 0 R
<<
/Ascent 728
/CapHeight 716
/Descent -210
/Flags 32
/FontBBox [ -665 -325 2000 1006 ]
/FontFile2 15 0 R
/FontName /ArialMT
/ItalicAngle 0
/StemV 87
/Type /FontDescriptor
/XHeight 519
>>
obj 11 0
Type: /FontDescriptor
Referencing: 16 0 R
<<
/Ascent 728
/CapHeight 716
/Descent -210
/Flags 262176
/FontBBox [ -628 -376 2000 1018 ]
/FontFile2 16 0 R
/FontName /Arial-BoldMT
/ItalicAngle 0
/StemV 165
/Type /FontDescriptor
/XHeight 519
>>
It tells me that there are two instances of FontFile2 inside the PDF, and these are in PDF objects no. 15 and no. 16, respectively. Object no. 15 holds the /FontFile2 for font /ArialMT, object no. 16 holds the /FontFile2 for font /Arial-BoldMT.
To show this more clearly:
pdf-parser.py -s fontfile big1.pdf | grep -i fontfile
/FontFile2 15 0 R
/FontFile2 16 0 R
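If a PDF contains many fonts, reading the search output by eye gets tedious. As a rough sketch, the Python below pairs each /FontName with the stream object its descriptor points to. It assumes the output is laid out like the listing above (an "obj N G" line followed by the dictionary entries); the function name font_streams is made up, and the report string is a shortened copy of that listing.

```python
import re

OBJ_HEADER = re.compile(r"^obj (\d+) (\d+)$")
FONT_NAME = re.compile(r"/FontName\s+/(\S+)")
FONT_FILE = re.compile(r"/(FontFile[23]?)\s+(\d+)\s+(\d+)\s+R")

def font_streams(report):
    """Return (font name, key, stream object number) per font descriptor."""
    entries = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if OBJ_HEADER.match(line):
            # A new object starts: collect its entries in a fresh record.
            current = {"font": None, "key": None, "stream": None}
            entries.append(current)
            continue
        if current is None:
            continue
        name = FONT_NAME.search(line)
        if name:
            current["font"] = name.group(1)
        ref = FONT_FILE.search(line)
        if ref:
            current["key"] = ref.group(1)
            current["stream"] = int(ref.group(2))
    # Keep only the objects that actually reference a font stream.
    return [(e["font"], e["key"], e["stream"])
            for e in entries if e["stream"] is not None]

report = """\
obj 9 0
Type: /FontDescriptor
Referencing: 15 0 R
<<
/FontFile2 15 0 R
/FontName /ArialMT
>>
obj 11 0
Type: /FontDescriptor
Referencing: 16 0 R
<<
/FontFile2 16 0 R
/FontName /Arial-BoldMT
>>
"""
for font, key, stream in font_streams(report):
    print(font, key, stream)
```

The third value of each tuple is the object number to pass to the -o parameter.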
A quick peek into the PDF specification reveals that the keyword /FontFile2 relates to a
'stream containing a TrueType font program' (/FontFile would relate to a
'stream containing a Type 1 font program' and /FontFile3 would relate to a
'stream containing a font program whose format is specified by the Subtype entry in the stream dictionary' {hence being a
Type1C or a CIDFontType0C subtype or, since PDF 1.6, OpenType}.)
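For scripting, this part of the specification boils down to a small lookup table. The Python sketch below is just my own summary of those entries, with made-up names; the OpenType subtype is the one that PDF 1.6 added for /FontFile3.

```python
# Wording follows the font descriptor entries of the PDF specification.
FONT_STREAM_KINDS = {
    "FontFile": "Type 1 font program",
    "FontFile2": "TrueType font program",
    "FontFile3": "font program whose format is given by /Subtype",
}

# Possible /Subtype values in the stream dictionary of a /FontFile3 stream.
FONTFILE3_SUBTYPES = {"Type1C", "CIDFontType0C", "OpenType"}

def describe_stream(key, subtype=None):
    """Say what kind of font program a font descriptor key points to."""
    kind = FONT_STREAM_KINDS.get(key)
    if kind is None:
        return "not a font stream key"
    if key == "FontFile3" and subtype in FONTFILE3_SUBTYPES:
        return subtype + " font program"
    return kind

print(describe_stream("FontFile2"))
print(describe_stream("FontFile3", "Type1C"))
```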
To look specifically at PDF object no. 15 (which holds the font /ArialMT), one can use the -o 15 parameter:
pdf-parser.py -o 15 big1.pdf
obj 15 0
Type:
Referencing:
Contains stream
<<
/Length1 778552
/Length 1581435
/Filter /ASCIIHexDecode
>>
This pdf-parser.py output tells us that this object
contains a stream (which it will not directly display) that has a length
of 1,581,435 bytes and is encoded with ASCIIHexEncode. It needs to be
decoded ("filtered") with the help of the standard /ASCIIHexDecode
filter. Note that ASCII hex encoding does not compress anything: it
writes every byte as two hex digits, which is why /Length is roughly
twice /Length1.
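What the /ASCIIHexDecode filter does is simple enough to sketch in a few lines of Python. This illustrates the filter as the PDF specification describes it (whitespace is ignored, '>' ends the data, a missing last digit counts as 0). It is not the code that pdf-parser.py itself uses, and the function name is made up.

```python
import binascii

def ascii_hex_decode(data):
    """Decode bytes that were encoded with the ASCIIHexEncode filter."""
    # Everything after the end-of-data marker '>' is ignored.
    body = data.split(b">", 1)[0]
    # ASCII whitespace between the hex digits is insignificant.
    digits = b"".join(body.split())
    # An odd digit count means the final digit is followed by an implied 0.
    if len(digits) % 2:
        digits += b"0"
    return binascii.unhexlify(digits)

decoded = ascii_hex_decode(b"00 01 00 00\n4F54 544F>")
print(decoded)
```

In practice there is no need to do this by hand, since pdf-parser.py can apply the filter for you.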
To dump any stream from an object, pdf-parser.py can be called with the -d dumpname parameter. Let's do it:
pdf-parser.py -o 15 -d dumped-data.ext big1.pdf
Our extracted data dump will be in the file named
dumped-data.ext. Let's see how big it is:
ls -l dumped-data.ext
-rw-r--r-- 1 kurtpfeifle staff 1581435 Apr 11 00:29 dumped-data.ext
Oh look, it is 1,581,435 bytes. We saw this figure in the previous
command's output. Opening this file with a text editor confirms that its
content is ASCII hex encoded data.
Opening the file with a font reading tool like otfinfo (this is a part of the lcdf-typetools package) will lead to some disappointment at first:
otfinfo -i dumped-data.ext
otfinfo: dumped-data.ext: not an OpenType font (bad magic number)
OK, this is because we did not (yet) let pdf-parser.py make use of its full magic: to dump a filtered, decoded stream. For this we have to add the -f parameter:
pdf-parser.py -o 15 -f -d dumped-data-decoded.ext big1.pdf
What is the size of this new file?
ls -l dumped-data-decoded.ext
-rw-r--r-- 1 kurtpfeifle staff 778552 Apr 11 00:39 dumped-data-decoded.ext
Oh, look: that exact number was also already stored in the PDF object no. 15 dictionary as the value for key /Length1 ...
What does file think it is?
file dumped-data-decoded.ext
dumped-data-decoded.ext: TrueType font data
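The "bad magic number" complaint from earlier and this verdict both come down to the first four bytes of the file. Here is a minimal Python sketch of such a check. It is not how file or otfinfo are implemented, the table only covers the common sfnt-style signatures, and sniff_font is a made-up name.

```python
# First four bytes of the font container formats checked here.
MAGIC_NUMBERS = {
    b"\x00\x01\x00\x00": "TrueType",
    b"true": "TrueType (Apple)",
    b"OTTO": "OpenType with CFF outlines",
    b"ttcf": "TrueType/OpenType collection",
}

def sniff_font(data):
    """Return a format name for the leading bytes, or None if unknown."""
    return MAGIC_NUMBERS.get(bytes(data[:4]))

# Leading bytes of a decoded TrueType stream are recognized ...
print(sniff_font(b"\x00\x01\x00\x00\x00\x18"))
# ... but the same bytes written out as hex digits are not.
print(sniff_font(b"00010000 0018"))
```

To check a dump on disk, read its first four bytes in binary mode and pass them to sniff_font.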
What does otfinfo tell us about it?
otfinfo -i dumped-data-decoded.ext
Family: Arial
Subfamily: Regular
Full name: Arial
PostScript name: ArialMT
Version: Version 5.10
Unique ID: Monotype:Arial Regular:Version 5.10 (Microsoft)
Designer: Monotype Type Drawing Office - Robin Nicholas, Patricia Saunders 1982
Manufacturer: The Monotype Corporation
Trademark: Arial is a trademark of The Monotype Corporation.
Copyright: © 2011 The Monotype Corporation. All Rights Reserved.
License Description: You may use this font to display and print content as permitted by
the license terms for the product in which this font is included.
You may only (i) embed this font in content as permitted by the
embedding restrictions included in this font; and (ii) temporarily
download this font to a printer or other output device to help
print content.
Vendor ID: TMC
So Bingo!, we have a winner: pdf-parser.py did indeed
extract a valid font file for us. Given the size of this file (778,552
bytes), it looks like this font had even been embedded completely
in the PDF...
We could rename it to arial-regular.ttf and install it as such and happily make use of it.
Caveats:
- In any case you need to follow the license that
applies to the font. Some font licenses do not allow free use and/or
distribution. Pirating fonts is like pirating any software or other
copyrighted material.
- Most PDFs out in the wild do not
embed the full font anyway, but only subsets. Extracting a subset of a
font is only useful in a very limited scope, if at all.
Please do also read the following about Pros and (more) Cons regarding font extraction efforts: