Last Tuesday, we gave you four PDFs and challenged you to identify the “redacted” author in each of them. Two were on the easier side, and two a bit more challenging. Were you able to get all the authors? Here are the answers and explanations for each.
This scanned letter from WWI has gone through OCR (optical character recognition), so the image of typewritten letters has been translated into machine-readable text (meaning you can highlight text with your cursor). Although a black box was placed over the image of the author’s name, the text still exists below the box. A reader can select the obscured text with their cursor, copy it, and paste it elsewhere to read. Or, if the file is opened in Adobe Reader or Adobe Acrobat, the box can simply be moved aside.
See the original document in our Digital Collections: Arthur Bluethenthal to Davey and Arthur, Jan. 27, 1917. (Note that the OCR used in this blog example is lousy, but the one in the Digital Collections is higher quality.)
In Word, you can add black highlight to text, which is what we did here. As far as Word is concerned, the text itself hasn’t been changed, so when the document is converted to PDF, the text is transferred along with everything else. As in the example above, all someone needs to do is copy and paste the text.
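To see why covering text does not remove it, here is a minimal sketch in Python. It assumes a hand-written, uncompressed page content stream (the stream, the name in it, and the helper `extract_shown_text` are all made up for illustration and are not taken from the challenge PDFs). The stream draws a line of text and then paints a black rectangle over it; the text is still sitting in the data.

```python
import re
from typing import List

# Hypothetical page content stream, written by hand for this example.
# It shows one line of text, then paints a filled black rectangle on top.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(Sincerely, Pat Example) Tj
ET
0 0 0 rg
72 695 200 16 re
f
"""


def extract_shown_text(stream: bytes) -> List[str]:
    """Return every literal string passed to the Tj (show text) operator."""
    # Simplification: ignores escaped parentheses, hex strings, and TJ arrays.
    matches = re.findall(rb"\(([^()]*)\)\s*Tj", stream)
    return [m.decode("latin-1") for m in matches]


# The rectangle drawn afterwards has no effect on what the text operator holds.
print(extract_shown_text(CONTENT_STREAM))
```

Real PDFs usually compress their content streams, so a proper PDF library would be needed in practice. The point is only that the box and the text are separate instructions.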
This is the perfect way to redact content from a document: open the document in its original authoring application (in this case, Word), delete the text, and replace it with something else. You can replace the confidential information with “X”s or with a black box.
So, where is the author’s name? It’s in the metadata. Remember the first time you opened Word on your computer and it asked for your name? Each time you create a document in Word, it automatically inserts that name into the metadata. When the document is converted to PDF, Word passes that metadata along into the PDF document. You can open the PDF in Adobe Reader or Acrobat, go to Properties, and see that the author of this document was Jane Smith.
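If you would rather check for an author name without opening Properties, the idea can be sketched in a few lines of Python. This is a rough illustration only: it assumes the author is stored as a plain, unencoded string in the file's information dictionary, and the helper name `find_author` and the sample bytes are made up for this example. Many real PDFs compress or encode their metadata (or also keep a copy in XMP), so a dedicated PDF library is the more reliable choice.

```python
import re
from typing import Optional


def find_author(pdf_bytes: bytes) -> Optional[str]:
    """Return the /Author value if it is stored as a plain literal string."""
    # Simplification: does not handle escaped parentheses, hex strings,
    # UTF-16 text, or metadata kept inside compressed object streams.
    match = re.search(rb"/Author\s*\(([^()]*)\)", pdf_bytes)
    return match.group(1).decode("latin-1") if match else None


# Hypothetical fragment of a PDF information dictionary.
SAMPLE_INFO = b"<< /Title (Memo) /Author (Jane Smith) /Producer (Example) >>"

print(find_author(SAMPLE_INFO))
```

To try it on a file of your own, you could read the file in binary mode and pass the bytes in, keeping in mind that a result of `None` does not prove the metadata is clean.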
This was a tough one! If you tried opening this in Adobe Reader or Adobe Acrobat, you might have noticed that the black box couldn’t be grabbed or moved. The trick? You need to open the file in Photoshop, its original authoring application. From there, you can easily move the black box aside.
There are lots of good ways to redact in Photoshop, but if you’re not careful, you might get a file like this one. In Photoshop, the black box was added as a second layer, and then the file was saved as a PDF using Photoshop’s default settings. These default settings are designed to lower the risk of accidentally losing information, which is exactly the opposite of what redaction is for. Make sure you always flatten your image (merge the layers) before saving to PDF. This deletes any hidden pixels, leaving only the pixels that are actually visible.
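Here is a toy model of what flattening does, written in Python. It is not how Photoshop works internally; it simply assumes each layer is a small grid in which `None` means "transparent", and it uses letters as stand-ins for pixels. The `flatten` function and the sample layers are invented for this illustration.

```python
def flatten(layers):
    """Merge layers bottom-to-top, keeping only the topmost visible pixel."""
    height, width = len(layers[0]), len(layers[0][0])
    result = [[None] * width for _ in range(height)]
    for layer in layers:  # later layers sit on top of earlier ones
        for y in range(height):
            for x in range(width):
                if layer[y][x] is not None:  # None means transparent
                    result[y][x] = layer[y][x]
    return result


# Bottom layer: two rows of "pixels"; the second row is the name to hide.
background = [list("From"), list("Jane")]
# Top layer: transparent over the first row, a black box over the second.
black_box = [[None] * 4, ["#"] * 4]

flat = flatten([background, black_box])
print(flat)
```

While the layers are kept separately, the name is still stored in `background` and reappears as soon as the box is moved. In the flattened result, the second row contains only the box, so there is nothing left underneath to recover.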
For more tips and resources about redacting from PDFs, see last Tuesday’s post. It has links to some useful NSA guidelines, information about confidential records in North Carolina, and a few good general pointers about redaction and PDF documents.