Become a Sitecore PDF Ninja

I’m going to start this off by saying PDFs are evil, and if you can avoid using them, I implore you to do so at all costs.  Working with them will inevitably lead to lots of frustration.

In our world today PDFs are incredibly prevalent.  It seems like almost every organization has a collection of PDFs for download.  Users and corporations alike seem to have embraced the PDF completely, but that doesn’t change the fact that PDFs are incredibly annoying to manage programmatically and dynamically.

In a C# world you have two main choices for managing PDFs.  The first is iTextSharp.  However, I didn’t look into this library much because I noticed its pricing model.  In a nutshell, it’s free as long as whatever you’re building is completely open source.  I suspect the vast majority of Sitecore clients are closed source.  Unfortunately, it also looks like a sizable number of people have missed this fact and are using the library for commercial gain without a license, potentially opening themselves up to a lawsuit.  Scary stuff, so I looked elsewhere.

I chose instead to focus on PDFsharp, which is free for any situation.  The same project also offers MigraDoc, a library specifically for building PDFs, which I found particularly handy.

I have already outlined a solution to make PDFs searchable in the Sitecore search index.  Here I’m going to share a few more tricks I’ve discovered.

Generating PDFs

If you want to generate a PDF out of markup, you’re generally going to be out of luck.  Because of the dramatic differences in the medium (HTML being for screens, PDFs being for printing), you’re never going to get a perfect result.  I believe iTextSharp has a method to do this, but PDFsharp does not.  I did see this workaround, which I thought was interesting and perhaps worth a try.

I chose to use MigraDoc, which ended up being quite easy.  There are a few paradigm changes that you need to understand.

  1. There are no pixels in PDFs; measurements are in actual lengths (inches, centimeters, etc.).
  2. Each page is its own entity that can have different widths, headers, footers, margins, etc.
  3. There are elements similar to most HTML elements, such as paragraphs, headings, tables, etc.
  4. Each element has default settings for sizes and spacing that can be overridden on an individual basis.
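
To make the first point concrete, here is a minimal sketch of the arithmetic between the length units that show up in PDF work.  It relies only on the standard definitions (72 points per inch, 2.54 centimeters per inch).  The PdfLengths class and its method names are hypothetical helpers written for illustration; they are not part of MigraDoc, which has its own Unit type for this.

```csharp
using System;

// Hypothetical helper (not part of MigraDoc): plain arithmetic between PDF length units.
public static class PdfLengths
{
    public const double PointsPerInch = 72.0;
    public const double CentimetersPerInch = 2.54;

    public static double InchesToPoints(double inches) => inches * PointsPerInch;

    public static double PointsToInches(double points) => points / PointsPerInch;

    public static double CentimetersToPoints(double centimeters) =>
        centimeters / CentimetersPerInch * PointsPerInch;
}

public static class Program
{
    public static void Main()
    {
        // A US Letter page is 8.5 x 11 inches.
        Console.WriteLine(PdfLengths.InchesToPoints(8.5)); // 612
        Console.WriteLine(PdfLengths.InchesToPoints(11));  // 792

        // A half-inch margin expressed in points.
        Console.WriteLine(PdfLengths.InchesToPoints(0.5)); // 36

        // A 36 point heading is half an inch tall.
        Console.WriteLine(PdfLengths.PointsToInches(36));  // 0.5
    }
}
```

Font sizes are usually given in points and page dimensions in inches or centimeters, so it helps to keep these ratios in mind when a layout looks off.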

Here is a sample of setting up default elements and page settings:

		// Requires: using MigraDoc.DocumentObjectModel;
		private static void PdfDocumentSetup(Document doc)
		{
			//Default text
			Style style = doc.Styles["Normal"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(12);
			//Body Text
			style = doc.Styles.AddStyle("Paragraph2", "Normal");
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(12);
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//Title
			style = doc.Styles["Heading1"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(45);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading2"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(16);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading3"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(20);
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading4"];
			style.Font.Name = "Arial";
			style.Font.Size = Unit.FromPoint(16);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//Column Heading
			style = doc.Styles["Heading5"];
			style.Font.Name = "Arial";
			style.Font.Size = Unit.FromPoint(20);
			style.Font.Color = Color.FromRgbColor(255, new Color(0, 128, 192));
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.SpaceBefore = 12;
			style.ParagraphFormat.PageBreakBefore = false;
			//Bullets
			style = doc.AddStyle("Bullets", "Normal");
			style.ParagraphFormat.LeftIndent = Unit.FromInch(1.25);
			// Underlined section heading
			style = doc.AddStyle("Heading3Underlined", "Heading3");
			style.ParagraphFormat.Borders.Bottom = new Border() { Width = Unit.FromMillimeter(1), Color = Colors.Black };
			// Page setup: US Letter (8.5 x 11 in) with margins and header/footer distances
			doc.DefaultPageSetup.PageHeight = Unit.FromInch(11);
			doc.DefaultPageSetup.PageWidth = Unit.FromInch(8.5);
			doc.DefaultPageSetup.LeftMargin = Unit.FromInch(.5);
			doc.DefaultPageSetup.RightMargin = Unit.FromInch(.5);
			doc.DefaultPageSetup.FooterDistance = Unit.FromInch(.75);
			doc.DefaultPageSetup.HeaderDistance = Unit.FromInch(.75);
			doc.DefaultPageSetup.TopMargin = Unit.FromInch(1.5);
			doc.DefaultPageSetup.BottomMargin = Unit.FromInch(2);
		}
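
Once styles are defined, building content is mostly a matter of adding sections, paragraphs and tables and naming the style each one should use.  The following is only a sketch of how that might look, not the code from my project.  It assumes the PDFsharp-MigraDoc 1.x package is referenced, uses a short stand-in for the style setup above, and ReportBuilder, BuildSampleReport and all the sample text are made up for illustration.

```csharp
using System.IO;
using MigraDoc.DocumentObjectModel;
using MigraDoc.DocumentObjectModel.Tables;
using MigraDoc.Rendering;

public static class ReportBuilder
{
    // Hypothetical helper: builds a one-section document and renders it to a stream.
    public static MemoryStream BuildSampleReport()
    {
        var doc = new Document();

        // Short stand-in for the fuller style setup shown above.
        Style heading = doc.Styles["Heading1"];
        heading.Font.Size = Unit.FromPoint(24);
        heading.Font.Bold = true;

        // Content lives in sections; each element can name the style it uses.
        Section section = doc.AddSection();
        section.AddParagraph("Sample Report", "Heading1");
        section.AddParagraph("Body text picks up the Normal style by default.");

        // Column widths are real lengths, not pixels.
        Table table = section.AddTable();
        table.Borders.Width = 0.5;
        table.AddColumn(Unit.FromInch(3));
        table.AddColumn(Unit.FromInch(2));

        Row header = table.AddRow();
        header.Cells[0].AddParagraph("Name");
        header.Cells[1].AddParagraph("Value");

        Row row = table.AddRow();
        row.Cells[0].AddParagraph("Example");
        row.Cells[1].AddParagraph("42");

        // Render to an in-memory PDF; false keeps the stream open for the caller.
        var renderer = new PdfDocumentRenderer(true) { Document = doc };
        renderer.RenderDocument();
        var output = new MemoryStream();
        renderer.Save(output, false);
        return output;
    }
}
```

Returning a stream rather than writing a file keeps the result easy to hand to other code, such as a media item or another PDF library.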

Aggregate PDFs

You might need to combine two PDFs, or take a cover letter PDF and combine it with a generated PDF.  In my case I had to take a customized cover letter and prepend it to a table of output data.

PDFsharp makes this amazingly easy.  Simply open both PDF sources in PDFsharp.  With MigraDoc, you can do this by saving the generated PDF to a stream and then opening that stream in PDFsharp.

			// Requires: using MigraDoc.Rendering;
			MemoryStream generated = new MemoryStream();
			PdfDocumentRenderer renderer = new PdfDocumentRenderer(true) { Document = doc };
			renderer.RenderDocument();
			renderer.Save(generated, false); // false keeps the stream open for reuse

The above code takes the MigraDoc document (doc) and renders it to a memory stream (generated), which can then be opened in PDFsharp:

			// Requires: using PdfSharp.Pdf; using PdfSharp.Pdf.IO;
			//_sitecore is a Sitecore Item API abstraction service to allow testability
			var coverletter = PdfReader.Open(_sitecore.GetPdfCoverletterStream(item), PdfDocumentOpenMode.Import);
			var pdf = PdfReader.Open(generated, PdfDocumentOpenMode.Modify); // this is our stream from above
			// Insert the cover letter pages, in order, ahead of the generated pages
			for (int i = 0; i < coverletter.PageCount; i++)
			{
				var newPage = coverletter.Pages[i];
				pdf.Pages.Insert(i, newPage);
			}
			MemoryStream ret = new MemoryStream();
			pdf.Save(ret, false);
			return ret;

You simply take each page from one document and insert it into the other, then save the result in whatever way you need (a stream, in our case).

Injecting and reading PDFs from Sitecore Media

Getting PDFs from Sitecore is easy.  You can use the MediaManager to get the PDF stream like so:

		public Stream GetPdfStream(Item pdf)
		{
			return MediaManager.GetMedia(pdf).GetStream().Stream;
		}

Once you’ve made your modifications you can write your PDF to a Sitecore media item like so:

					using (new SecurityDisabler())
					{
						pdfItem.Editing.BeginEdit();
						pdfItem.Fields["Blob"].SetBlobStream(pdf);//our stream that we were working with
						pdfItem.Fields["Extension"].Value = "pdf";
						pdfItem.Fields["Mime Type"].Value = "application/pdf";
						pdfItem.Editing.EndEdit();
					}

Find and replace tokens inside a PDF

			// ISO-8859-1 maps every byte 0-255 to exactly one char, so the stream round-trips unchanged
			Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
			DateTime now = DateTime.Now;
			var coverletter = PdfReader.Open(_sitecore.GetPdfStream(item), PdfDocumentOpenMode.Import);
			for (int i = 0; i < coverletter.PageCount; i++)
			{
				var newPage = coverletter.Pages[i];

				for (int j = 0; j < newPage.Contents.Elements.Count; j++)
				{
					PdfDictionary.PdfStream stream = newPage.Contents.Elements.GetDictionary(j).Stream;
					// Decode the raw content stream bytes into a string we can search
					string content = latin1.GetString(stream.Value);

					content = content
						.Replace("__day__", now.Day.ToString())
						.Replace("__month__", now.ToString("MMMM"))
						.Replace("__year__", now.Year.ToString());

					// Encode with the same single-byte encoding; UTF-8 would turn any byte above 127 into two bytes
					stream.Value = latin1.GetBytes(content);
				}
			}

In this approach we can see that if we convert the stream’s byte array to a string, it will contain all the characters used in this PDF.  BEFORE YOU GO THINKING THIS WILL ALWAYS WORK: it likely won’t.  Many things need to be the case for this to work, as seen here.

As far as I can figure, a surefire way of making this work isn’t possible.  However, if you can agree on a standard PDF creation process with your content authors, then with some tweaks you can get this to work.
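
Two of the conditions are easy to see on plain bytes, without any PDF library at all.  The sketch below first shows that the bytes only survive a round trip through a string when a single-byte encoding (ISO-8859-1) is used in both directions, and then that a token is only found when the PDF stores it as one contiguous run of characters.  ContentStreamText and TryReplaceToken are hypothetical names, and the sample bytes are hand-written stand-ins for what a content stream might hold.

```csharp
using System;
using System.Linq;
using System.Text;

public static class ContentStreamText
{
    // ISO-8859-1 maps each byte 0-255 to the char with the same code, so nothing is lost.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static string Decode(byte[] raw) => Latin1.GetString(raw);

    public static byte[] Encode(string text) => Latin1.GetBytes(text);

    // Hypothetical helper: replaces a token only when it appears contiguously,
    // and reports whether it was found so the caller can log a miss.
    public static bool TryReplaceToken(byte[] raw, string token, string value, out byte[] result)
    {
        string text = Decode(raw);
        bool found = text.Contains(token);
        result = found ? Encode(text.Replace(token, value)) : raw;
        return found;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hand-written stand-in for stream bytes: some text plus one byte above 127.
        byte[] raw = Encoding.ASCII.GetBytes("(Dated __year__) Tj ")
            .Concat(new byte[] { 0xE9 })
            .ToArray();

        byte[] viaLatin1 = ContentStreamText.Encode(ContentStreamText.Decode(raw));
        byte[] viaUtf8 = Encoding.UTF8.GetBytes(ContentStreamText.Decode(raw));

        Console.WriteLine(viaLatin1.SequenceEqual(raw)); // True
        Console.WriteLine(viaUtf8.Length - raw.Length);  // 1 (the 0xE9 byte became two bytes)

        // Contiguous token: found and replaced.
        Console.WriteLine(ContentStreamText.TryReplaceToken(raw, "__year__", "2024", out _)); // True

        // Token split into two text runs with spacing in between: no match.
        byte[] split = Encoding.ASCII.GetBytes("[(__ye) -20 (ar__)] TJ");
        Console.WriteLine(ContentStreamText.TryReplaceToken(split, "__year__", "2024", out _)); // False
    }
}
```

Checking the return value before saving is a cheap way to find out that a particular PDF was produced in a way that breaks the tokens, rather than discovering it when a cover letter goes out with __year__ still on it.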