Become a Sitecore PDF Ninja

I’m going to start this off by saying PDFs are evil, and if you can avoid using them, I implore you to avoid them at all costs.  They will inevitably lead to lots of frustration.

In our world today PDFs are incredibly prevalent.  It seems like almost every organization has a collection of PDFs for download.  Users and corporations alike seem to have embraced the PDF completely; however, that doesn’t change the fact that they are incredibly annoying to manage programmatically and dynamically.

In a C# world you have two main choices for managing PDFs.  The first is iTextSharp.  However, I didn’t look into this library much because I noticed its pricing model.  In a nutshell, it’s free as long as whatever you’re building is completely open source.  I suspect the vast majority of Sitecore clients are closed source.  Unfortunately it also looks like a sizable number of people have missed this fact and are stealing this library for commercial gain, potentially opening themselves up to a lawsuit.  Scary stuff, so I looked elsewhere.

I chose instead to focus on PdfSharp, which is free for any situation.  They also have a tool called MigraDoc, specifically for building PDFs, which I found particularly handy.

I have already outlined a solution to make PDFs searchable in the Sitecore search index.  Here I’m going to share a few more tricks I’ve discovered.

Generating PDFs

If you want to generate a PDF out of markup, you’re generally going to be out of luck: due to the dramatic differences in the medium (HTML being for screens, PDFs being for printing) you’re never going to get a perfect result.  I believe iTextSharp has a method to do this but PdfSharp does not.  I did see this workaround that I thought was interesting and perhaps worth a try.

I chose to use MigraDoc which ended up being quite easy.  There are a few paradigm changes that you need to understand.

  1. There are no pixels in PDFs; measurements are in actual lengths (inches, centimeters, etc.).
  2. Each page is its own entity that can have different widths, headers, footers, margins, etc.
  3. There are elements similar to most HTML elements, such as paragraphs, headings, tables, etc.
  4. Each element has default settings for sizes and spacing that can be overridden on an individual basis.
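
To make the first point concrete, here is a minimal sketch of the arithmetic involved, assuming the usual conventions of 72 points per inch, 2.54 centimeters per inch, and 96 CSS pixels per inch.  The PdfUnits class and its method names are made up for illustration; in MigraDoc itself the Unit type does these conversions for you.

```csharp
using System;

// Hypothetical helper for illustration only; not part of PdfSharp or MigraDoc.
public static class PdfUnits
{
    private const double PointsPerInch = 72.0;
    private const double CentimetersPerInch = 2.54;
    private const double CssPixelsPerInch = 96.0; // assumption: the standard CSS reference pixel

    public static double InchesToPoints(double inches) => inches * PointsPerInch;

    public static double CentimetersToPoints(double cm) => cm / CentimetersPerInch * PointsPerInch;

    public static double CssPixelsToPoints(double px) => px / CssPixelsPerInch * PointsPerInch;
}

public static class PdfUnitsDemo
{
    public static void Main()
    {
        // US Letter is 8.5 x 11 inches.
        Console.WriteLine(PdfUnits.InchesToPoints(8.5));   // prints 612
        Console.WriteLine(PdfUnits.InchesToPoints(11));    // prints 792
        // A 96px wide HTML element corresponds to one inch on paper.
        Console.WriteLine(PdfUnits.CssPixelsToPoints(96)); // prints 72
    }
}
```

The practical upshot is that a layout designed in pixels has to be re-thought in physical lengths before it goes anywhere near a PDF page.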

Here is a sample of setting up default elements and page settings:

		private static void PdfDocumentSetup(Document doc)
		{
			//Default text
			Style style = doc.Styles["Normal"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(12);
			//Body Text
			style = doc.Styles.AddStyle("Paragraph2", "Normal");
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(12);
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//Title
			style = doc.Styles["Heading1"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(45);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading2"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(16);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading3"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(20);
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading4"];
			style.Font.Name = "Arial";
			style.Font.Size = Unit.FromPoint(16);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//Column Heading
			style = doc.Styles["Heading5"];
			style.Font.Name = "Arial";
			style.Font.Size = Unit.FromPoint(20);
			style.Font.Color = Color.FromRgbColor(255, new Color(0, 128, 192));
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.SpaceBefore = 12;
			style.ParagraphFormat.PageBreakBefore = false;
			//Bullets
			style = doc.AddStyle("Bullets", "Normal");
			style.ParagraphFormat.LeftIndent = Unit.FromInch(1.25);
			// Underlined section heading
			style = doc.AddStyle("Heading3Underlined", "Heading3");
			style.ParagraphFormat.Borders.Bottom = new Border() { Width = Unit.FromMillimeter(1), Color = Colors.Black };
			doc.DefaultPageSetup.PageHeight = Unit.FromInch(11);
			doc.DefaultPageSetup.PageWidth = Unit.FromInch(8.5);
			doc.DefaultPageSetup.LeftMargin = Unit.FromInch(.5);
			doc.DefaultPageSetup.RightMargin = Unit.FromInch(.5);
			doc.DefaultPageSetup.FooterDistance = Unit.FromInch(.75);
			doc.DefaultPageSetup.HeaderDistance = Unit.FromInch(.75);
			doc.DefaultPageSetup.TopMargin = Unit.FromInch(1.5);
			doc.DefaultPageSetup.BottomMargin = Unit.FromInch(2);
		}

Aggregate PDFs

You might need to combine two PDFs or take a cover letter PDF and combine it with a generated portion of the PDF.  In my case I had to take a customized cover letter and prepend it to a table output of data.

PdfSharp makes this amazingly easy.  Simply open both PDF sources in PdfSharp.  With MigraDoc, you can do this by saving the generated PDF to a stream and then opening the stream in PdfSharp.

			MemoryStream ret = new MemoryStream();
			PdfDocumentRenderer renderer = new PdfDocumentRenderer(true) { Document = doc };
			renderer.RenderDocument();
			renderer.Save(ret, false);

The above code will take the MigraDoc document (doc) and render it to a memory stream, which can then be opened in PdfSharp:

			//_sitecore is a Sitecore Item API abstraction service to allow testability
			var coverletter = PdfReader.Open(_sitecore.GetPdfCoverletterStream(item), PdfDocumentOpenMode.Import);
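			var pdf = PdfReader.Open(generatedPdf, PdfDocumentOpenMode.Modify); // generatedPdf is the MemoryStream rendered above (named ret in that snippet), not the MigraDoc document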
			for (int i = 0; i < coverletter.PageCount; i++)
			{
				var newPage = coverletter.Pages[i];
				pdf.Pages.Insert(i, newPage);
			}
			MemoryStream ret = new MemoryStream();
			pdf.Save(ret, false);
			return ret;

You simply take each page from one document and insert it into the other, then save the result in whatever way you need (a stream, in our case).

Injecting and reading PDFs from Sitecore Media

Getting PDFs from Sitecore is easy. You can use the MediaManager to get the PDF stream like so.

		public Stream GetPdfStream(Item pdf)
		{
			return MediaManager.GetMedia(pdf).GetStream().Stream;
		}

Once you’ve made your modifications you can write your PDF to a Sitecore media item like so:

					using (new SecurityDisabler())
					{
						pdfItem.Editing.BeginEdit();
						pdfItem.Fields["Blob"].SetBlobStream(pdf);//our stream that we were working with
						pdfItem.Fields["Extension"].Value = "pdf";
						pdfItem.Fields["Mime Type"].Value = "application/pdf";
						pdfItem.Editing.EndEdit();
					}

Find and replace tokens inside a PDF

			var coverletter = PdfReader.Open(_sitecore.GetPdfStream(item), PdfDocumentOpenMode.Import);
			for (int i = 0; i < coverletter.PageCount; i++)
			{
				var newPage = coverletter.Pages[i];

				for (int j = 0; j < newPage.Contents.Elements.Count; j++)
				{
					PdfDictionary.PdfStream stream = newPage.Contents.Elements.GetDictionary(j).Stream;
					// Latin-1 maps each byte to exactly one char, so untouched bytes survive the round trip.
					// (Casting bytes to char and re-encoding as UTF-8 would corrupt any byte above 127.)
					Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
					string content = latin1.GetString(stream.Value);

					content = content
						.Replace("__day__", DateTime.Now.Day.ToString())
						.Replace("__month__", DateTime.Now.ToString("MMMM"))
						.Replace("__year__", DateTime.Now.Year.ToString());

					stream.Value = latin1.GetBytes(content);
				}
			}

In this approach we read each content stream as a byte array, which contains all the characters used in this PDF, and replace the tokens in it.  Before you go thinking this will always work: it likely won’t.  There are many things that need to be the case for this to work, as seen here.

As far as I can figure, a surefire way of making this work isn’t possible; however, if you can agree on a standard PDF creation process with your content authors, with some tweaks you can get this to work.
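
The byte-level part of the replacement can be isolated into a small helper.  This is only a sketch, under the same assumption as the code above: the tokens appear literally and uncompressed in the content stream.  The PdfTokenReplacer name is hypothetical.  It decodes with ISO-8859-1 so that every byte maps to exactly one character and back again; replacement values containing characters outside that range would not survive.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; works on any byte array, not just PdfSharp streams.
public static class PdfTokenReplacer
{
    // ISO-8859-1 (Latin-1) round-trips all 256 byte values unchanged.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static byte[] ReplaceTokens(byte[] content, IDictionary<string, string> tokens)
    {
        if (content == null) throw new ArgumentNullException(nameof(content));
        if (tokens == null) throw new ArgumentNullException(nameof(tokens));

        string text = Latin1.GetString(content);
        foreach (KeyValuePair<string, string> token in tokens)
        {
            text = text.Replace(token.Key, token.Value);
        }
        return Latin1.GetBytes(text);
    }
}

public static class PdfTokenReplacerDemo
{
    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        // A made-up fragment of a content stream containing a non-ASCII byte (0xE9).
        byte[] input = latin1.GetBytes("(Sign\u00e9 le __day__) Tj");
        var tokens = new Dictionary<string, string> { { "__day__", "14" } };

        byte[] output = PdfTokenReplacer.ReplaceTokens(input, tokens);
        Console.WriteLine(latin1.GetString(output)); // prints (Signé le 14) Tj
    }
}
```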

Sitecore Azure Search Issues

As soon as Sitecore 8.2 came out with a PaaS option I adopted it immediately for a client. The main driving factor was that the client was a Microsoft shop and the idea of having a Java search tool (Solr) was a hard thing for them to swallow. They loved the idea of Azure Search and bought into it immediately.

I had my concerns about using a new technology in Sitecore but I decided to give it a try anyway. I found a number of trouble spots, both with Sitecore’s implementation and with some limitations of the tool in general.

Sitecore API bugs

Sitecore Search API unable to query by datetimes.

If you’re using Sitecore’s API to access the search index, which is Sitecore’s recommended way to go, you are unable to query by datetimes.  So if you’re making an event search tool, you might want to either reconsider using Sitecore’s search API or go with a direct to Azure Search solution.

Cannot query for ID right after an app pool recycle

Immediately following an app pool recycle, the Sitecore search API is unable to query by item ID.  However, this can be alleviated if you go directly to the index.

Azure Search shortcomings

 

Azure Search only facets with AND logic never OR logic

A very common search scenario is to have the logic be OR within a particular grouping, then AND between groups.  For example, if you are looking for a new PC you might want to search for one that has an i7 processor and is in the price range of 600 – 1200 dollars.  Say you’re given range facets of 600 – 800, 800 – 1000, and 1000 – 1200.  Logically you would want to select all 3 options to get the range and processor you want.  However, the faceting in Azure Search makes this impossible: as soon as you select 600 – 800 all the other options will disappear, as there can’t be computers that fall under 2 separate price ranges.

Hypothetically it would be possible to overcome this by making multiple queries to the index; however, it would be quite a complex solution and could increase the load on your index many times over.
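
The filter half of such a workaround can be sketched as follows.  This assumes you build the OData filter string yourself and rely on the and/or operators that Azure Search filter expressions accept; the FacetFilterBuilder class and the price and processor field names are hypothetical.  It only shows how selections combine (OR inside a group, AND between groups); getting counts for the unselected buckets would still take the extra queries mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; it only concatenates strings and knows nothing about the index.
public static class FacetFilterBuilder
{
    // Each inner sequence is one facet group; its clauses are OR'ed, and the groups are AND'ed.
    public static string Build(IEnumerable<IEnumerable<string>> groups)
    {
        IEnumerable<string> clauses = groups
            .Select(group => group.Where(c => !string.IsNullOrWhiteSpace(c)).ToList())
            .Where(group => group.Count > 0)
            .Select(group => "(" + string.Join(" or ", group.Select(c => "(" + c + ")")) + ")");
        return string.Join(" and ", clauses);
    }
}

public static class FacetFilterBuilderDemo
{
    public static void Main()
    {
        var priceGroup = new[]
        {
            "price ge 600 and price lt 800",
            "price ge 800 and price lt 1000"
        };
        var processorGroup = new[] { "processor eq 'i7'" };

        string filter = FacetFilterBuilder.Build(new[] { priceGroup, processorGroup });
        Console.WriteLine(filter);
        // prints ((price ge 600 and price lt 800) or (price ge 800 and price lt 1000)) and ((processor eq 'i7'))
    }
}
```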

There is no ability to use wild cards in filters

The only wild cards that are accepted are in the text search query, not for filters.  Say for example you’re creating a search for restaurants and you want to give the user the ability to search for a city and text search.  You need to assume that the user typed the entire city name and not just a fragment in order to get any results.

As far as I can figure there isn’t a reasonable way to overcome this issue.

Recommendation

As it stands I would certainly recommend using Solr in the cloud if you want to do any amount of work with the index.  A good cloud setup I have used is an IaaS VM running Solr with a virtual network into your PaaS Sitecore environment for fast connectivity.  Strangely enough, this also seems to cost less money than Azure Search.

Direct To Azure Search API

While I don’t recommend it at this time, the tool itself was quite easy to work with directly. Take a look at the documentation to get started.

Here is an example of taking the connection string Sitecore uses and creating an Azure Search API object.

			ConnectionStringSettings search = ConfigurationManager.ConnectionStrings["cloud.search"];
			if (search == null)
				throw new Exception("Missing connection string for Azure Search");
			Dictionary<string, string> connStringParts = search.ConnectionString.Split(';')
	.Select(t => t.Split(new char[] { '=' }, 2))
	.ToDictionary(t => t[0].Trim(), t => t[1].Trim(), StringComparer.InvariantCultureIgnoreCase);
			try
			{
				SearchServiceClient client = new SearchServiceClient(new Uri(connStringParts["serviceUrl"]),
					new SearchCredentials(connStringParts["apiKey"]));
			//use or cache client
			}
			catch (Exception e)
			{
				throw new Exception("Unable to use connection string values", e);
			}
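
The parsing step is worth a closer look on its own.  Below is a minimal sketch with a made-up connection string (the example.search.windows.net URL and the key are placeholders, and ConnectionStringParser is a hypothetical name).  Splitting each segment on the first '=' only keeps keys that themselves end in '=' intact, and skipping empty segments avoids an exception when the string ends with a semicolon.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; mirrors the Split/ToDictionary approach used above.
public static class ConnectionStringParser
{
    public static Dictionary<string, string> Parse(string connectionString)
    {
        if (connectionString == null) throw new ArgumentNullException(nameof(connectionString));

        return connectionString
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            // Limit the split to 2 parts so '=' characters inside the value are preserved.
            .Select(part => part.Split(new[] { '=' }, 2))
            // Ignore malformed segments that have no '=' at all.
            .Where(pair => pair.Length == 2)
            .ToDictionary(pair => pair[0].Trim(), pair => pair[1].Trim(),
                StringComparer.OrdinalIgnoreCase);
    }
}

public static class ConnectionStringParserDemo
{
    public static void Main()
    {
        // Placeholder values, not a real service.
        var parts = ConnectionStringParser.Parse(
            "serviceUrl=https://example.search.windows.net; apiKey=abc123==;");

        Console.WriteLine(parts["serviceurl"]); // prints https://example.search.windows.net
        Console.WriteLine(parts["apiKey"]);     // prints abc123==
    }
}
```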

This is an example of using the index to set up a search with faceting, pagination, and sorting.  Note that I’m using a constants class to abstract away string literals.

			SearchParameters parameters = new SearchParameters
			{
				QueryType = QueryType.Simple,
				Skip = page * 10,
				IncludeTotalResultCount = true,
				Top = 10,
				SearchFields = new List<string>
				{
					AzureSearchConstants.FirstName,
					AzureSearchConstants.LastName,
					AzureSearchConstants.GroupName,
					AzureSearchConstants.LocationName,
					AzureSearchConstants.Address,
					AzureSearchConstants.City,
					AzureSearchConstants.County,
					AzureSearchConstants.State,
					AzureSearchConstants.Zip
				},
				Filter = facetQuery.ToString(),
				Facets = new List<string> { AzureSearchConstants.TypeDescription },
				OrderBy = new List<string> { AzureSearchConstants.LastName }
			};
			var index = _client.Indexes.GetClient(AzureSearchConstants.IndexName);
			var results = index.Documents.Search(query.Trim().Replace(" ", "* ") + '*', parameters);
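
The last line turns the user’s text into a prefix query by appending * to every word.  Here is a slightly more defensive sketch of that step, assuming the simple query syntax where a trailing * means "starts with".  SearchText.ToPrefixQuery is a hypothetical name, and escaping of other query operators in the user’s input is not handled.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; it only builds the search text.
public static class SearchText
{
    public static string ToPrefixQuery(string userInput)
    {
        // An empty search box becomes "*", which matches every document.
        if (string.IsNullOrWhiteSpace(userInput))
            return "*";

        // Passing null splits on any whitespace; empty entries from repeated spaces are dropped.
        string[] terms = userInput.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", terms.Select(term => term + "*"));
    }
}

public static class SearchTextDemo
{
    public static void Main()
    {
        Console.WriteLine(SearchText.ToPrefixQuery("  john   smi ")); // prints john* smi*
        Console.WriteLine(SearchText.ToPrefixQuery(""));              // prints *
    }
}
```

Compared with a plain Replace(" ", "* "), splitting on whitespace avoids stray * terms when the user types two spaces in a row.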

No Speak Experience Profile Tab

For a client I was asked to collect some extra data, put it in xDB, and use it to personalize content.  The approach was simple: create a facet like Pete Navarra outlines here, then build some custom rules like I outlined here.  Finally, I was asked to add the collected data to the Experience Profile.  That’s when the pain came.

Adam Conn has outlined how to do it in the official Sitecore way here.  As you can see, the process involves building a SPEAK component for the tab, which is a long and very tedious process.  This led me to report it as a large task; the client wasn’t willing to take the extra time needed to get it all set up properly, and the request was abandoned.

This led me to think that there must be a better way, which I have found!

Enter the Experience Profile Express Tab

You can find the NuGet package here, and the source code and developer documentation here.

This module automates the construction of a SPEAK component and wraps it around a proper MVC structure, where you build a controller class that generates a POCO model from the contact and passes it to a view.

For my first application of this module I built a tab to show Demandbase data collected by the Sitecore Demandbase module (which you can contact your Demandbase sales rep to acquire).

This tab can be accomplished with a single C# class. First we take the data from the Demandbase facet, which is a JSON object. We deserialize this to a dictionary and dump it out to HTML.

	public class DemandbaseTab : EPExpressTab.Data.EpExpressModel
	{
		public override string RenderToString(Contact contact)
		{
			dynamic o = JsonConvert.DeserializeObject<ExpandoObject>(
				contact.GetFacet<IXdbFacetDemandbaseData>("Demandbase Data").DemandBaseData ?? "");
			StringBuilder sb = new StringBuilder();
			if (o == null)
				return "<div>Demandbase information not available.</div>";
			IDictionary<string, object> tst = (IDictionary<string, object>) o;
			bool even = false;
			foreach (string attr in tst.Keys)
			{
				if (tst[attr] is string)
				{
					sb.Append(
						$"<div style='background-color:{(even ? "#fff" : "#eee")}'><span style='width:200px;display:inline-block;font-weight:bold;font-size:medium;'>{UppercaseWords(attr)}</span>{tst[attr]}</div>");
					even = !even;
				}
			}
			return sb.ToString();
		}

		public override string Heading => "Demandbase Attributes";
		public override string TabLabel => "Demandbase";
		private string UppercaseWords(string value)
		{
			char[] array = value.ToCharArray();
			// Handle the first letter in the string.
			if (array.Length >= 1)
			{
				if (char.IsLower(array[0]))
				{
					array[0] = char.ToUpper(array[0]);
				}
			}
			// Scan through the letters, checking for spaces.
			// ... Uppercase the lowercase letters following spaces.
			for (int i = 1; i < array.Length; i++)
			{
				if (array[i - 1] == ' ')
				{
					if (char.IsLower(array[i]))
					{
						array[i] = char.ToUpper(array[i]);
					}
				}
				if (array[i] == '_')
				{
					array[i] = ' ';
				}
			}
			return new string(array);
		}
	}

TokenManager View Tokens

Likely fitting in the wheelhouse of most Sitecore developers is building a view model and passing it to a view to be rendered.  That’s what the ViewAutoToken class achieves.  The idea is that you collect data from the content authors at the time of token insertion, then use that data to build a view model and pass that model to a cshtml view.

Unique Aspects

When implementing a new view token you should extend the ViewAutoToken base class.  This is very similar to an AutoToken, except that instead of implementing a method to render the raw HTML output by the token, you define two methods: one to generate the view model and one to determine the view.

		public override object GetModel(TokenDataCollection extraData)
		{
			return extraData;
		}

		public override string GetViewPath(TokenDataCollection extraData)
		{
			return "/views/myToken.cshtml";
		}

AutoToken Features

All features of AutoTokens are available for ViewAutoTokens as well, such as gathering data from the content authors at insertion time for use during rendering, and filtering where the token may be used.

As usual with AutoTokens, you need only implement it in a loaded assembly and TokenManager will pick it up and wire it for use in RTEs.

Complete Example

	public class tokentest : ViewAutoToken
	{
		//Make sure you have a parameterless constructor.
		public tokentest() : base("test", "people/16x16/cubes_blue.png", "terkan")
		{
		}
		//This will add a button to the RTE.
		public override TokenButton TokenButton()
		{
			return new Data.TokenExtensions.TokenButton("test", "people/16x16/cubes_blue.png", 1000);
		}
		//These are the different fields that will be collected by the content authors at the time of insertion.
		public override IEnumerable<ITokenData> ExtraData()
		{
			yield return new GeneralLinkTokenData("LINK", "link", true);
			yield return new DroplistTokenData("Droplist", "droplist", true, new []
			{
				new KeyValuePair<string, string>("Text Label", "Value Passed"),
				new KeyValuePair<string, string>("Blue", "blue"),
			});
			yield return new BooleanTokenData("bool", "bool");
			yield return new IdTokenData("id", "id", true);
			yield return new IntegerTokenData("int", "int", true);
		}
		//These are the templates where the token may be used.
		public override IEnumerable<ID> ValidTemplates() {
			yield return new ID("{78816AC8-4FD7-43C4-A899-17829B4F3B72}");
		}
		//These are the root nodes that make a subtree where the token may be used.
		public override IEnumerable<ID> ValidParents()
		{
			yield return new ID("{A1E1342E-6836-4E20-A2C4-B1A38444B079}");
		}
		//Use the data gathered by the content author to assemble a view model.
		public override object GetModel(TokenDataCollection extraData)
		{
			return extraData;
		}
		//Use the data gathered by the content authors to define a path to the view cshtml.
		public override string GetViewPath(TokenDataCollection extraData)
		{
			return "/views/MyToken.cshtml";
		}
	}

And my view, found at [webroot]/views/MyToken.cshtml:

@using TokenManager.Data.TokenDataTypes.Support
@model TokenDataCollection
<div><strong>@Model.GetLink("link").Href</strong></div>
<div><strong>@Model.GetString("droplist")</strong></div>
<div><strong>@Model.GetBoolean("bool")</strong></div>
<div><strong>@Model.GetId("id")</strong></div>
<div><strong>@Model.GetInt("int")</strong></div>

Persistent Site and Lang Query string

I’ve always wondered why the default link provider of Sitecore doesn’t carry over site and language parameters.  Quite often I’ve found myself in a situation where the official site resolution for a Sitecore site relies on domain pattern matching.  This makes it difficult to test things on an authoring server or development server without the proper DNS names.

There is, however, a solution: a few minor tweaks to the default link provider.  The logic is simple: if the current URL contains an sc_site or sc_lang query string parameter, then generate all links with these parameters too.

Enter the SiteStaticLinkProvider.

	public class SiteStaticLinkProvider : LinkProvider
	{
		public override string GetItemUrl(Item item, UrlOptions options)
		{
			string urlString = base.GetItemUrl(item, options);
			if (HttpContext.Current?.Request.QueryString == null)
				return urlString;
			string[] urlParts = urlString.Split('?');
			NameValueCollection qs = null;
			NameValueCollection currentqs = HttpContext.Current.Request.QueryString;
			if (!string.IsNullOrWhiteSpace(currentqs["sc_site"]))
			{
				qs = HttpUtility.ParseQueryString(urlParts.Length >= 2 ? urlParts[1] : "");
				if (string.IsNullOrWhiteSpace(qs["sc_site"]))
				{
					qs.Add("sc_site", currentqs["sc_site"]);
				}
			}
			if (!string.IsNullOrWhiteSpace(currentqs["sc_lang"]))
			{
				if (qs == null)
					qs = HttpUtility.ParseQueryString(urlParts.Length >= 2 ? urlParts[1] : "");
				if (string.IsNullOrWhiteSpace(qs["sc_lang"]))
				{
					qs.Add("sc_lang", currentqs["sc_lang"]);
				}
			}
			if (qs != null)
			{
				return urlParts[0] + '?' + qs;
			}
			return urlString;
		}
	}

This provider is a good all purpose link provider because if there are no pertinent parameters present it will not do anything.

The end result here is that to test any site in a pre-prod environment you only need to add the sc_lang or sc_site parameter once and it will follow you around the site, making this very easy for content approvers.
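
If you want to try the merging logic without a running Sitecore instance, it can be pulled out into a plain function.  This is a sketch only: PersistentQueryString.Apply is a hypothetical name, and the current request’s query string is passed in as a parameter instead of being read from HttpContext.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

// Hypothetical helper for illustration; same idea as the provider above, minus Sitecore.
public static class PersistentQueryString
{
    private static readonly string[] PersistentKeys = { "sc_site", "sc_lang" };

    public static string Apply(string url, NameValueCollection current)
    {
        string[] parts = url.Split(new[] { '?' }, 2);
        NameValueCollection qs = HttpUtility.ParseQueryString(parts.Length == 2 ? parts[1] : "");
        bool changed = false;

        foreach (string key in PersistentKeys)
        {
            string value = current[key];
            // Carry the parameter over only if the generated link does not already set it.
            if (!string.IsNullOrWhiteSpace(value) && string.IsNullOrWhiteSpace(qs[key]))
            {
                qs[key] = value;
                changed = true;
            }
        }

        // The collection returned by ParseQueryString renders itself as an encoded query string.
        return changed ? parts[0] + "?" + qs : url;
    }
}

public static class PersistentQueryStringDemo
{
    public static void Main()
    {
        var current = new NameValueCollection { { "sc_site", "mobile" } };
        Console.WriteLine(PersistentQueryString.Apply("/products", current));
        Console.WriteLine(PersistentQueryString.Apply("/products?page=2", current));
        Console.WriteLine(PersistentQueryString.Apply("/products", new NameValueCollection()));
    }
}
```

With sc_site=mobile on the current request, the first two links gain the parameter and the third call, with no pertinent parameters, returns the URL untouched.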

Wire it up!

There are a few options available to override a link provider. You can add a new provider, then change the reference on the providers node to point to your new provider. Slightly simpler, however, is to straight up override the default Sitecore provider like I’ve done below.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
	<sitecore>
		<linkManager>
			<providers>
				<add name="sitecore">
					<patch:attribute name="type">[Namespace].SiteStaticLinkProvider, [Binary Name]</patch:attribute>
				</add>
			</providers>
		</linkManager>
	</sitecore>
</configuration>

Search PDF content in Sitecore

To people who have not tried to do this themselves, this seems like an easy task. All we need to do is get all the text content and load it into the search index. Initially I thought I had a good solution with PdfSharp using code that I found in this Stack Overflow post.  It seemed to be working fine until I attempted to run my site on Azure.  It apparently uses lower level OS-based API calls that are just not available on Azure using the new Sitecore PaaS setup.

There are several paid libraries that claim to be able to accomplish just this; however, like most developers, I wasn’t about to pitch buying a license to read PDF content to my clients. So the search continued.  After many hours (which I hope to save you from here) I came across a solution that did the trick (for the most part).

Reading PDF content

This code does require PdfSharp as a dependency; get it here on NuGet.

NOTE: this code was adapted from this Stack Overflow post and is not entirely my own, although I don’t think the poster on Stack Overflow originated the code either.  Credit is due somewhere, but I’m not quite sure where.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using Sitecore.Data.Items;

namespace IHN.Feature.Component
{
	/// <summary>
	/// Adapted from code found here http://stackoverflow.com/questions/83152/reading-pdf-documents-in-net
	/// </summary>

	public class SitecorePdfParser
	{
		private int _numberOfCharsToKeep = 15;
		private PdfDocument _doc;

		public SitecorePdfParser(Item item): this(new MediaItem(item))
		{
		}
		public SitecorePdfParser(MediaItem item)
		{
			if (item.MimeType != "application/pdf")
				return;
			Stream s = item.GetMediaStream();
			_doc = PdfReader.Open(s);
		}

		public SitecorePdfParser(PdfDocument document)
		{
			_doc = document;
		}

		public IEnumerable<string> ExtractText()
		{
			if (_doc == null)
				yield break;
			foreach (PdfPage page in _doc.Pages)
			{
				for (int index = 0; index < page.Contents.Elements.Count; index++)
				{

					PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(index).Stream;
					foreach (string text in ExtractTextFromPdfBytes(stream.Value))
					{
						yield return text;
					}
				}
			}
		}
		/// <summary>
		/// This method processes an uncompressed Adobe (text) object
		/// and extracts text.
		/// </summary>

		/// <param name="input">uncompressed</param>
		/// <returns></returns>
		public IEnumerable<string> ExtractTextFromPdfBytes(byte[] input)
		{
			if (input == null || input.Length == 0) yield break;
			StringBuilder resultString = new StringBuilder();
			bool inTextObject = false;
			bool nextLiteral = false;
			int bracketDepth = 0;
			char[] previousCharacters = new char[_numberOfCharsToKeep];
			for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
			foreach (byte t in input)
			{
				char c = (char)t;
				if (inTextObject)
				{
					// Position the text
					if (bracketDepth == 0)
					{
						if (CheckToken(new[] { "TD", "Td" }, previousCharacters) || CheckToken(new[] { "'", "T*", "\"" }, previousCharacters) || CheckToken(new[] { "Tj" }, previousCharacters))
						{
							if (resultString.Length > 0)
							{
								yield return CleanupContent(resultString.ToString());
								resultString.Clear();
							}
						}
					}

					if (bracketDepth == 0 &&
						CheckToken(new string[] { "ET" }, previousCharacters))
					{
						inTextObject = false;
						if (resultString.Length > 0)
						{
							yield return CleanupContent(resultString.ToString());
							resultString.Clear();
						}
						continue;
					}

					if (c == '(' && bracketDepth == 0 && !nextLiteral)
					{
						bracketDepth = 1;
					}
					else if (c == ')' && bracketDepth == 1 && !nextLiteral)
					{
						bracketDepth = 0;
					}
					else if (bracketDepth == 1)
					{
						if (c == '\\' && !nextLiteral)
						{
							nextLiteral = true;
						}
						else
						{
							if (c == ' ')
							{
								if (resultString.Length > 0)
								{
									yield return CleanupContent(resultString.ToString());
									resultString.Clear();
								}
							}
							else if ((c >= '!' && c <= '~') ||
									 (c >= 128 && c < 255))
							{
								resultString.Append(c);
							}
							nextLiteral = false;
						}
					}
				}

				// Store the recent characters for
				// when we have to go back for a checking
				for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
				{
					previousCharacters[j] = previousCharacters[j + 1];
				}
				previousCharacters[_numberOfCharsToKeep - 1] = c;

				if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
				{
					inTextObject = true;
				}
			}
		}
		private string CleanupContent(string text)
		{
			string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
			string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };

			for (int i = 0; i < patterns.Length; i++)
			{
				string regExPattern = patterns[i];
				Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
				text = regex.Replace(text, replace[i]);
			}

			return text;
		}
		/// <summary>
		/// Check if one of the given 2 character tokens just came along (e.g. BT)
		/// </summary>
		/// <param name="tokens">the tokens to search for</param>
		/// <param name="recent">the recent character array</param>
		/// <returns>true if one of the tokens was just read</returns>
		private bool CheckToken(string[] tokens, char[] recent)
		{
			foreach (string token in tokens)
			{
				if (token.Length > 1)
				{
					if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
						(recent[_numberOfCharsToKeep - 2] == token[1]) &&
						((recent[_numberOfCharsToKeep - 1] == ' ') ||
						(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
						(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
						((recent[_numberOfCharsToKeep - 4] == ' ') ||
						(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
						(recent[_numberOfCharsToKeep - 4] == 0x0a))
						)
					{
						return true;
					}
				}
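				else
				{
					// Single character operators (' and ") are not matched here. Skip them
					// instead of returning, so later tokens in the list (e.g. T*) are still checked.
					continue;
				}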
			}
			return false;
		}
	}
}

Then we need to wire this up to the index crawler to make sure that the index uses this class to populate the search index with our PDF content.

We need to implement a Sitecore IComputedIndexField class to accomplish this.

	public class IndexPdfContent : IComputedIndexField
	{
		public object ComputeFieldValue(IIndexable indexable)
		{
			try
			{
				var sitecoreIndexable = indexable as SitecoreIndexableItem;

				if (sitecoreIndexable == null) return null;

				var pdfContent = new SitecorePdfParser(new MediaItem(sitecoreIndexable)).ExtractText().ToList();

				if (pdfContent.Count == 0) return null;

				return string.Join(" ", pdfContent);
			}
			catch (Exception e)
			{
				Log.Error("Unable to assemble PDF content for the search index ", e, this);
				return null;
			}
		}
	}

And finally, wire it up to the indexer:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
	<sitecore>
		<contentSearch>
			<indexConfigurations>
				<defaultLuceneIndexConfiguration>
					<documentOptions>
						<fields hint="raw:AddComputedIndexField">
							<!-- indexes pdf contents into index _content field to allow PDF search -->
							<field fieldName="_pdfcontent" type="[NAMESPACE].IndexPdfContent, [DLL NAME]" />
						</fields>
					</documentOptions>
				</defaultLuceneIndexConfiguration>
				<defaultSolrIndexConfiguration>
					<documentOptions>
						<fields hint="raw:AddComputedIndexField">
							<!-- indexes pdf contents into index _content field to allow PDF search -->
							<field fieldName="_pdfcontent" type="[NAMESPACE].IndexPdfContent, [DLL NAME]" />
						</fields>
					</documentOptions>
				</defaultSolrIndexConfiguration>
				<defaultCloudIndexConfiguration>
					<documentOptions>
						<fields hint="raw:AddComputedIndexField">
							<!-- indexes pdf contents into index _content field to allow PDF search -->
							<field fieldName="pdf_content" cloudFieldName="pdf_content" type="[NAMESPACE].IndexPdfContent, [DLL NAME]" />
						</fields>
					</documentOptions>
				</defaultCloudIndexConfiguration>
			</indexConfigurations>
		</contentSearch>
	</sitecore>
</configuration>

Ending Results

Now we have our search index populated with PDF contents. So if someone wants to find a PDF with a text search, it’s as simple as querying the index on the field assigned in the XML with the user’s search text.

Disclaimer

While this solution is quite good, it’s not perfect. If you have text in PDF images, it won’t find that. Additionally, I’ve noticed that in rare cases words might be broken up when they’re being extracted. Presumably this is due to PDF formatting. If you happen to figure out how to resolve this completely, let me know and I’d love to update this code.

Sitecore RTE Button Postprocessing

There may be times that you want to modify the way stock Sitecore RTE buttons work without actually modifying stock Sitecore files.  An easy way to accomplish this is to override the Telerik editor commands manually using a custom JavaScript file.

Some common uses of this technique could include

  1. Adding classes to injected elements.
  2. Wrapping injected elements in a wrapping element.
  3. Adding a sibling html element for an icon perhaps.
  4. Modifying the markup for SEO needs.
  5. Modifying the markup to build a responsive website.

Find the operation to patch

The first thing you need to do is find the RTE command for the button you’d like to add postprocessing to.  The easiest way to do this is by using your browser’s inspect feature on the button you’d like to enhance.

The class of the span element that makes up the button is the name of the command you’re interested in.  At this point you can start writing your JavaScript.

The JavaScript

var RadEditorCommandList = Telerik.Web.UI.Editor.CommandList;

// Keep a reference to the stock command so it can still be called
var table = RadEditorCommandList["InsertTable"];
RadEditorCommandList["InsertTable"] = function (commandName, editor, args) {
	table(commandName, editor, args);
	// The selection sits inside a cell after insertion; walk up to the table element
	var p = editor.getSelectedElement().parentNode.parentNode.parentNode;
	p.classList.add("editor-table");
};

This code will modify the insert table button to add a class of editor-table to the table after it’s injected.

So what are we doing here?  Let’s analyze it.

  1. Get the Telerik editor command list object.  This object stores the JavaScript that drives each of the buttons in the editor.
  2. Save the original command into a custom variable called table.
  3. Replace the method attached to the Telerik editor command list with our own function, which first calls the original command.
  4. Use the Telerik editor object to get the selected element after the table is inserted, and traverse up to the table node.
  5. Add a class of editor-table to the table root.

 
You’ll likely need to utilize the debugger to drop breakpoints down in your code and use the console to identify the correct element you’ll need to manipulate.

Having Sitecore add your javascript to the editor

There’s a simple config patch to add your javascript to the RTE editor.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
	<sitecore>
		<clientscripts>
			<htmleditor>
				<script key="customsrc" src="/relative/path/to/customSitecore.js" language="JavaScript"/>
			</htmleditor>
		</clientscripts>
	</sitecore>
</configuration>

This method is not limited to postprocessing only; you could essentially take stock methods and do whatever you please with them. The sky’s the limit, so go have fun with it.