Sharing Header/Footer across platforms

I had a requirement that a site I was building had to source its headers and footers from a site owned by the parent company on a separate platform.  Sounds a bit insane, but it's doable.

Make certain your site is extremely clean for JS and CSS

The first thing you’re going to want to verify is that you have nothing targeting general elements.  All your styling should be done with very specific class targeting.  Something like “my-secret-class” is great, whereas “form” not so much.  Even worse would be styling root-level elements, such as assigning styles directly to the li element.

In short, don’t use any JS/CSS that could interfere with the markup coming in from the other domain.

Scrape and cache

Next you’ll want to scrape the source site and parse out its header/footer and all CSS/JS using HtmlAgilityPack.

		private readonly Dictionary<string, string> _referrerHeaders = new Dictionary<string, string>();
		private readonly Dictionary<string, string> _referrerFooter = new Dictionary<string, string>();
		private readonly object _refreshLocker = new object();

		public virtual string GetHeader()
		{
			lock (_refreshLocker)
			{
				ValidateUrl(GetOriginModel()?.ReturnUrl);
				string ret;
				_referrerHeaders.TryGetValue(GetOriginModel()?.ReturnUrl ?? "", out ret);
				return ret ?? "";
			}
		}
		public virtual string GetFooter()
		{
			lock (_refreshLocker)
			{
				ValidateUrl(GetOriginModel()?.ReturnUrl);
				string ret;
				_referrerFooter.TryGetValue(GetOriginModel()?.ReturnUrl ?? "", out ret);
				return ret ?? "";
			}
		}

		public virtual void ValidateUrl(string url)
		{
			if (string.IsNullOrWhiteSpace(url) || url.StartsWith("/"))
				return;
			if (!_referrerHeaders.ContainsKey(url))
			{
				HtmlDocument doc = new HtmlDocument();
				using (WebClient wc = new WebClient())
				{
					wc.Encoding = Encoding.UTF8;
					doc.LoadHtml(wc.DownloadString(url));
				}
				_referrerHeaders[url] = GenerateHeader(url, doc);
				_referrerFooter[url] = GenerateFooter(doc);
			}
		}
		public virtual string GenerateFooter(HtmlDocument doc)
		{
			return GetNodesByAttribute(doc, "class", "site-footer").FirstOrDefault()?.OuterHtml;
		}

		public virtual string GenerateHeader(string url, HtmlDocument doc)
		{
			Uri uri = new Uri(url);
			string markup =  GetNodesByAttribute(doc, "class", "site-header").FirstOrDefault()?.OuterHtml.Replace("action=\"/", $"action=\"https://{uri.Host}/");
			string svg = GetNodesByAttribute(doc, "class", "svg-legend").FirstOrDefault()?.OuterHtml;
			string stylesheets =
				GetNodesByAttribute(doc, "rel", "stylesheet")
					.Aggregate(new StringBuilder(), (tags, cur) => tags.Append(cur.OuterHtml.Replace("href=\"/bundles", $"href=\"https://{uri.Host}/bundles")))
					.ToString();
			string javascripts =
				doc.DocumentNode.SelectNodes("//script")
					.Aggregate(new StringBuilder(), (tags, cur) =>
					{
						if (cur.OuterHtml.Contains("gtm.js"))
							return tags;
						return tags.Append(cur.OuterHtml.Replace("src=\"/bundles", $"src=\"https://{uri.Host}/bundles"));
					})
					.ToString();

			return $"{svg}{stylesheets}{markup}{javascripts}";
		}

		public virtual HtmlNodeCollection GetNodesByAttribute(HtmlDocument doc, string attribute, string value)
		{
			return doc.DocumentNode.SelectNodes($"//*[contains(@{attribute},'{value}')]");
		}

NOTE: You’ll likely need to heavily customize your GenerateHeader and GenerateFooter methods.

Let’s break this down, as it’s a bit hard to follow.

  1. You pass in the URL you want to source your header and footer from.
  2. The service checks the cache to see if we already have that header/footer.
  3. Using a WebClient, it scrapes the markup off the source page.
  4. By whatever means available, we identify which markup makes up the header and footer. In this case each is identifiable by a class of “site-header” or “site-footer”, which makes it easier.
  5. We turn any relative links into absolute links, since relative links won’t work once the markup is served from a separate domain.
  6. We grab their SVG sprite definition; without it, their icons will be blank.
  7. We grab all stylesheets and scripts, making sure to strip out the things that don’t make sense on a case-by-case basis, like the other domain’s tracking libraries.
  8. We store this information in the cache.

Make sure you periodically clear the caches to pick up changes from the source.  I did this simply, like this:

		public SiteComponentShareService()
		{
			Timer t = new Timer(600 * 1000);
			t.Elapsed += (sender, args) =>
			{
				lock (_refreshLocker)
				{
					_referrerHeaders.Clear();
					_referrerFooter.Clear();
				}
			};
			t.Start();
		}

This clears the cache objects every 10 minutes, using the same lock to make sure the cache isn’t cleared while something is trying to use it.

Finishing Touches

The acquired header and footer may have fancy XHR requirements that need to be accounted for. Very likely you’ll need to proxy requests. For example, I needed to catch search suggestion requests and pass them through to the other domain’s HawkSearch endpoint.

		[Route("hawksearch/proxyautosuggest/{target}")]
		public ActionResult RerouteAutosuggest(string target)
		{
			using (WebClient wc = new WebClient())
			{
				string ret = wc.DownloadString(
					$"https://www.parentsitewherewefoundtheheaders.org/hawksearch/proxyAutoSuggest/{target}?{Request.QueryString}");
				return Content(ret);
			}
		}

As you can see, we’re simply catching the request and passing it along to their domain’s endpoint. Since we have the same JavaScript and the same header markup, this simple pass-through makes it seem like we have the exact same header.

Become a Sitecore PDF Ninja

I’m going to start this off by saying PDFs are evil, and if you can avoid using them, I implore you to do so at all costs.  They will inevitably lead to a lot of frustration.

In our world today PDFs are incredibly prevalent.  It seems like almost every organization has a collection of PDFs for download.  Users and corporations alike have embraced the PDF completely; however, that doesn’t change the fact that they are incredibly annoying to manage programmatically and dynamically.

In a C# world you have two main choices for managing PDFs.  The first is iTextSharp.  However, I didn’t look into this library much once I noticed its pricing model.  In a nutshell, it’s free as long as whatever you’re building is completely open source, and I suspect the vast majority of Sitecore clients are closed source.  Unfortunately, it also looks like a sizable number of people have missed this fact and are using the library for commercial gain, potentially opening themselves up to a lawsuit.  Scary stuff, so I looked elsewhere.

I chose to instead focus on PdfSharp, which is free for any situation.  There’s also a companion tool called MigraDoc built specifically for constructing PDFs, which I found particularly handy.

I have already outlined a solution to make PDFs searchable in the Sitecore search index.  Here I’m going to share a few more tricks I’ve discovered.

Generating PDFs

If you want to generate a PDF out of markup, you’re generally going to be out of luck: due to the dramatic differences in the medium (HTML being for screens, PDFs being for printing), you’re never going to get it perfect.  I believe iTextSharp has a method to do this, but PdfSharp does not.  I did see this workaround, which I thought was interesting and perhaps worth a try.

I chose to use MigraDoc which ended up being quite easy.  There are a few paradigm changes that you need to understand.

  1. There are no pixels in PDFs; measurements are in actual lengths (inches, centimeters, etc.).
  2. Each page is its own entity that can have different widths, headers, footers, margins, etc.
  3. There are elements similar to most HTML elements, such as paragraphs, headings, tables, etc.
  4. Each element has default settings for sizes and spacing that can be overridden on an individual basis.

Here is a sample of setting up default elements and page settings

		private static void PdfDocumentSetup(Document doc)
		{
			//Default text
			Style style = doc.Styles["Normal"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(12);
			//Body Text
			style = doc.Styles.AddStyle("Paragraph2", "Normal");
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(12);
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//Title
			style = doc.Styles["Heading1"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(45);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading2"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(16);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading3"];
			style.Font.Name = "Arial Narrow";
			style.Font.Size = Unit.FromPoint(20);
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//SubHeading
			style = doc.Styles["Heading4"];
			style.Font.Name = "Arial";
			style.Font.Size = Unit.FromPoint(16);
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.PageBreakBefore = false;
			//Column Heading
			style = doc.Styles["Heading5"];
			style.Font.Name = "Arial";
			style.Font.Size = Unit.FromPoint(20);
			style.Font.Color = Color.FromRgbColor(255, new Color(0, 128, 192));
			style.Font.Bold = true;
			style.ParagraphFormat.SpaceAfter = 6;
			style.ParagraphFormat.SpaceBefore = 12;
			style.ParagraphFormat.PageBreakBefore = false;
			//Bullets
			style = doc.Styles.AddStyle("Bullets", "Normal");
			style.ParagraphFormat.LeftIndent = Unit.FromInch(1.25);
			// Underlined section heading
			style = doc.Styles.AddStyle("Heading3Underlined", "Heading3");
			style.ParagraphFormat.Borders.Bottom = new Border() { Width = Unit.FromMillimeter(1), Color = Colors.Black };
			doc.DefaultPageSetup.PageHeight = Unit.FromInch(11);
			doc.DefaultPageSetup.PageWidth = Unit.FromInch(8.5);
			doc.DefaultPageSetup.LeftMargin = Unit.FromInch(.5);
			doc.DefaultPageSetup.RightMargin = Unit.FromInch(.5);
			doc.DefaultPageSetup.FooterDistance = Unit.FromInch(.75);
			doc.DefaultPageSetup.HeaderDistance = Unit.FromInch(.75);
			doc.DefaultPageSetup.TopMargin = Unit.FromInch(1.5);
			doc.DefaultPageSetup.BottomMargin = Unit.FromInch(2);
		}
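With those defaults in place, building a document is mostly a matter of adding sections and paragraphs by style name. Here’s a minimal sketch; the content and file name are hypothetical, but the style names match the setup above:

```csharp
// Sketch: build a document using the styles from PdfDocumentSetup.
// The content and "report.pdf" output path are hypothetical.
Document doc = new Document();
PdfDocumentSetup(doc);

Section section = doc.AddSection();
section.AddParagraph("Annual Report", "Heading1");
section.AddParagraph("Overview", "Heading2");
section.AddParagraph("Body copy goes here.", "Paragraph2");
section.AddParagraph("First point", "Bullets");

// Render the MigraDoc document to a PDF file
PdfDocumentRenderer renderer = new PdfDocumentRenderer(true) { Document = doc };
renderer.RenderDocument();
renderer.PdfDocument.Save("report.pdf");
```

Because the styles are centralized, changing a font or spacing in PdfDocumentSetup restyles every document that uses it.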

Aggregate PDFs

You might need to combine two PDFs, or take a cover letter PDF and combine it with a generated portion of the PDF.  In my case I had to take a customized cover letter and prepend it to a table output of data.

PdfSharp makes this amazingly easy.  Simply open both PDF sources in PdfSharp.  For a MigraDoc-generated document, you can do this by saving the generated PDF to a stream, then opening the stream in PdfSharp.

			MemoryStream ret = new MemoryStream();
			PdfDocumentRenderer renderer = new PdfDocumentRenderer(true) { Document = doc };
			renderer.RenderDocument();
			renderer.Save(ret, false);

The above code will take the MigraDoc document (doc) and render it to a memory stream, which can then be opened in PdfSharp:

			//_sitecore is a Sitecore Item API abstraction service to allow testability
			var coverletter = PdfReader.Open(_sitecore.GetPdfCoverletterStream(item), PdfDocumentOpenMode.Import);
			var pdf = PdfReader.Open(ret); // "ret" is the MemoryStream we rendered to above
			for (int i = 0; i < coverletter.PageCount; i++)
			{
				var newPage = coverletter.Pages[i];
				pdf.Pages.Insert(i, newPage);
			}
			MemoryStream output = new MemoryStream();
			pdf.Save(output, false);
			return output;

You simply take each page from one document and insert it into the other, then save the result however you need (a stream, in our case).

Injecting and reading PDFs from Sitecore Media

Getting PDFs from Sitecore is easy. You can use the MediaManager to get the PDF stream like so.

		public Stream GetPdfStream(Item pdf)
		{
			return MediaManager.GetMedia(pdf).GetStream().Stream;
		}

Once you’ve made your modifications you can write your PDF to a Sitecore media item like so:

					using (new SecurityDisabler())
					{
						pdfItem.Editing.BeginEdit();
						pdfItem.Fields["Blob"].SetBlobStream(pdf);//our stream that we were working with
						pdfItem.Fields["Extension"].Value = "pdf";
						pdfItem.Fields["Mime Type"].Value = "application/pdf";
						pdfItem.Editing.EndEdit();
					}

Find and replace tokens inside a PDF

			var coverletter = PdfReader.Open(_sitecore.GetPdfStream(item), PdfDocumentOpenMode.Import);
			for (int i = 0; i < coverletter.PageCount; i++)
			{
				var newPage = coverletter.Pages[i];

				for (int j = 0; j < newPage.Contents.Elements.Count; j++)
				{
					PdfDictionary.PdfStream stream = newPage.Contents.Elements.GetDictionary(j).Stream;
					var inStream = stream.Value;
					StringBuilder stringStream = new StringBuilder();
					foreach (byte b in inStream)
						stringStream.Append((char)b);

					stringStream = stringStream.Replace("__day__", DateTime.Now.Day.ToString()).Replace("__month__", DateTime.Now.ToString("MMMM")).Replace("__year__", DateTime.Now.Year.ToString());

					newPage.Contents.Elements.GetDictionary(j).Stream.Value = Encoding.UTF8.GetBytes(stringStream.ToString());
				}
			}

In this approach we can see that if we read the content stream as characters, it will contain the text used in the PDF. BEFORE YOU GO THINKING THIS WILL ALWAYS WORK (it likely won’t): there are many things that need to be the case for this to work, as seen here.

As far as I can figure, a surefire way of making this work isn’t possible.  However, if you can get your content authors to follow a standard PDF creation process with some tweaks, you can get it to work.

Sitecore Azure Search Issues

As soon as Sitecore 8.2 came out with a PaaS option, I adopted it immediately for a client. The main driving factor was that the client was a Microsoft shop, and the idea of running a Java-based search tool (Solr) was a hard thing for them to swallow. They loved the idea of Azure Search and bought into it immediately.

I had my concerns about using a new technology in Sitecore, but I decided to give it a try anyway. I found a number of trouble spots, both with Sitecore’s implementation and with some limitations of the tool in general.

Sitecore API bugs

Sitecore Search API unable to query by datetimes.

If you’re using Sitecore’s API to access the search index, which is Sitecore’s recommended approach, you are unable to query by datetimes.  So if you’re building an event search tool, you might want to either reconsider using Sitecore’s search API or go directly against Azure Search.

Cannot query for ID right after an app pool recycle

Immediately following an app pool recycle, the Sitecore search API is unable to query by item ID.  This, however, can be alleviated by going directly to the index.
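For illustration, going direct for an ID lookup can look something like the sketch below, using the same Microsoft.Azure.Search SDK shown later in this post. The index name and the ID field name here are assumptions; check your index schema for the real ones.

```csharp
// Sketch: look an item up by ID directly against the Azure Search index,
// bypassing Sitecore's search API. "sitecore-web-index" and "itemid"
// are hypothetical names.
var index = _client.Indexes.GetClient("sitecore-web-index");
SearchParameters parameters = new SearchParameters
{
	Filter = $"itemid eq '{itemId.Guid}'",
	Top = 1
};
var result = index.Documents.Search("*", parameters).Results.FirstOrDefault();
```

Since this skips Sitecore’s abstraction entirely, it works even right after an app pool recycle.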

Azure Search shortcomings


Azure Search only facets with AND logic never OR logic

A very common search scenario is OR logic within a particular facet group and AND logic between groups.  For example, if you are looking for a new PC, you might want one that has an i7 processor and is in the price range of 600 – 1200 dollars, given range facets of 600 – 800, 800 – 1000, and 1000 – 1200.  Logically you would select all three options to cover the range and processor you want.  However, faceting in Azure Search makes this impossible: as soon as you select 600 – 800, all the other price options disappear, since no computer can fall into two separate price ranges.

Hypothetically, you could overcome this by making multiple queries to the index, but it would be a quite complex solution and could increase the load on your index many times over.
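For what it’s worth, Azure Search’s OData filter syntax itself can express OR within a group; it’s the facet-driven selection flow that can’t produce it. If you did hand-build the filters, assembling a group might look like this sketch (the price and processor field names are assumptions):

```csharp
// Sketch: combine the user's selected ranges with OR inside the group,
// then AND that group with other facet groups.
// The "price" and "processor" fields are hypothetical.
var selectedRanges = new List<(int Min, int Max)> { (600, 800), (1000, 1200) };
string priceGroup = "(" + string.Join(" or ",
	selectedRanges.Select(r => $"(price ge {r.Min} and price lt {r.Max})")) + ")";
string filter = priceGroup + " and processor eq 'i7'";
// filter: "((price ge 600 and price lt 800) or (price ge 1000 and price lt 1200)) and processor eq 'i7'"
```

The pain point is that you’d also need extra queries to keep the other facet counts accurate after each selection, which is where the load multiplies.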

There is no ability to use wild cards in filters

The only wildcards accepted are in the text search query, not in filters.  Say, for example, you’re creating a restaurant search and you want to give the user the ability to filter by city along with a text search.  You have to assume the user typed the entire city name, not just a fragment, in order to get any results.

As far as I can figure, there isn’t a reasonable way to overcome this issue.

Recommendation

As it stands, I would certainly recommend using Solr in the cloud if you plan to do any significant amount of work with the index.  A good cloud setup I have used is an IaaS VM running Solr with a virtual network into your PaaS Sitecore environment for fast connectivity.  Strangely enough, this also seems to cost less than Azure Search.

Direct To Azure Search API

While I don’t recommend it at this time, the tool itself was quite easy to work with directly. Take a look at the documentation to get started.

Here is an example of taking the connection string Sitecore uses and creating an Azure Search API object.

			ConnectionStringSettings search = ConfigurationManager.ConnectionStrings["cloud.search"];
			if (search == null)
				throw new Exception("Missing connection string for Azure Search");
			Dictionary<string, string> connStringParts = search.ConnectionString.Split(';')
				.Select(t => t.Split(new char[] { '=' }, 2))
				.ToDictionary(t => t[0].Trim(), t => t[1].Trim(), StringComparer.InvariantCultureIgnoreCase);
			try
			{
				SearchServiceClient client = new SearchServiceClient(new Uri(connStringParts["serviceUrl"]),
					new SearchCredentials(connStringParts["apiKey"]));
				//use or cache client
			}
			catch (Exception e)
			{
				throw new Exception("Unable to use connection string values", e);
			}

This is an example of using the index to set up a search with faceting, pagination, and sorting.  Note that I’m using a constants class to abstract away string literals.

			SearchParameters parameters = new SearchParameters
			{
				QueryType = QueryType.Simple,
				Skip = page * 10,
				IncludeTotalResultCount = true,
				Top = 10,
				SearchFields = new List<string>
				{
					AzureSearchConstants.FirstName,
					AzureSearchConstants.LastName,
					AzureSearchConstants.GroupName,
					AzureSearchConstants.LocationName,
					AzureSearchConstants.Address,
					AzureSearchConstants.City,
					AzureSearchConstants.County,
					AzureSearchConstants.State,
					AzureSearchConstants.Zip
				},
				Filter = facetQuery.ToString(),
				Facets = new List<string> { AzureSearchConstants.TypeDescription },
				OrderBy = new List<string> { AzureSearchConstants.LastName }
			};
			var index = _client.Indexes.GetClient(AzureSearchConstants.IndexName);
			var results = index.Documents.Search(query.Trim().Replace(" ", "* ") + '*', parameters);

No Speak Experience Profile Tab

For a client, I was asked to collect some extra data, put it in xDB, and use it to personalize content.  The approach was simple: create a facet like Pete Navarra outlines here, then build some custom rules like I outlined here.  Finally, I was asked to add the collected data to the Experience Profile.  That’s when the pain came.

Adam Conn has outlined how to do it the official Sitecore way here.  As you can see, the process involves building a SPEAK component for the tab, which is a long and very tedious process.  This led me to report it as a large task; the client wasn’t willing to take the extra time needed to get it all set up properly, and the request was abandoned.

This led me to think that there must be a better way, which I have found!

Enter the Experience Profile Express Tab

You can find the NuGet package here, and the source code and developer documentation here.

This module automates the construction of a SPEAK component and wraps it in a proper MVC structure, where you build a controller class that generates a POCO model from the contact and passes it to a view.

For my first application of this module I built a tab to show Demandbase data collected by the Sitecore Demandbase module (which you can contact your Demandbase sales rep to acquire).

[Screenshot: the Demandbase tab in the Experience Profile]

This tab can be accomplished with a single C# class. First we take the data from the Demandbase facet, which is a JSON object. We deserialize it to a dictionary and dump it out as HTML.

	public class DemandbaseTab : EPExpressTab.Data.EpExpressModel
	{
		public override string RenderToString(Contact contact)
		{
			dynamic o = JsonConvert.DeserializeObject<ExpandoObject>(
				contact.GetFacet<IXdbFacetDemandbaseData>("Demandbase Data").DemandBaseData ?? "");
			StringBuilder sb = new StringBuilder();
			if (o == null)
				return "<div>Demandbase information not available.</div>";
			IDictionary<string, object> tst = (IDictionary<string, object>) o;
			bool even = false;
			foreach (string attr in tst.Keys)
			{
				if (tst[attr] is string)
				{
					sb.Append(
						$"<div style='background-color:{(even ? "#fff" : "#eee")}'><span style='width:200px;display:inline-block;font-weight:bold;font-size:medium;'>{UppercaseWords(attr)}</span>{tst[attr]}</div>");
					even = !even;
				}
			}
			return sb.ToString();
		}

		public override string Heading => "Demandbase Attributes";
		public override string TabLabel => "Demandbase";
		private string UppercaseWords(string value)
		{
			char[] array = value.ToCharArray();
			// Handle the first letter in the string.
			if (array.Length >= 1)
			{
				if (char.IsLower(array[0]))
				{
					array[0] = char.ToUpper(array[0]);
				}
			}
			// Scan through the letters, checking for spaces.
			// ... Uppercase the lowercase letters following spaces.
			for (int i = 1; i < array.Length; i++)
			{
				if (array[i - 1] == ' ')
				{
					if (char.IsLower(array[i]))
					{
						array[i] = char.ToUpper(array[i]);
					}
				}
				if (array[i] == '_')
				{
					array[i] = ' ';
				}
			}
			return new string(array);
		}
	}

TokenManager View Tokens

Likely fitting in the wheelhouse of most Sitecore developers is building a view model and passing it to a view to be rendered.  That’s what the ViewAutoToken class achieves.  The idea is that you collect data from the content authors at the time of token insertion, then use that data to build a view model and pass it to a view (cshtml).

Unique Aspects

When implementing a new view token, you extend the ViewAutoToken base class.  This is very similar to an AutoToken, except instead of implementing a method that renders the token’s raw HTML, you define two methods: one to generate the view model and one to determine the view.

		public override object GetModel(TokenDataCollection extraData)
		{
			return extraData;
		}

		public override string GetViewPath(TokenDataCollection extraData)
		{
			return "/views/myToken.cshtml";
		}

AutoToken Features

All features from AutoTokens are available for ViewAutoTokens as well, such as gathering data from the content authors when the token is applied (to be used during rendering) and restricting where the token may be used.

As usual with AutoTokens, you need only implement the class in a loaded assembly and TokenManager will pick it up and wire it up for use in RTEs.

Complete Example

	public class tokentest : ViewAutoToken
	{
		//Make sure you have a parameterless constructor.
		public tokentest() : base("test", "people/16x16/cubes_blue.png", "terkan")
		{
		}
		//This will add a button to the RTE.
		public override TokenButton TokenButton()
		{
			return new Data.TokenExtensions.TokenButton("test", "people/16x16/cubes_blue.png", 1000);
		}
		//These are the different fields that will be collected by the content authors at the time of insertion.
		public override IEnumerable<ITokenData> ExtraData()
		{
			yield return new GeneralLinkTokenData("LINK", "link", true);
			yield return new DroplistTokenData("Droplist", "droplist", true, new []
			{
				new KeyValuePair<string, string>("Text Label", "Value Passed"),
				new KeyValuePair<string, string>("Blue", "blue"),
			});
			yield return new BooleanTokenData("bool", "bool");
			yield return new IdTokenData("id", "id", true);
			yield return new IntegerTokenData("int", "int", true);
		}
		//These are the templates where the token may be used.
		public override IEnumerable<ID> ValidTemplates() {
			yield return new ID("{78816AC8-4FD7-43C4-A899-17829B4F3B72}");
		}
		//These are the root nodes that make a subtree where the token may be used.
		public override IEnumerable<ID> ValidParents()
		{
			yield return new ID("{A1E1342E-6836-4E20-A2C4-B1A38444B079}");
		}
		//Use the data gathered by the content author to assemble a view model.
		public override object GetModel(TokenDataCollection extraData)
		{
			return extraData;
		}
		//Use the data gathered by the content authors to define a path to the view cshtml.
		public override string GetViewPath(TokenDataCollection extraData)
		{
			return "/views/MyToken.cshtml";
		}
	}

And my view found at [webroot]/views/MyToken.cshtml

@using TokenManager.Data.TokenDataTypes.Support
@model TokenDataCollection
<div><strong>@Model.GetLink("link").Href</strong></div>
<div><strong>@Model.GetString("droplist")</strong></div>
<div><strong>@Model.GetBoolean("bool")</strong></div>
<div><strong>@Model.GetId("id")</strong></div>
<div><strong>@Model.GetInt("int")</strong></div>

Persistent Site and Lang Query string

I’ve always wondered why Sitecore’s default link provider doesn’t carry over site and language parameters.  Quite often I’ve found myself in a situation where a Sitecore site’s official site resolution relies on domain pattern matching.  This makes it difficult to test things on an authoring or development server without the proper DNS names.

There is, however, a solution with a few minor tweaks to the default link provider.  The logic is simple: if the current URL contains an sc_site or sc_lang query string parameter, then generate all links with these parameters too.

Enter the SiteStaticLinkProvider.

	public class SiteStaticLinkProvider : LinkProvider
	{
		public override string GetItemUrl(Item item, UrlOptions options)
		{
			string urlString = base.GetItemUrl(item, options);
			if (HttpContext.Current?.Request.QueryString == null)
				return urlString;
			string[] urlParts = urlString.Split('?');
			NameValueCollection qs = null;
			NameValueCollection currentqs = HttpContext.Current.Request.QueryString;
			if (!string.IsNullOrWhiteSpace(currentqs["sc_site"]))
			{
				qs = HttpUtility.ParseQueryString(urlParts.Length >= 2 ? urlParts[1] : "");
				if (string.IsNullOrWhiteSpace(qs["sc_site"]))
				{
					qs.Add("sc_site", currentqs["sc_site"]);
				}
			}
			if (!string.IsNullOrWhiteSpace(currentqs["sc_lang"]))
			{
				if (qs == null)
					qs = HttpUtility.ParseQueryString(urlParts.Length >= 2 ? urlParts[1] : "");
				if (string.IsNullOrWhiteSpace(qs["sc_lang"]))
				{
					qs.Add("sc_lang", currentqs["sc_lang"]);
				}
			}
			if (qs != null)
			{
				return urlParts[0] + '?' + qs;
			}
			return urlString;
		}
	}

This provider is a good all-purpose link provider, because if no pertinent parameters are present it does nothing.

The end result is that to test any site in a pre-prod environment you need only add the sc_lang or sc_site parameter once, and it will follow you around the site, making this very easy for content approvers.

Wire it up!

There are a few options available to override a link provider. You can add a new provider, then change the reference in the providers node to point to it. Slightly simpler, however, is to override the default sitecore provider outright, like I’ve done below.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
	<sitecore>
		<linkManager>
			<providers>
				<add name="sitecore">
					<patch:attribute name="type">[Namespace].SiteStaticLinkProvider, [Binary Name]</patch:attribute>
				</add>
			</providers>
		</linkManager>
	</sitecore>
</configuration>

Search PDF content in Sitecore

To people who have not tried to do this themselves, it seems like an easy task: all we need to do is get all the text content and load it into the search index. Initially I thought I had a good solution with PdfSharp, using code that I found in this Stack Overflow post.  It seemed to be working fine until I attempted to run my site on Azure.  It apparently uses lower-level OS-based API calls that are simply not available on Azure using the new Sitecore PaaS setup.

There are several paid libraries that claim to accomplish just this, but like most developers I wasn’t about to pitch buying a license to read PDF content to my clients. So the search continued.  After many hours (which I hope to save you from here), I came across a solution that did the trick (for the most part).

Reading PDF content

This code requires PdfSharp as a dependency; get it here on NuGet.

NOTE: this code was adapted from this Stack Overflow post and is not entirely my own, although I don’t think the Stack Overflow poster originated the code either.  Credit is due somewhere, but I’m not quite sure where.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using Sitecore.Data.Items;

namespace IHN.Feature.Component
{
	/// <summary>
	/// Adapted from code found here: http://stackoverflow.com/questions/83152/reading-pdf-documents-in-net
	/// </summary>
	public class SitecorePdfParser
	{
		private int _numberOfCharsToKeep = 15;
		private PdfDocument _doc;

		public SitecorePdfParser(Item item): this(new MediaItem(item))
		{
		}
		public SitecorePdfParser(MediaItem item)
		{
			if (item.MimeType != "application/pdf")
				return;
			Stream s = item.GetMediaStream();
			_doc = PdfReader.Open(s);
		}

		public SitecorePdfParser(PdfDocument document)
		{
			_doc = document;
		}

		public IEnumerable<string> ExtractText()
		{
			if (_doc == null)
				yield break;
			foreach (PdfPage page in _doc.Pages)
			{
				for (int index = 0; index < page.Contents.Elements.Count; index++)
				{

					PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(index).Stream;
					foreach (string text in ExtractTextFromPdfBytes(stream.Value))
					{
						yield return text;
					}
				}
			}
		}
		/// <summary>
		/// This method processes an uncompressed Adobe (text) object
		/// and extracts text.
		/// </summary>
		/// <param name="input">uncompressed content stream bytes</param>
		/// <returns>chunks of extracted text</returns>
		public IEnumerable<string> ExtractTextFromPdfBytes(byte[] input)
		{
			if (input == null || input.Length == 0) yield break;
			StringBuilder resultString = new StringBuilder();
			bool inTextObject = false;
			bool nextLiteral = false;
			int bracketDepth = 0;
			char[] previousCharacters = new char[_numberOfCharsToKeep];
			for (int j = 0; j < _numberOfCharsToKeep; j++)
				previousCharacters[j] = ' ';
			foreach (byte t in input)
			{
				char c = (char)t;
				if (inTextObject)
				{
					// Position the text
					if (bracketDepth == 0)
					{
						if (CheckToken(new[] { "TD", "Td" }, previousCharacters) ||
							CheckToken(new[] { "'", "T*", "\"" }, previousCharacters) ||
							CheckToken(new[] { "Tj" }, previousCharacters))
						{
							if (resultString.Length > 0)
							{
								yield return CleanupContent(resultString.ToString());
								resultString.Clear();
							}
						}
					}

					if (bracketDepth == 0 &&
						CheckToken(new string[] { "ET" }, previousCharacters))
					{
						inTextObject = false;
						if (resultString.Length > 0)
						{
							yield return CleanupContent(resultString.ToString());
							resultString.Clear();
						}
						continue;
					}

					if (c == '(' && bracketDepth == 0 && !nextLiteral)
					{
						bracketDepth = 1;
					}
					else if (c == ')' && bracketDepth == 1 && !nextLiteral)
					{
						bracketDepth = 0;
					}
					else if (bracketDepth == 1)
					{
						if (c == '\\' && !nextLiteral)
						{
							nextLiteral = true;
						}
						else
						{
							if (c == ' ')
							{
								if (resultString.Length > 0)
								{
									yield return CleanupContent(resultString.ToString());
									resultString.Clear();
								}
							}
							else if ((c >= '!' && c <= '~') ||
									 (c >= 128 && c < 255))
							{
								resultString.Append(c);
							}
							nextLiteral = false;
						}
					}
				}

				// Store the recent characters for
				// when we have to go back for a checking
				for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
				{
					previousCharacters[j] = previousCharacters[j + 1];
				}
				previousCharacters[_numberOfCharsToKeep - 1] = c;

				if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
				{
					inTextObject = true;
				}
			}
		}
		private string CleanupContent(string text)
		{
			string[] patterns = { @"\\\(", @"\\\)", @"\\226", @"\\222", @"\\223", @"\\224", @"\\340", @"\\342", @"\\344", @"\\300", @"\\302", @"\\304", @"\\351", @"\\350", @"\\352", @"\\353", @"\\311", @"\\310", @"\\312", @"\\313", @"\\362", @"\\364", @"\\366", @"\\322", @"\\324", @"\\326", @"\\354", @"\\356", @"\\357", @"\\314", @"\\316", @"\\317", @"\\347", @"\\307", @"\\371", @"\\373", @"\\374", @"\\331", @"\\333", @"\\334", @"\\256", @"\\231", @"\\253", @"\\273", @"\\251", @"\\221" };
			string[] replace = { "(", ")", "-", "'", "\"", "\"", "à", "â", "ä", "À", "Â", "Ä", "é", "è", "ê", "ë", "É", "È", "Ê", "Ë", "ò", "ô", "ö", "Ò", "Ô", "Ö", "ì", "î", "ï", "Ì", "Î", "Ï", "ç", "Ç", "ù", "û", "ü", "Ù", "Û", "Ü", "®", "™", "«", "»", "©", "'" };

			for (int i = 0; i < patterns.Length; i++)
			{
				string regExPattern = patterns[i];
				Regex regex = new Regex(regExPattern, RegexOptions.IgnoreCase);
				text = regex.Replace(text, replace[i]);
			}

			return text;
		}
		/// <summary>
		/// Check if one of the given tokens just came along (e.g. BT)
		/// </summary>
		/// <param name="tokens">the tokens to search for</param>
		/// <param name="recent">the recent character array</param>
		/// <returns>true if a token was matched</returns>
		private bool CheckToken(string[] tokens, char[] recent)
		{
			foreach (string token in tokens)
			{
				if (token.Length > 1)
				{
					if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
						(recent[_numberOfCharsToKeep - 2] == token[1]) &&
						((recent[_numberOfCharsToKeep - 1] == ' ') ||
						(recent[_numberOfCharsToKeep - 1] == 0x0d) ||
						(recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
						((recent[_numberOfCharsToKeep - 4] == ' ') ||
						(recent[_numberOfCharsToKeep - 4] == 0x0d) ||
						(recent[_numberOfCharsToKeep - 4] == 0x0a))
						)
					{
						return true;
					}
				}
				else
				{
					// A single-character token can't match the window check above;
					// skip it rather than returning false, which would also skip
					// any remaining multi-character tokens in the array.
					continue;
				}
			}
			return false;
		}
	}
}

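Before wiring this into Sitecore, you can exercise the parser on its own. Here's a minimal sketch of driving it against a PDF on disk; the file path is an assumption for illustration, and `PdfReader` comes from the PDFsharp library that the parser is built on:

```csharp
// Open a PDF in read-only mode and dump each extracted text fragment.
using (PdfDocument doc = PdfReader.Open(@"C:\temp\sample.pdf", PdfDocumentOpenMode.ReadOnly))
{
	var parser = new SitecorePdfParser(doc);
	foreach (string fragment in parser.ExtractText())
	{
		Console.WriteLine(fragment);
	}
}
```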
Then we need to wire this up to the index crawler so that the crawler uses this class to populate the search index with our PDF content.

We do this by implementing Sitecore's IComputedIndexField interface.

	public class IndexPdfContent : IComputedIndexField
	{
		// Required by IComputedIndexField; populated from the config below
		public string FieldName { get; set; }
		public string ReturnType { get; set; }

		public object ComputeFieldValue(IIndexable indexable)
		{
			try
			{
				var sitecoreIndexable = indexable as SitecoreIndexableItem;

				if (sitecoreIndexable == null) return null;

				var mediaItem = new MediaItem(sitecoreIndexable);

				// Only media items that are actually PDFs are worth parsing
				if (!string.Equals(mediaItem.Extension, "pdf", StringComparison.OrdinalIgnoreCase))
					return null;

				using (Stream mediaStream = mediaItem.GetMediaStream())
				{
					var document = PdfReader.Open(mediaStream, PdfDocumentOpenMode.ReadOnly);
					var pdfContent = new SitecorePdfParser(document).ExtractText().ToList();

					if (pdfContent.Count == 0) return null;

					return string.Join(" ", pdfContent);
				}
			}
			catch (Exception e)
			{
				Log.Error("Unable to assemble PDF content for the search index", e, this);
				return null;
			}
		}
	}

And finally wire it up to the indexer

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
	<sitecore>
		<contentSearch>
			<indexConfigurations>
				<defaultLuceneIndexConfiguration>
					<documentOptions>
						<fields hint="raw:AddComputedIndexField">
							<!-- indexes pdf contents into index _content field to allow PDF search -->
							<field fieldName="_pdfcontent" type="[NAMESPACE].IndexPdfContent, [DLL NAME]" />
						</fields>
					</documentOptions>
				</defaultLuceneIndexConfiguration>
				<defaultSolrIndexConfiguration>
					<documentOptions>
						<fields hint="raw:AddComputedIndexField">
							<!-- indexes pdf contents into index _content field to allow PDF search -->
							<field fieldName="_pdfcontent" type="[NAMESPACE].IndexPdfContent, [DLL NAME]" />
						</fields>
					</documentOptions>
				</defaultSolrIndexConfiguration>
				<defaultCloudIndexConfiguration>
					<documentOptions>
						<fields hint="raw:AddComputedIndexField">
							<!-- indexes pdf contents into index _content field to allow PDF search -->
							<field fieldName="pdf_content" cloudFieldName="pdf_content" type="[NAMESPACE].IndexPdfContent, [DLL NAME]" />
						</fields>
					</documentOptions>
				</defaultCloudIndexConfiguration>
			</indexConfigurations>
		</contentSearch>
	</sitecore>
</configuration>

End Results

Now we have our search index populated with PDF content. If someone wants to find a PDF via a text search, it's as simple as querying the index on the field assigned in the XML with the user's search text.
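Such a query can be sketched with Sitecore's ContentSearch LINQ API. The index name and the `PdfResultItem` class below are assumptions for illustration; the `IndexField` attribute maps the property to the `_pdfcontent` field configured above:

```csharp
// Result type exposing the computed field populated by IndexPdfContent
public class PdfResultItem : SearchResultItem
{
	[IndexField("_pdfcontent")]
	public string PdfContent { get; set; }
}

public IEnumerable<PdfResultItem> SearchPdfs(string searchText)
{
	// Index name is an assumption; use whichever index carries the computed field
	var index = ContentSearchManager.GetIndex("sitecore_web_index");
	using (IProviderSearchContext context = index.CreateSearchContext())
	{
		return context.GetQueryable<PdfResultItem>()
			.Where(i => i.PdfContent.Contains(searchText))
			.ToList();
	}
}
```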

Disclaimer

While this solution works well, it's not perfect. Text embedded in images inside a PDF won't be extracted; that would require OCR. Additionally, I've noticed that in rare cases words get broken up during extraction, presumably due to PDF formatting. If you happen to figure out how to resolve this completely, let me know and I'd love to update this code.