Jay Harris is Cpt. LoadTest - SEO

dasControls v1.0, a Twitter Status Macro for dasBlog

Jay Harris — Thu, 01 Oct 2009 02:33:55 GMT

If you have read my post on Misconceptions on JavaScript Plugins and SEO, you know that search engines don't do JavaScript. Though these plugins and libraries (such as one for pulling your latest Twitter Updates) are nice for adding dynamic content for your users, they are just end-user flare and add nothing to your SEO rankings. They also put an unnecessary tax on your users, as each client browser is responsible for independently retrieving the external content; the time for your page to render is extended by a few seconds as the client must first download the JS library then make the JSON/AJAX request for your content.

In response to this, I have created dasControls, a library of custom macros for dasBlog (the blogging engine that powers www.cptloadtest.com). I have started with content that is driven by custom JavaScript libraries and convert the content and data retrieval into server-side controls. For now, dasControls contains only a Twitter Status macro, but I intend to add more controls in the coming months.

dasControls [Build 1.0.0.0] : Download | Project Page

dasControls TwitterStatus Macro

The TwitterStatus macro uses server-side retrieval of your Twitter data, eliminating all client-side JavaScript calls for your tweets. By placing the Twitter request on the server, the data is also available to any search engines that index your page. Additionally, data is cached on the server, and new updates are retrieved based on the polling interval you specify. When using real-time client-side JavaScript calls, there is a 2-5 second delay for your end-users while the data is retrieved from Twitter; by caching the data on the local server, this delay is eliminated, and the content for each user is delivered from the local cache, lightening the load for the end-user while avoiding an undue burden for high-traffic sites.

Macro Name: TwitterStatus
Macro Syntax: <% TwitterStatus("user name"[, number of tweets[, polling interval]])|dasControls %>

User Name : String. Your Twitter handle.
Number of Tweets : Integer. The number of tweets to retrieve and display. [default: 10]
Polling Interval : Integer. The number of minutes between each Twitter retrieval. [default: 5]

Relevant CSS:

TwitterStatusItem : CSS class given to each Tweet, rendered as a DIV.
TwitterStatusTimestamp : CSS class given to each Tweet's timestamp ("32 minutes ago"), rendered as an inline SPAN within each Tweet element.

Using the Macro within a dasBlog Template

This macro is for use in the dasBlog HomeTemplate. The macro works just like any out-of-the box macro, except that you must also include the alias specified within dasControls entry the web.config (the value of the "macro" attribute). Your twitter handle is required, though you can also optionally include the number of Tweets to pull from Twitter (default: 10) and the number of minutes between each Twitter data request (default: 5). Because everything happens on the server, there is no need to include any of the Twitter JSON JavaScript libraries or HTML markup.

<% TwitterStatus("jayharris", 6, 5)|dasControls %>

Installation and Setup of dasControls

Download dasControls, extract the assembly into your dasBlog 'bin' directory.

dasControls [Build 1.0.0.0] : Download | Project Page

Enable Custom Macros within your dasBlog installation, and add the Twitter macro to your list of Custom Macros.
First, ensure that the <newtelligence.DasBlog.Macros> section exists within your web.config:

<newtelligence.DasBlog.Macros>
  <!-- Other Macro Libraries -->
</newtelligence.DasBlog.Macros>

Second, ensure that the Macros Configuration Section is defined in to your web.config <configSections>:

<configSections>
  <!—Other Configuration Sections -->
  <section requirePermission="false" name="newtelligence.DasBlog.Macros"
    type="newtelligence.DasBlog.Web.Core.MacroSectionHandler,
      newtelligence.DasBlog.Web.Core" />
</configSections>

Third, add the dasControls library entry to the dasBlog Macros section:

<newtelligence.DasBlog.Macros>
  <add macro="dasControls"
    type="HarrisDesigns.Controls.dasBlogControls.Macros,
      HarrisDesigns.Controls.dasBlogControls"/>
</newtelligence.DasBlog.Macros>

Roadmap for dasControls

In the upcoming weeks and months, I plan on adding additional macros to the dasControls library, including Delicious, Google Reader's Shared Items, and Facebook. If you're interested in any others, or have any ideas, please let me know.

Technorati Tags: dasBlog,dasControls,Open Source,CodePlex,SEO,Twitter

Misconceptions on JavaScript Plugins and SEO

Jay Harris — Mon, 31 Aug 2009 12:47:29 GMT

Search Engine Optimization is high on the radar, right now. Whether it be the quest for the first Coupon site in Bing, the highest Cosmetics site on Google, or the top-ranked "Jay Harris" on every search engine, the war is waged daily throughout the internet. For companies, it's the next sale. For people, it's the next job. Dollars are on the line in a never-ending battle for supremacy.

One of the contributing factors in your Search Engine Ranking is Content. Fresh, new content brings more search engine crawls. More crawls contributes to higher rankings. Search engines like sites that are constantly providing new content; it lets the engine know that the site is not dead or abandoned. And though this new content idea works out well for the New York Times and CNN, not everyone has a team of staff writers who are paid to constantly produce new content. So we shortcut. We don't so much have to have new content as long as we make Google think we have new content. There are hundreds if not thousands of JavaScript plugins out there to provide fresh content to our readers, ranging from Picasa photos, to Twitter updates, to AdWords, to Microsoft Gamercard tags. But I have to let you in on a little secret:

JavaScript Plugins do nothing for SEO.
Nothing.
Search engine spiders don't do JavaScript.

"This must be a lie. When I look at my site, I see my new photos, or my new tweets, or my new Achievement Points; why don't the spiders see it, too?" Well, it's true. Google Spiders, and most other Search Engine Spiders, don't do JavaScript, which is why JS provides no SEO contribution; spiders do not index what they do not see. A look through your traffic monitor, like Google Analytics, will often show a disparity between logged traffic and what is actually accounted for in Web Server logs. Analytics, a JavaScript-based traffic monitor, only logs about 40% of the total traffic to this site (excluding traffic to the RSS feed), which means that the other 60% of my visitors have JavaScript disabled. A JavaScript Disabled on 60% of all browsers seems like a ridiculously high percentage unless you consider that Spiders and Bots do not execute JavaScript.

Just like Google doesn't see the pretty layout from your stylesheet, Google also doesn't see the dynamic content from your JavaScript. Pulling down HTML, (since it is all just text, anyway) is easy; there's not even a lot of overhead associated with parsing that HTML. But add in some JavaScript, and suddenly there's a lot more effort involved in crawling your page, especially since there is a lot of bad JavaScript out there. So search engines just check what has been written into your HTML. They read the the URL, the keywords and META description, but only the content as rendered by the server. JavaScript is not touched, and JavaScript-based content is not indexed.

So how do you get around this? How do you get this SEO boost, since JavaScript isn't an available option?

Use plug-ins and utilities that pull your dynamic data server-side, rather than client-side. Create a custom WebControl that will download and parse your latest Twitter updates. Create a dasBlog macro to create your Microsoft Gamertag. By putting this responsibility on the server, not only will you make life easier on your end user (one less JavaScript library to download), but you also make this new content available to indexing engines, which can only help your Google Juice.

Update:

I've been working on a set of macros for dasBlog to start pulling my dynamic content retrievals to the server. Keep an eye out over the next couple of days for the release of my first macro, a Twitter Status dasBlog macro that will replace the need for the Twitter JS libraries on your site.

Technorati Tags: JavaScript,SEO,Twitter

Virtues and Villainy of Robots.txt

Jay Harris — Fri, 15 May 2009 13:31:37 GMT

You may have heard of Robots.txt. Or, you may have seen requests for /Robots.txt in your web traffic logs, and if the file doesn't exist, a related HTTP 404. But what is this Robot file, and what does it do?

Introduction to Robots.txt

When on a web server, Robots.txt is a file that directs Robots (a.k.a. Spiders or Web Crawlers) on which files and directories to ignore when indexing a site. The file is located on the root directory of the domain, and is typically used to hide areas of a site from search engine indexing, such as to keep a page off of Google's radar (such as my DasBlog login page) or if a page or image is not relevant to the traditional content of a site (maybe a mockup page for a CSS demo contains content about puppies, and you don't want to mislead potential audience). Robots request this file prior to indexing your site, and its absence indicates that the robot is free to index the entire domain. Also, note that each sub-domain uses a unique Robots.txt. When a spider is indexing msdn.microsoft.com, it won't look for the file on www.microsoft.com; MSDN will need its own copy of Robots.txt.

How do I make a Robots.txt?

Robots.txt is a simple text file. You can create it in Notepad, Word, Emacs, DOS Edit, or your favorite text editor. Also, the file belongs in the root of the domain on your web server.

Allow all robots to access everything:

The most basic file will be to authorize all robots to index the entire site. The asterisk [*] for User Agent indicates that the rule applies to all robots, and by leaving the value of Disallow blank rather than including a path, it effectively disallows nothing and allows everything.

# Allow all robots to access everything
User-agent: *
Disallow:

Block all robots from accessing anything:

Conversely, with only one more character, we can invert the entire file and block everything. By setting Disallow to a root slash, every file and directory stemming from the root (in other words, the entire site) will be blocked from robot indexing.

# Block all robots from accessing anything
User-agent: *
Disallow: /

Allow all robots to index everything except scripts, logs, images, and that CSS demo on Puppies:

Disallow is a partial-match string; setting Disallow to "image" would match both /images/ and /imageHtmlTagDemo.html. Disallow can also be included multiple times with different values to disallow a robot from multiple files and directories.

# Block all robots from accessing scripts, logs,
#    images, and that CSS demo on Puppies
User-agent: *
Disallow: /images/
Disallow: /logs/
Disallow: /scripts/
Disallow: /demos/cssDemo/puppies.html

Block all robots from accessing anything, except Google, which is only blocked from images:

Just as a browser has a user agent, so does a robot. For example, "Googlebot/2.1 (http://www.google.com/bot.html)", is one of the user agents for Google's indexer. Like Disallow, the User-agent value in Robots.txt is a partial-match string, so simply setting the value to "Googlebot" is sufficient for a match. Also, the User-agent and Disallow entries cascade, with the most specific User Agent setting is the one that is recognized.

# Block all robots from accessing anything,
#    except Google, which is only blocked from images
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /images/

Shortcomings of Robots.txt

Similar to the Code of the Order of the Brethren, Robots.txt "is more what you'd call 'guidelines' than actual rules." Robots.txt is not a standardized protocol, nor is it a requirement. Only the "honorable" robots such as the Google or Yahoo search spiders adhere to the file's instructions; other less-honorable bots, such as a spam spider searching for email addresses, largely ignore the file.

Also, do not use the file for access control. Robots.txt is just a suggestion for search indexing, and will by no means block requests to a disallowed directory of file. These disallowed URLs are still freely available to anyone on the web. Additionally, the contents of this file can be used to against you, as it the items you place in it may indicate areas of the site that are intended to be secret or private; this information could be used to prioritize candidates for a malicious attack with disallowed pages being the first places to target.

Finally, this file must be located in the root of the domain: www.mydomain.com/robots.txt. If your site is in a sub-folder from the domain, such as www.mydomain.com/~username/, the file must still be on the root of the domain, and you may need to speak with your webmaster to get your modifications added to the file.

Other Resources:

The Web Robots Pages : http://www.robotstxt.org
Wikipedia : http://en.wikipedia.org/wiki/Robots.txt

Technorati Tags: SEO,Robots.txt

URL Rewrite, Part 3: Improving SEO and the 'www' subdomain

Jay Harris — Thu, 04 Dec 2008 21:43:10 GMT

Did you know that yourdomain.com and www.yourdomain.com are actually different sites? Are they both serving the same content? If so, it may be negatively impacting your search engine rankings.

Subdomains and the Synonymous 'WWW'

Sub-domains are the prefix to a domain (http://subdomain.yourdomain.com), and are treated by browsers, computers, domain name systems (DNS), search engines, and the general internet as separate, individual web sites. Google's primary web presence, http://www.google.com, is very different than Google Mail, http://mail.google.com, or Google Documents, http://docs.google.com, all because of subdomains. However, what many do not realize is that www is, itself, a subdomain.

A domain, on its own, requires no www prefix; a subdomain-less http://yourdomain.com should be sufficient for serving up a web site. And since www is a subdomain, dropping the prefix could potentially return a different response. There are some sites that will fail to return without the prefix, and some sites that fail with it, but the most common practice is that the www subdomain is synonymous for no subdomain at all.

The Synonymous WWW and SEO

The issue with having two synonymous URLs (http://yourdomain.com and http://www.yourdomain.com) is that search engines may interpret them as separate sites, even if they are serving the same content. The two addresses are technically independent and are potentially serving unique content; to a cautious search engine, even if pages appear to contain the same content, there may be something different under the covers. This means your audience's search results returns two entries for the same content. Some users will happen to click on yourdomain.com while others navigate to www.yourdomain.com, splitting your traffic, your page hits, your search ranking between two sites, unnecessarily.

HTTP Redirects will cure the issue. If you access http://google.com, your browser is instantly redirected to http://www.google.com. This is done through a HTTP 301 permanent redirect. Search Spiders recognize HTTP response codes, and understand the 301 as a "use this other URL instead" command. Many search engines, such as Google, will then update all page entries for the original URI (http://yourdomain.com) and replace it with the 301's destination URL (http://www.yourdomain.com). If there is already an entry for the destination URL, the two entries will be merged together. The search entries for yourdomain.com and www.yourdomain.com will now share traffic, share page hits, and share search ranking. Instead of having two entries on the second and third pages of search results, combining these entries may be just enough to place you on the first page of results.

In addition to combining search entries for subdomains, you can also combine root-level domains through HTTP 301. On this site, in addition to adding the www prefix if no subdomain is specified, captainloadtest.com will HTTP 301 redirect to www.cptloadtest.com.

Combining the Synonyms

We need a way to implement an HTTP 301 redirect at the domain level for all requests to a site; however, often we are using applications that may not grant us access to the source, or we don't have the access into IIS through our host to set up redirects for ourselves. URL Rewrite, Part 2 covers a great drop-in redirect module by Fritz Onion that uses a stand-alone assembly with a few additions in web.config to HTTP 301 redirect paths in your domain (it also supports HTTP 302 redirects). This module is perfect for converting a WordPress blog post URL, such as cptloadtest.com/?p=56, to a DasBlog blog post URL like cptloadtest.com/2006/05/31/VSNetMacroCollapseAll.aspx. However, to redirect domains and subdomains, the module must go a step further and redirect based on matches against the entire URL, such as directing http:// to https:// or captainloadtest.com to cptloadtest.com, which it does not support. It's time for some modifications.

private void OnBeginRequest(object src, EventArgs e) {
  HttpApplication app = src as HttpApplication;
  string reqUrl = app.Request.Url.AbsoluteUri;
  redirections redirs
    = (redirections) ConfigurationManager.GetSection("redirections");

  foreach (Add a in redirs.Adds) {
    Regex regex = new Regex(a.targetUrl, RegexOptions.IgnoreCase);
    if (regex.IsMatch(reqUrl)) {
      string targetUrl = regex.Replace(reqUrl, a.destinationUrl, 1);

      if (a.permanent) {
        app.Response.StatusCode = 301; // make a permanent redirect
        app.Response.AddHeader("Location", targetUrl);
        app.Response.End();
      }
      else
        app.Response.Redirect(targetUrl);

      break;
    }    
  }
}

By converting app.Request.RawURL to app.Request.AbsoluteUri, the regular expression will now match against the entire URL, rather than just the requested path. There is one downside to this change: the value is the actual path processed, not necessarily what was in the originally requested URL. To this effect, the value of AbsoluteUri for requesting http://www.cptloadtest.com?p=56 is actually http://www.cptloadtest.com/default.aspx?p=56; by requesting the root directory, the default page is being processed, not the directory itself, so default.aspx is added to the URL. Keep this in mind when setting up your redirection rules. Also, the original code converted the URL to lower case; with my modifications, I chose to maintain the case of the URL, since sometimes case matters, and instead ignore case in the regular expression match using RegexOptions.IgnoreCase. Finally, I made some other minor enhancements, like using the ConfigurationManager, since ConfigurationSettings is now obsolete, and reusing the matching Regex instance for replacements.

Download: RedirectModule.zip

Includes:

Source code for the drop-in Redirect Module

Sample web.config that uses the module

Compiled version of redirectmodule.dll

The code is based on the original Redirect Module by Fritz Onion and the Xml Serializer Section Handler by Craig Andera. As always, this code is provided with no warranties or guarantees. Use at your own risk. Your mileage may vary. Thanks to Fritz Onion for the original work, and allowing me extend his code further.

The usage is the same as Fritz Onion's original module. Drop the assembly into your site's bin, and place a few lines into the web.config. The example below contains the rules as they would apply to this site, 301 redirecting http://www.captainloadtest.com to http://www.cptloadtest.com, and adding the www subdomain to any domain requests that have no subdomain.

<?xml version="1.0"?>
<configuration>
  <configSections>
    <section name="redirections"
      type="Pluralsight.Website.XmlSerializerSectionHandler, redirectmodule" />
  </configSections>
  <!-- Redirect Rules -->
  <redirections type="Pluralsight.Website.redirections, redirectmodule">
    <!-- Domain Redirects //-->
    <add targetUrl="captainloadtest\.com/Default\.aspx"
      destinationUrl="cptloadtest.com/" permanent="true" />
    <add targetUrl="captainloadtest\.com"
      destinationUrl="cptloadtest.com" permanent="true" />

    <!-- Add 'WWW' to the domain request //-->
    <add targetUrl="://cptloadtest\.com/Default\.aspx"
      destinationUrl="://www.$1.com/" permanent="true" />
    <add targetUrl="://cptloadtest\.com"
      destinationUrl="://www.$1.com" permanent="true" />

    <!-- ...More Redirects -->
  </redirections>
  <system.web>
    <httpModules>
      <add name="RedirectModule"
        type="Pluralsight.Website.RedirectModule, redirectmodule" />
    </httpModules>
  </system.web>
</configuration>

The component is easy to use, and can redirect your site traffic to any URL you choose. Neither code changes to the application nor configuration changes to IIS are needed. By using this module to combine synonymous versions of your URLs, such as alternate domains or subdomains, you will improve your page ranking through combining duplicate search result entries. One more step towards your own search engine optimization goals.

URL Rewrite

Part 1: HTTP 302, Temporarily Relocated
Part 2: HTTP 301, Moved Permanently
Part 3: Improving SEO and the 'www' subdomain

Technorati Tags: SEO,Search Engine Optimization,Page Rank,HTTP 301,ASP.NET,Fritz Onion