SEO Tutorial | Understanding Site Architecture with Xenu and Excel

One of the often under-considered factors in a strong SEO strategy is the site’s architecture and its effect on how PageRank flows through the site. We know that PageRank dissipates as we get further from the home page, but our conceptualization of page depth is clouded by the modern website’s sidebars, dynamically updating widgets, site-wide footers, and more. The difficulty of assessing a site’s architecture is compounded on very large sites with rich histories, especially for an SEO working at an agency, where there is a need to develop a strong understanding of a new client’s site relatively quickly.

Aside from opening Firefox, disabling JavaScript, setting the user agent to Googlebot, and clicking around starting from the home page, there are a few things an SEO can do to quickly get a sense of how a search engine might crawl and discover deep pages. Let’s discuss a few of these options:

Google Webmaster Tools

One such tool, which never fails to disappoint, is the Webmaster Tools ‘Internal Links’ report. Within this report we can see which pages link to which, and get a count of internal links to any individual page. Unfortunately, the data is sampled (and often misleading) and does not report page depth.

Google Webmaster Tools Internal Link Report

Black Widow

Black Widow by SoftBytes will crawl through your site discovering new pages link by link. The biggest benefit of Black Widow is the Windows Explorer-like visualization of site architecture:

SoftBytes' Black Widow offers a nice site architecture visualization

Xenu

Xenu is a great tool for discovering broken links by doing a full site crawl, much like Black Widow. Additionally, Xenu reports “LEVEL” and “LINKS IN”, which are particularly useful for developing an understanding of site architecture. After running a full crawl, I like to import the results into Microsoft Excel and do some quick manipulation to rate the internal link juice that flows into each page. I have found that this gives me a quick sense of the site’s architecture in a pretty painless and repeatable manner. This process is what I’ll be detailing in this blog post:

Step 1: Run Your Crawl

We won’t need anything but links to internal pages crawled. You can speed up a crawl significantly by allowing Xenu to crawl only the pages that matter. That is, if you’d like to crawl only the www subdomain, specify that in Xenu, as crawling the root domain could take a lot more time.

Step 2: Import and Clean up Tab Separated File into Excel

After importing, remove all but the “Address”, “Type”, “Level”, and “Links In” columns. Next, we’ll remove all of the non-html pages by deleting all addresses that do not have a “Type” of text/html.

filter with excel
Use the above filter to show only non-text/html entries, then delete them all. Once the filter is removed, we’re left with just html pages.

Once we’ve done this we can delete the “Type” column, leaving us with “Address”, “Level” and “Links In”.

Because of various crawling oddities, many sites will include odd level counts. Unless you’re working with a MASSIVE site, most normal pages will not have a level higher than 10. Sort by level, and find that point where the levels begin to jump and/or non-important pages are crawled, and remove all thereafter.
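
If you’d rather script this clean-up than do it by hand, the same steps can be sketched in a few lines of Python with pandas. This is only a rough equivalent, not part of the Excel workflow itself; the file name and the exact column headers (“Address”, “Type”, “Level”, “Links In”) are assumptions and may differ depending on your Xenu version and export settings.

import pandas as pd

# Load the Xenu tab-separated export (file name and headers are assumptions;
# adjust them to match your own export).
df = pd.read_csv("xenu_export.txt", sep="\t", encoding="latin-1")

# Keep only the columns we care about.
df = df[["Address", "Type", "Level", "Links In"]]

# Keep only HTML pages (startswith also catches "text/html; charset=utf-8"),
# then drop the Type column.
df = df[df["Type"].astype(str).str.startswith("text/html")].drop(columns="Type")

# Trim crawl oddities: on most sites, normal pages sit at level 10 or shallower.
df = df[df["Level"] <= 10].sort_values("Level")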

Step 3: Assign a “Level” and “Links In” Score

Knowing that more PageRank flows to pages closer to zero (the home page), I use the following formula to score “Level”:

=1-(Table1[[#This Row],[Level]])/(MAX([Level])+AVERAGE([Level]))

The home page (level zero) will receive a score of 1, all of the level one pages will receive a score that is a fraction of 1, level two will be scored less than level one, and so on.

I score the “Links In” column using the following formula:

=Table1[[#This Row],[Links In]]/MAX([Links In])

This formula works similarly in that the strongest score is 1, and lower “Links In” counts will be a fraction of 1.
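
For reference, here is the same scoring expressed against the pandas sketch from Step 2, assuming the cleaned “Level” and “Links In” columns from above; it simply mirrors the two Excel formulas.

# Level score: 1 for the home page (level 0), shrinking as pages get deeper.
df["Level Score"] = 1 - df["Level"] / (df["Level"].max() + df["Level"].mean())

# Links In score: 1 for the most-linked page, a fraction of 1 for everything else.
df["Links In Score"] = df["Links In"] / df["Links In"].max()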

Scoring Internal Links with Xenu and Excel
Your Excel table should look something like this

Step 4: Rank Your Pages

Once we have our scores, we can add them together and/or use the RANK formula to get a quick reference number.
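
Continuing the same pandas sketch, the total score and an Excel-RANK-style reference number (rank 1 = strongest internal page) might look like this:

# Total score and rank; rank 1 goes to the highest total score,
# mirroring Excel's RANK over the summed scores.
df["Total Score"] = df["Level Score"] + df["Links In Score"]
df["Rank"] = df["Total Score"].rank(ascending=False, method="min").astype(int)

# Sort so the strongest pages appear first.
df = df.sort_values("Rank")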

Excel Final Table
A higher total score or a lower rank indicates higher internal importance

Utility and Caveats

There are some obvious issues and shortcomings with this method of scoring internal pages. The most obvious is that external link weight isn’t factored into the formula. It’s important to understand that our score is based on internal weight alone.

I have found, however, that it can be quite useful to have early in the life of a project as a reference. For instance, as I’m auditing a new site I can copy the URL of a page in question and do a quick CTRL+F in my Excel score sheet to get a quick feel for a page’s internal “importance”. Another great use would be to compare these scores with other KPIs, such as conversion rate or organic traffic. If you’ve got a page that converts like crazy, but has a poor internal link score, perhaps it should be moved closer to the home page, or linked to from more internal pages.

What helps you visualize site architecture? Let me know in the comments or on Twitter, @MikeCP.

Google Will Rank Shorter Content…If It’s Good

Do you find yourself spending a lot of time trying to pad out your Web pages with more words just so they will be ranked by Google or other search engines? Well, you might be glad to know you can stop that practice.

 

A common misconception about search engine optimization (SEO) is that a page must have at least 500 words (or some other arbitrary number) in order to even be considered to rank on the search engine results pages (SERPs), but Google’s John Mueller has recently stepped in to squash this popular, albeit incorrect, theory.

Mueller’s testimony comes from a thread on Google’s Webmaster Help forum entitled, “Is Short Content = Thin Content?” The Google employee stepped in to assure users that “Googlebot doesn’t just count words on a page or in an article.”

According to his post, Google’s focus is on finding and sharing “useful & compelling” content, which even shorter articles or bursts of content (such as tweets) are able to provide. This means that there isn’t a specific number of words or characters that automatically qualify a Web page for ranking consideration, but rather quality content.

That being said, one way to help get shorter articles noticed is to use them to generate discussions, as allowing users to share comments on an article is an easy way to include additional information on a page that doesn’t actually require any extra work. On occasion, this can be especially useful because “sometimes users are looking for discussions like that in search.”

Mueller wrapped up his post by reiterating that the best way to get ranked is to create truly unique, high-quality content, rather than material that is simply rewritten or autogenerated.

Originally published by Website Magazine

Google wants to transform the words that appear on a page into entities that mean something

Biological network analysis (Social Signals)

With the recent explosion of publicly available high-throughput biological data, the analysis of molecular networks has gained significant interest. This type of analysis is closely related to social network analysis, but often focuses on local patterns in the network. For example, network motifs are small subgraphs that are over-represented in the network. Similarly, activity motifs are patterns in the attributes of nodes and edges in the network that are over-represented given the network structure.

PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of “measuring” its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).

The name “PageRank” is a trademark of Google, and the PageRank process has been patented (U.S. Patent 6,285,999). However, the patent is assigned to Stanford University and not to Google. Google has exclusive license rights on the patent from Stanford University. The university received 1.8 million shares of Google in exchange for use of the patent; the shares were sold in 2005 for $336 million.

Anchor

An anchor hyperlink is a link bound to a portion of a document—generally text, though not necessarily. For instance, it may also be a hot area in an image (image map in HTML), a designated, often irregular part of an image. One way to define it is by a list of coordinates that indicate its boundaries. For example, a political map of Africa may have each country hyperlinked to further information about that country. A separate invisible hot area interface allows for swapping skins or labels within the linked hot areas without repetitive embedding of links in the various skin elements.

Google Penguin is a code name for a Google algorithm update that was first announced on April 24, 2012. The update is aimed at decreasing search engine rankings of websites that violate Google’s Webmaster Guidelines by using black-hat SEO techniques, such as keyword stuffing, cloaking, participating in link schemes, deliberate creation of duplicate content, and others.

Penguin’s effect on Google search results

By Google’s estimates, Penguin affects approximately 3.1% of search queries in English, about 3% of queries in languages like German, Chinese, and Arabic, and an even bigger percentage of them in “highly spammed” languages. On May 25, 2012, Google unveiled the latest Penguin update, called Penguin 1.1. This update, according to Matt Cutts, was supposed to impact less than one-tenth of a percent of English searches. The guiding principle for the update was to penalise websites using manipulative techniques to achieve high rankings.


PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.

A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a “50% chance” of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.

Simplified algorithm

Assume a small universe of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored. PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial PageRank of 1. However, later versions of PageRank, and the remainder of this section, assume a probability distribution between 0 and 1. Hence the initial value for each page is 0.25.

The PageRank transferred from a given page to the targets of its outbound links upon the next iteration is divided equally among all outbound links.

If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.

PR(A) = PR(B) + PR(C) + PR(D)

Suppose instead that page B had a link to pages C and A, while page D had links to all three pages. Thus, upon the next iteration, page B would transfer half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C. Since D had three outbound links, it would transfer one third of its existing value, or approximately 0.083, to A.

PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}

In other words, the PageRank conferred by an outbound link is equal to the document’s own PageRank score divided by the number of outbound links L(·) of the linking document.

PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}

where B_u is the set of pages linking to u and L(v) is the number of outbound links from page v.
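
To make the four-page example concrete, here is a minimal Python sketch of this simplified, damping-free iteration. The link structure is taken from the example above (B links to A and C, C links to A, D links to A, B and C); page A’s outbound links are not specified in the text, so they are left empty here as an assumption.

# Simplified PageRank iteration: PR(u) = sum over v in B_u of PR(v) / L(v)
links = {
    "A": [],                # A's outbound links are not given in the example
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

# Initial values: a probability distribution, 0.25 per page.
pr = {page: 0.25 for page in links}

def iterate(pr, links):
    new_pr = {page: 0.0 for page in pr}
    for source, targets in links.items():
        if not targets:
            continue  # this simplified form ignores dangling pages
        share = pr[source] / len(targets)  # PR(v) / L(v)
        for target in targets:
            new_pr[target] += share
    return new_pr

pr = iterate(pr, links)
print(pr["A"])  # 0.125 + 0.25 + 0.0833... ≈ 0.458 = PR(B)/2 + PR(C)/1 + PR(D)/3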

Why Does SEO Take So Long?

Search marketing’s two halves are so similar and yet so different. A new pay-per-click advertising campaign can be set up and pushed live, and data starts to roll in that same day. Within a week there may be enough data to analyze performance, tweak and iterate the optimization of the campaign. Why does search engine optimization take so long to mature and impact performance if paid search is so quick? And what does this mean for the back to school or fall and holiday seasons that seem so far away?

In my agency life I’m asked this question frequently — followed by a query as to what to do to speed up the process. Let’s start with why search engine optimization takes longer to mature. The reasons have partly to do with education, business process, creative and development time, and the time it takes for the search engines to do their thing. All told, the process from kicking off a major SEO project with a thorough site analysis and detailed recommendations to actual implementation and SEO performance can take 6 to 12 months. If a business starts the process today, it will be lucky to see results by December.
SEO Is a Lengthy Process

The first step in the SEO process is the analysis of the site and development of recommendations detailed enough that the marketing and technical people who will need to implement them understand the issue, why it’s an issue and how to resolve it. An agency will usually take a month to do this, while an experienced in-house person focused solely on the project could probably accomplish it in two weeks.

Next, the recommendations need to be sold into the organization and prioritized against all the other marketing and development priorities. Depending on how well oiled a company’s business processes are, this can take a week or several months. Primarily, the trouble at this step tends to revolve around the difficulty in assigning a return-on-investment analysis to individual aspects of an SEO project. Using keyword research and web analytics, it’s possible to project the value of the SEO project as a whole, as described in “SEO: Estimating Sales Potential from Keywords and Phrases,” my previous article on that topic. But that estimate is for the whole project. What’s the value of 301 redirects? Or building 100 quality links? There’s no way to logically or even semi-accurately estimate the pieces of the project individually because they’re so tied together.
Assign SEO Priorities

The individual pieces of the SEO recommendations are often compared against other projects that have a more solidly predictable ROI attached to them. Frustrated SEO professionals can find their critical projects stuck in this loop for months, seeing little progress while feeling the pressure to produce results. The only way out of this conundrum I’ve found is to identify the top five or so absolutely critical aspects of the SEO recommendations, the pieces that all the rest build upon, and put those forward as the initial SEO project with the full ROI assigned to it. Yes, there will be additional work after those five critical elements get done, but they will be the recommendations that can be done more piecemeal for smaller incremental benefit with each implementation.

After the prioritization, the actual development work still needs to be done. This may be creative work, developing content for the site or link bait to encourage other sites to link and share interesting content. Or it may be platform-related development. Regardless, depending on the scope of the process, the creation and testing steps can take from a week to several months. At the end of this step, the SEO projects are ready to launch and the process is now out of the SEO professional’s hands.
Search Engine ‘Bots

Enter the search engine crawlers. Once the content or platform changes are implemented, the crawlers need to discover the changes, figure out what they mean, and rank the pages algorithmically against the other trillions of pages in the index. If a site doesn’t get crawled often, waiting to get crawled could take a month or more.

To see when a site was last crawled, check Google’s cache date by Googling, for example, “cache:www.practicalecommerce.com.” The resulting page will be a snapshot of the page the last time that Google crawled Practical eCommerce, with a date at the top in the gray field. In this case we see, “This is Google’s cache of http://www.practicalecommerce.com/. It is a snapshot of the page as it appeared on May 25, 2012 13:33:07 GMT. The current page could have changed in the meantime. Learn more.” Consequently we know that Google crawled the site today, and by checking the cache date tomorrow and the next day until the date changes we can see how often it gets crawled. If the cache date is a month past or older, the new content likely won’t be discovered for a while.
How Can SEO Be Sped Up?


In this process of moving from analysis to recommendation to prioritization to creation to implementation to search engine algorithmic analysis to performance, the pieces a business can most affect are the first steps. The engines will take the time they need. If a business wants to perform faster in organic search, it needs to put more resources and more priority on SEO projects. Most businesses frustrated by lengthy SEO project timelines are standing in their own way. Those businesses should get out of the way by bending the ROI rules for SEO so that it can get prioritized for the resources it needs. I’m not saying that SEO should get a free pass, but give it the benefit of the doubt, knowing that a hard or even soft ROI will be impossible to calculate for bits of the SEO project versus the whole project. Understanding this, give the SEO professional creative resources to help with content creation and technical resources to address server-level and platform-related issues.

SEO recommendations will have zero impact on the business’s traffic and sales until they escape the spreadsheet and are implemented on the site. All that wheel spinning just wastes internal resources debating and arguing while accomplishing no additional benefit to the bottom line.

Once the recommendations have been implemented, there are a couple of ways to speed up the search engines’ initial crawling. Submitting an updated XML sitemap to Google’s and Bing’s webmaster tools will prompt a fresh crawl of the site. Google also offers the ability to fetch a page as Googlebot, which can get individual URLs crawled immediately. These steps will speed up the time to crawl, but there is no way to speed up the time it takes the engines to re-evaluate and re-rank the content algorithmically. It will likely take several crawls of the new content while the engines determine if the change was temporary or a new tasty permanent addition to the site. Again, if a site is crawled frequently, this process will be faster. Sites that are frequently updated with fresh content or have more links are more likely to be crawled more frequently.
What About the Hot Selling Seasons?

It’s not too late to focus on the back-to-school and holiday selling seasons if you start now and are truly committed to organic search as a priority. Plan for three months for the search engines to do their part of the process and back out the timeline from there.

If the target performance date for organic search is September 1, the important SEO projects need to be live June 1. If the target performance date is November 1, the important SEO projects need to be live August 1. If the site gets crawled daily or several times a week, the three-month search engine algorithmic analysis lead time could be shortened to one or two months. However, I recommend planning conservatively to avoid being caught too late watching the rankings, traffic and sales go to the competition.