My current lab environment has the following:

2 Web Front Ends sitting behind 2 clustered load balancers. The Query role is also set up within this layer, so the index is written onto these servers.

2 Application Servers, each configured with 1 crawler.

The load balancers as well as the SharePoint servers are, for this instance, running off a single ESXi host. The reason I mention it here is that the content being crawled is huge, and I had to tweak the setup to get NO ERRORS πŸ™‚

Some of the errors that I noticed with this setup:

 

Error-1

“An unrecognized HTTP response was received when attempting to crawl this item. Verify whether the item can be accessed using your browser.”

Across all my content sources. Not a good sign.

 

Error-2

“Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has “Full Read” permissions on the SharePoint Web Application being crawled. ( Error from SharePoint site: HttpStatusCode Unauthorized The request failed with HTTP status 401: Unauthorized. )”

 

Error-3

“The secure sockets layer (SSL) certificate sent by the server was invalid and this item will not be crawled.”

 

Error-4

“Access is denied. Verify that either the Default Content Access Account has access to this repository, or add a crawl rule to crawl this repository. If the repository being crawled is a SharePoint repository, verify that the account you are using has “Full Read” permissions on the SharePoint Web Application being crawled. ( Error from SharePoint site: HttpStatusCode Unauthorized The request failed with HTTP status 401: Unauthorized. )”

 

The last error was one of the most annoying ones I had, because the other 3 content sources were all having successful full and incremental crawls; the exception was my SSL-secured Anonymous site, which is an extended site of another content source.

Weird.

 

Typical things to check for

  1. Check that your crawl account has Full Read permission to access your content sources (a scripted way to grant this is sketched just after this list).
  2. Disable Loopback Check –
    New-ItemProperty HKLM:\System\CurrentControlSet\Control\Lsa -Name "DisableLoopbackCheck" -value "1" -PropertyType dword
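For the first item, one way to grant Full Read is through a User Policy on the web application (Central Administration > Manage web applications > User Policy). The same thing can be scripted from the SharePoint Management Shell; this is only a rough sketch, and the web application URL and crawl account below are placeholders for your own values:

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    # Placeholder values - swap in your own web application URL and crawl account
    $wa = Get-SPWebApplication "https://intranet.contoso.lab"
    $policy = $wa.Policies.Add("CONTOSO\svc_crawl", "Search Crawl Account")
    # Bind the built-in Full Read policy role to the crawl account
    $fullRead = $wa.PolicyRoles.GetSpecialRole([Microsoft.SharePoint.Administration.SPPolicyRoleType]::FullRead)
    $policy.PolicyRoleBindings.Add($fullRead)
    $wa.Update()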

 

Fixes

All my content sources use HTTPS. That would mean all 4 content sources would need either 4 different IP addresses, a wildcard certificate, or, since it’s a lab, just one standard self-signed certificate. With the self-signed certificate you are bound to hit the certificate error (Error-3 above).

Search out of the box doesn’t know how to get past this problem, but it can be configured to skip the SSL error.

Out of the box, the Ignore SSL warnings setting under Farm Search Administration is set to No.

Click on Ignore SSL warnings and change the setting to Yes.
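If you prefer scripting it, the same farm-level setting can be flipped with Set-SPEnterpriseSearchService from the SharePoint Management Shell on a farm server:

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    # Farm-level crawl setting: ignore SSL certificate warnings (handy with self-signed lab certs)
    Set-SPEnterpriseSearchService -IgnoreSSLWarnings $true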

As mentioned before, the setup uses a cluster of load balancers. So when either of the crawlers starts crawling content from one of the APP servers, it naturally resolves the DNS name, which points at the load balancer cluster. #fail πŸ™‚

The fix is to have a local HOSTS file entry on all your crawl servers, which in most scenarios are your application-layer servers.
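For example, on each crawl server the entries go into C:\Windows\System32\drivers\etc\hosts and point the crawled host names at a specific WFE (or at the local box), bypassing the load balancer VIP. The host names and IP address below are made-up lab values; substitute your own:

    # Run on each crawl server. 192.168.1.21 = one of the WFEs (placeholder value)
    $hostsFile = "$env:windir\System32\drivers\etc\hosts"
    Add-Content -Path $hostsFile -Value "192.168.1.21`tintranet.contoso.lab"
    Add-Content -Path $hostsFile -Value "192.168.1.21`tanonymous.contoso.lab"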

 

In a farm environment with load balancers, networking equipment that goes to sleep πŸ™‚, DNS replication, and every sort of complication under the sun you can think of, you may want to increase your time-out period under Farm Search Administration.
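Those time-outs can also be raised from the SharePoint Management Shell instead of the Farm Search Administration page; the 120-second values below are just an example, not a recommendation:

    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    # Raise the crawler's connection and request acknowledgement time-outs (in seconds)
    Set-SPEnterpriseSearchService -ConnectionTimeout 120 -AcknowledgementTimeout 120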

 

Peace!