Wednesday, August 25, 2010

Could not load the DLL xpstar90.dll, or one of the DLLs it references. Reason: 126(The specified module could not be found.)

Hi Team,
This issue was faced by someone outside IBM, but my main intention here is to show the benefit of another nice tool: Dependency Walker (http://www.dependencywalker.com/).

Issue :
SQL Server Agent failed to start after the service account password was reset at the AD level.

Error(s) :
In the event log you will see these errors in sequence :

Description:
Could not load the DLL xpstar90.dll, or one of the DLLs it references. Reason: 126(The specified module could not be found.).

Description:
Failed to retrieve SQLPath for syssubsystems population.

Description:
SQLServerAgent could not be started (reason: Failed to load any subsystems. Check errorlog for details.).

The first error is the real one. The other two are consequences of it, so we need not worry about them.

Troubleshooting and Resolution :
The error clearly says that there is a problem either with xpstar90.dll or with one of the other DLLs that it references.
I first tried to re-register xpstar90.dll by using regsvr32 xpstar90.dll and got this message:

xpstar90.dll was loaded, but the DllRegisterServer entry point was not found.

I have heard that some DLLs are registered in a different way, so this message alone did not lead me to conclude that the file was corrupt. (The message only means that the DLL does not export DllRegisterServer, that is, it is not a self-registering COM DLL.)
I was also thinking that some other DLL that this DLL refers to might have been corrupted.
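
The load failure itself can also be reproduced outside SQL Server Agent. Below is a minimal Python sketch, assuming a Windows machine with a Python build of the same bitness as the DLL. The helper names explain_load_error and try_load are my own (hypothetical), not part of any SQL Server tooling. It only tells you which Win32 error the loader returns. It does not tell you which dependency is the culprit, and that is where Dependency Walker helps.

```python
import ctypes
import sys

# Win32 error codes commonly returned when loading a DLL fails.
LOAD_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND: the path itself does not exist",
    5: "ERROR_ACCESS_DENIED: the account cannot read the file",
    126: "ERROR_MOD_NOT_FOUND: the DLL or one of its dependencies could not be found",
    127: "ERROR_PROC_NOT_FOUND: a required export is missing",
    193: "ERROR_BAD_EXE_FORMAT: not a valid image for this process (damaged or wrong bitness)",
}


def explain_load_error(code):
    """Translate a Win32 load error code into a short description."""
    return LOAD_ERRORS.get(code, f"Win32 error {code}")


def try_load(path):
    """Try to load a DLL. Return None on success, else the Win32 error code."""
    if sys.platform != "win32":
        raise RuntimeError("DLL loading can only be tested on Windows")
    try:
        ctypes.WinDLL(path)
        return None
    except OSError as exc:
        # On Windows, ctypes reports the loader failure in exc.winerror.
        return exc.winerror
```

On the affected server you would call try_load with the full path of xpstar90.dll and pass the result to explain_load_error. A result of 126 matches the reason code in the event log.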

I decided to look at the dependency tree of xpstar90.dll in Dependency Walker. I opened C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\xpstar90.dll in it and got this output.



So, in this case XPSTAR90.DLL itself was corrupt. I found its version (2005.90.4035.0) and replaced it with another copy that I had in another instance.
SQL Agent came online.
If it does not, then we need to uninstall Native Client from Add/Remove Programs and reinstall it.
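
Before swapping the file, a very rough corruption check can be scripted. The sketch below (Python, with a hypothetical helper name pe_sanity_check) only verifies the two signatures every Windows DLL must carry: "MZ" at the start, and "PE\0\0" at the offset stored at byte 0x3C. It catches gross damage only. A DLL with intact headers can still be corrupt further in, so treat a passing result as inconclusive.

```python
import struct


def pe_sanity_check(path):
    """Return a short verdict on whether the file starts like a valid PE image."""
    with open(path, "rb") as f:
        header = f.read(64)
        # Every PE file begins with a 64-byte DOS header starting with "MZ".
        if len(header) < 64 or header[:2] != b"MZ":
            return "missing MZ header"
        # e_lfanew (little-endian DWORD at 0x3C) points to the PE signature.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        f.seek(pe_offset)
        if f.read(4) != b"PE\0\0":
            return "missing PE signature"
    return "headers look valid"
```

Comparing the size and version of the suspect file against a known good copy from another instance is a useful second check.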

Happy learning
Abhay

Monday, August 2, 2010

Error 1117 :The request could not be performed because of an I/O device error.

Our backups were failing under these conditions :

Scenario 1: The system databases plus a few user databases are on local disk, and a few user databases are on LUNs.

Scenario 2: The System & user databases are completely on LUNs

The backups ran for a good while and then failed with Error 1117. I know that taking backups over the network is not supported, but I was racking my brain over this error (1117) to find the reason behind it. After running a few tests on my machine using external HDDs, my understanding of this error is:


-> Error 1117 is ERROR_IO_DEVICE. That's fine, but I was curious about the situations in which this error can occur and what exactly it means. Does ERROR_IO_DEVICE mean that the hardware is faulty? I found that this error is raised in the situations below, and found the reason behind each of them as well:

STATUS_FT_MISSING_MEMBER
ERROR_IO_DEVICE
An attempt was made to explicitly access the secondary copy of information via a device control to the fault tolerance driver and the secondary copy is not present in the system.

STATUS_FT_ORPHANING
ERROR_IO_DEVICE
{FT Orphaning} A disk that is part of a fault-tolerant volume can no longer be accessed.

STATUS_DATA_OVERRUN
ERROR_IO_DEVICE
{Data Overrun} A data overrun error occurred.

STATUS_DATA_LATE_ERROR
ERROR_IO_DEVICE
{Data Late} A data late error occurred.

STATUS_IO_DEVICE_ERROR
ERROR_IO_DEVICE
The I/O device reported an I/O error.

STATUS_DEVICE_PROTOCOL_ERROR
ERROR_IO_DEVICE
A protocol error was detected between the driver and the device.

STATUS_DRIVER_INTERNAL_ERROR
ERROR_IO_DEVICE
An error was detected between two drivers or within an I/O driver.


So this mapping says that the error is raised if any of these conditions is met. In my situation we fell under STATUS_DATA_LATE_ERROR, since we were also getting these entries in the SQL Server errorlog: "x I/O requests are pending for more than 15 secs ............filename.mdf"
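
To see which files are hit most by these delays, the errorlog can be scanned with a small script. This is a rough Python sketch. It assumes the usual wording of the message ("... encountered N occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [path] ..."), so adjust the pattern if your build words it differently. The function name slow_io_by_file is hypothetical.

```python
import re
from collections import Counter

# Assumed wording of the long-I/O warning; adjust it to match your errorlog.
SLOW_IO = re.compile(
    r"encountered (\d+) occurrence\(s\) of I/?O requests taking longer than "
    r"15 seconds to complete on file \[(.+?)\]",
    re.IGNORECASE,
)


def slow_io_by_file(lines):
    """Sum the reported long-I/O occurrences per database file."""
    totals = Counter()
    for line in lines:
        match = SLOW_IO.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals
```

Pass it the lines of the errorlog (the file is normally UTF-16, so open it with encoding="utf-16" if plain reading gives garbage). If the files on one LUN dominate the counts, that points at that path to the storage.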

If you are running backup jobs you might also get error -1073548784.
This is a common error and may come up when the query you are running remotely is incorrect, or when the table you are trying to drop does not exist. Try to export a table that already exists in another DB and you will reproduce this OLEDB error. So we need not worry about finding the message identifier for this number.
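
If you do want to look such a number up, it helps to know that these negative values are simply 32-bit codes printed as signed integers. A small Python sketch (to_unsigned_hex is a hypothetical helper name) converts them to the hex form that error references usually list:

```python
def to_unsigned_hex(code):
    """Show a signed 32-bit error code as an 8-digit unsigned hex string."""
    return f"0x{code & 0xFFFFFFFF:08X}"


print(to_unsigned_hex(-1073548784))  # prints 0xC002F210
```

Searching for the hex form is usually quicker than searching for the negative decimal number.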


Action plan :
-----------------
--Try to take a backup of another database of about the same size (around 20 GB) that is also located remotely.

--Run chkdsk on this drive, or ask someone to do so, and see whether any consistency errors come up.

--Create a similar database on another external drive like this one and take a backup of it.
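
For a quick comparison between drives, sequential writes can also be timed with a small script. The Python sketch below is only a rough probe under my own assumptions (the name timed_write, the sizes and the threshold are all hypothetical), and it is not equivalent to the I/O pattern of a SQL Server backup. It writes a test file in chunks, forces each chunk to disk, and reports any chunk that took longer than the threshold.

```python
import os
import time


def timed_write(path, total_mb=64, chunk_mb=4, slow_after=1.0):
    """Write total_mb of zeros to path; return (chunk_index, seconds) for slow chunks."""
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    slow = []
    with open(path, "wb") as f:
        for index in range(total_mb // chunk_mb):
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device, not just the OS cache
            elapsed = time.perf_counter() - start
            if elapsed > slow_after:
                slow.append((index, elapsed))
    return slow
```

Run it against a test file on the suspect drive and again on a known good drive, then delete the test files. Long stalls on one drive only support the theory that the drive or its path is the problem.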


Conclusion :
---------------
I am quite certain that the issue is with the drive and/or the network. The 15-second I/O delay messages in the errorlog suggest the same. But as you can see, this error also comes up when data is late in reaching the destination (STATUS_DATA_LATE_ERROR), so I suspect that the network might also be a bit slow and contributing to the backup failure.

Now the ball is in your court as to how you explain this to the client :) .

Happy Learning