tag:blogger.com,1999:blog-84366125316364286472024-02-07T09:02:17.827+05:30Yukon-Katmai DiscussionUnknownnoreply@blogger.comBlogger80125tag:blogger.com,1999:blog-8436612531636428647.post-91956695507631781382012-01-18T10:32:00.002+05:302012-01-18T10:45:22.541+05:30Replication : Latency but no Latency<div dir="ltr" style="text-align: left;" trbidi="on">
I am sure those of you who have been through replication latency issues know that they are not easy to crack. Expertise in replication, especially in latency-related issues, cannot be learned or taught in a class.<br />
<br />
Mostly, issues related to latency are due to:<br />
<br />
1: Blocking on the distributor, publisher, or subscriber<br />
2: High resource consumption: CPU, storage, memory<br />
3: Network issues<br />
4: A huge distribution database (MSrepl_transactions and MSrepl_commands)<br />
5: A huge publisher transaction log (too many VLFs) causing Log Reader latency.<br />
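Before digging into any of these, it helps to measure the backlog first. A minimal sketch, assuming the default distribution database name <b>distribution</b>:

```sql
-- On the publisher: Log Reader throughput and latency per published database
EXEC sp_replcounters;

-- On the distributor: commands delivered vs. still pending per article
SELECT agent_id,
       article_id,
       UndelivCmdsInDistDB,  -- commands still waiting to reach the subscriber
       DelivCmdsInDistDB     -- commands already delivered
FROM distribution.dbo.MSdistribution_status
ORDER BY UndelivCmdsInDistDB DESC;
```

A steadily growing UndelivCmdsInDistDB with a healthy Log Reader is exactly the "hourglass" pattern described below.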
<br />
I am sharing a case where there was latency, but none of the causes above applied:<br />
- There was no blocking.<br />
- CPU, memory, and disks were doing fine.<br />
- There was no network issue.<br />
- We shrank the publisher database, which did not help at all.<br />
- The cleanup job was running fine. We ran update statistics with a full scan on MSrepl_transactions and MSrepl_commands, which did not help either.<br />
<br />
When latency started piling up, we first looked at the Log Reader and Distribution Agent history in the distribution database, but could not get much information due to another issue (out of scope for this post): the Log Reader history was not showing up, and the Distribution Agent history showed almost no data reaching the subscriber. Normally, every 5 minutes the throughput of the Log Reader and Distribution Agent threads (reader and writer threads) is written into the MSlogreader_history and MSdistribution_history system tables, but we were getting false entries there.<br />
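The agent history mentioned above lives in the distribution database and can be queried directly; a sketch, again assuming the default <b>distribution</b> database:

```sql
-- Log Reader throughput, newest entries first
SELECT TOP (20) time, delivered_transactions, delivered_commands,
       delivery_rate, delivery_latency, comments
FROM distribution.dbo.MSlogreader_history
ORDER BY time DESC;

-- Distribution Agent throughput (reader and writer threads)
SELECT TOP (20) time, delivered_transactions, delivered_commands,
       delivery_rate, delivery_latency, comments
FROM distribution.dbo.MSdistribution_history
ORDER BY time DESC;
```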
<br />
We had no choice but to capture verbose logs for the Distribution and Log Reader Agents. From the logs it was clear that the Log Reader Agent was delivering more than 1000 commands/sec while the Distribution Agent was delivering fewer than 100 commands/sec.<br />
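For reference, verbose agent logging of this kind is captured by adding the standard -Output and -OutputVerboseLevel parameters to the agent's job step command (the path below is just an example):

```
-Output D:\ReplLogs\distrib_verbose.log -OutputVerboseLevel 2
```

Level 2 is the most detailed and can grow the file quickly, so remove the parameters once the data is collected.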
<br />
We were totally clueless because:<br />
- The Log Reader was running at its usual speed, so there was no issue at the publisher, nor any issue pumping data to the distributor. The Log Reader verbose log made this clear, so this was ruled out.<br />
<br />
- Data flow from distributor to subscriber was slow, but we could not tell whether the agent was slow reading from the distribution database or slow pushing data to the subscriber database. What we could clearly see was that there was almost no activity on the subscriber.<br />
<br />
- There were no resource bottlenecks. Something was simply stopping the data from flowing from distributor to subscriber. It was like an hourglass with a lot of sand at the top but a hole so small that very little can pass through.<br />
<br />
I then went through the Distribution Agent verbose logs again and found very frequent entries related to committing transactions. The same sentence repeated over 5-6 continuous lines was not very clear, but I could sense that something related to committing transactions was happening. These groups of 5-6 entries repeated very frequently, each spell taking around 2 seconds.<br />
<br />
This gave us a hint, and we immediately jumped to the Distribution Agent profile to see whether someone had modified a setting that might be causing the issue. We made a mistake again: we just right-clicked Replication &gt;&gt; Distributor Properties &gt;&gt; Agent Profiles &gt;&gt; Distribution Agent, and everything looked fine there.<br />
<br />
<b>However, one of our colleagues opened the profile of the specific distribution agent that was giving us the issue. This was the right way; we needed to do this.</b> We found that the -CommitBatchSize (default value 100) and -CommitBatchThreshold (default value 1000) values had both been changed to 10.<br />
<br />
We changed them back to the defaults and recycled the Distribution Agent. That's it; the story ends here.<br />
With the reduced values, the agent was committing after almost every command, so transactions were being delivered at roughly 1 command per transaction.<br />
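For context, these two Distribution Agent parameters control how often the agent commits at the subscriber. Shown below with their documented default values as they would appear among the agent's job step parameters:

```
-CommitBatchSize 100        -- transactions delivered before a commit at the subscriber
-CommitBatchThreshold 1000  -- replication commands delivered before a commit
```

Dropping both to 10 forces a commit roughly every 10 commands, which explains the constant stream of commit entries in the verbose log.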
<br />
Even though it was a small setting that caused hours of slogging, the experience of working on a replication latency issue with no actual resource bottleneck (I call it false latency) was amazing.<br />
<br />
Cheers and Happy Learning ..</div>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-8436612531636428647.post-58444246786630286812012-01-14T13:29:00.001+05:302012-01-14T13:29:34.047+05:30Can restoring a database to another instance reduce Index fragmentation of underlying tables ?<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: Verdana, sans-serif;">The answer is NO, but it took me some time to reach that conclusion. One of my colleagues came to me with this question, and my instant answer was a clear NO. But then I curiously asked him why he was asking. According to him, the nightly index reorg job, which should run for a very long time, finished in just 3 hours. They also could not check the index fragmentation directly, since on their huge tables that takes around 2 hours.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<span style="font-family: Verdana, sans-serif;">I thought of 3 reasons :</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<span style="font-family: Verdana, sans-serif;">1) Since our job has logic to reorganize only when there is a certain level of fragmentation, that day no index might have fallen into the rebuild category. This was unlikely but could not be ruled out.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<span style="font-family: Verdana, sans-serif;">2) The restore actually reshuffled the pages and, in that process, cleared out some leaf-level fragmentation. I remembered that restores take more time than backups, so I started believing this.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<span style="font-family: Verdana, sans-serif;">3) Some reorg activity might have been happening while the backup ran. The backup might have captured a copy of the freshly reorganized pages, resulting in a less fragmented restored database.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<span style="font-family: Verdana, sans-serif;">I first started with point 2 and soon realized I was not correct; this did not take much time. For point 1, we added logging to the job so that it produces a readable log when it finishes, but that would take time to generate.</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<span style="font-family: Verdana, sans-serif;">That left option 3. I had a table with &gt;99% fragmentation and 24,085,822 rows</span><span style="font-family: Verdana, sans-serif;">. The table size was around 4 GB. The DBCC SHOWCONTIG output is shared below:</span><br />
<span style="font-family: Verdana, sans-serif;"><br /></span><br />
<br />
<span style="font-family: Verdana, sans-serif;">DBCC SHOWCONTIG scanning 'stats_test' table...</span><br />
<span style="font-family: Verdana, sans-serif;">Table: 'stats_test' (711673583); index ID: 1, database ID: 6</span><br />
<span style="font-family: Verdana, sans-serif;">TABLE level scan performed.</span><br />
<span style="font-family: Verdana, sans-serif;">- Pages Scanned................................: <b>268007</b></span><br />
<span style="font-family: Verdana, sans-serif;">- Extents Scanned..............................: <b>33569</b></span><br />
<span style="font-family: Verdana, sans-serif;">- Extent Switches..............................: <b>268005</b></span><br />
<span style="font-family: Verdana, sans-serif;">- Avg. Pages per Extent........................: 8.0</span><br />
<span style="font-family: Verdana, sans-serif;">- Scan Density [Best Count:Actual Count].......: 12.50% [33501:268006]</span><br />
<span style="font-family: Verdana, sans-serif;">- Logical Scan Fragmentation ..................: 99.22%</span><br />
<span style="font-family: Verdana, sans-serif;">- Extent Scan Fragmentation ...................: 0.01%</span><br />
<span style="font-family: Verdana, sans-serif;">- Avg. Bytes Free per Page.....................: 3334.3</span><br />
<span style="font-family: Verdana, sans-serif;">- Avg. Page Density (full).....................: 58.81%</span><br />
<span style="font-family: Verdana, sans-serif;">DBCC execution completed. If DBCC printed error messages, contact your system administrator.</span><br />
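On SQL 2005 and later, the same numbers can also be pulled from sys.dm_db_index_physical_stats instead of DBCC SHOWCONTIG; a sketch against the same test table:

```sql
-- 'SAMPLED' mode also returns page density; 'LIMITED' is cheaper but skips it
SELECT index_id,
       avg_fragmentation_in_percent,    -- roughly "Logical Scan Fragmentation"
       avg_page_space_used_in_percent,  -- roughly "Avg. Page Density (full)"
       page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.stats_test'), NULL, NULL, 'SAMPLED');
```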
<div style="font-family: Verdana, sans-serif;">
<br /></div>
<div style="font-family: Verdana, sans-serif;">
I then ran the index reorg and watched the percent_complete column in the sys.dm_exec_requests DMV. When it reached 46%, I kicked off a backup of the database.</div>
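Watching progress is just a query against that DMV; percent_complete is populated for operations such as ALTER INDEX REORGANIZE, BACKUP, and RESTORE:

```sql
SELECT session_id,
       command,
       percent_complete,
       estimated_completion_time / 60000 AS estimated_minutes_left
FROM sys.dm_exec_requests
WHERE percent_complete > 0;
```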
<div style="font-family: Verdana, sans-serif;">
<br /></div>
<div style="font-family: Verdana, sans-serif;">
Now let me restore the backup .....fingers crossed :) ....</div>
<div style="font-family: Verdana, sans-serif;">
<br /></div>
<div>
<div>
<span style="font-family: Verdana, sans-serif;">DBCC SHOWCONTIG scanning 'stats_test' table...</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">Table: 'stats_test' (711673583); index ID: 1, database ID: 6</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">TABLE level scan performed.</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Pages Scanned................................: 158633</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Extents Scanned..............................: 19862</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Extent Switches..............................: 19866</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Avg. Pages per Extent........................: 8.0</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Scan Density [Best Count:Actual Count].......: 99.81% [19830:19867]</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Logical Scan Fragmentation ..................: 0.15%</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Extent Scan Fragmentation ...................: 3.47%</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Avg. Bytes Free per Page.....................: 51.2</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">- Avg. Page Density (full).....................: 99.37%</span></div>
<div>
<span style="font-family: Verdana, sans-serif;">DBCC execution completed. If DBCC printed error messages, contact your system administrator.</span></div>
<div style="font-family: Verdana, sans-serif;">
<br /></div>
</div>
<div style="font-family: Verdana, sans-serif;">
Mystery solved ....</div>
<div style="font-family: Verdana, sans-serif;">
<br /></div>
<div style="font-family: Verdana, sans-serif;">
Happy learning ..</div>
</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-50174994888192290912012-01-13T12:25:00.000+05:302012-01-25T23:39:18.713+05:30Replicating SP execution : Issue in SQL Server 2005 SP3 CU2 .Works fine in SQL 2008 SP1<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: 'Trebuchet MS', sans-serif;">Posting after a long gap, and I might still not have posted had I not learned yesterday just how smart replication is.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<b><span style="font-family: 'Trebuchet MS', sans-serif;">Brief Summary :</span></b><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">We have a very large OLTP environment where millions of small queries do inserts and updates (no deletes), and the same is replicated to other subscribers. There is so much data that most of the time we are firefighting latency. Because of the data volume we started archiving, which also started deleting data in batches from the OLTP environment and replicating those deletes to the subscribers. This added further latency, for obvious reasons.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">So, to reduce latency, we started thinking of <b>replicating the execution of the stored procedure</b> that deletes the rows in batches. No, I am not saying that replication is smart just because it can replicate an SP execution; this feature is quite old now and you may well be aware of it already.</span><br />
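For reference, stored procedure execution is published by adding the procedure as an article whose type is 'proc exec'; a sketch using the names from the test setup later in this post:

```sql
-- Run in the publisher database
EXEC sp_addarticle
     @publication   = N'ADV_SP',
     @article       = N'del_stats_scan',
     @source_owner  = N'dbo',
     @source_object = N'del_stats_scan',
     @type          = N'proc exec';   -- or N'serializable proc exec'
```

The 'serializable proc exec' variant replicates the execution only when it ran inside a serializable transaction, which is the safer choice when results must be identical at the subscriber.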
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<b><span style="font-family: 'Trebuchet MS', sans-serif;">Issue that we thought we might face :</span></b><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">We already had all the required tables added as articles in the respective publications, and we thought that if we also added the stored procedure to the publication, replication would try to apply the changes twice. For example, say there are 2 articles in a publication: the first is a table (say REPL_TAB) and the second is an SP (say REPL_SP), and REPL_SP deletes x rows from REPL_TAB.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">Now, if we execute REPL_SP, we thought it should affect the subscriber table twice: once when the replicated SP deletes the rows, and again because the deleted rows of REPL_TAB should also be replicated. So we thought this might not work. We then thought of creating another publication with just this SP as an article, but had the same reservations.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">But logically, I thought the Log Reader Agent should pick the command from the transaction log and be smart enough to replicate it only once. If I run EXEC XYZ, which deletes 10 rows from a table ABC, then it should replicate only EXEC XYZ and not the delete commands, even though that table is also an article in the same (or, for that matter, a different) publication.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">I first tested this on SQL Server 2005 SP3 CU2 and got it partially working. If both articles are in the same publication, executing the stored procedure fills <b>MSrepl_commands</b> and <b>MSrepl_transactions</b> with 1 row each. But with 2 publications holding one article each, executing the SP to delete x rows is replicated twice: first 1 command and 1 transaction, and then x commands and 1 transaction. The distribution history confirms this.</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">I then tested the same on SQL Server 2008 SP1 and it worked like a charm. Below is the proof of concept for your reference:</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<br />
<b><span style="font-family: 'Trebuchet MS', sans-serif;">Test 1:</span></b><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">We have 2 publications on the Adventureworks database. One publishes a table and the other a stored procedure execution (by default, stored procedure execution is not enabled). The stored procedure Del_stats_scan deletes the top 10 rows from table dbo.Stats_State in the publisher database (Adventureworks).</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><b>SQL Server version :</b> </span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 Evaluation Edition on Windows XP SP3</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><b>Publisher database :</b> Adventureworks</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>Table :</b>dbo.Stats_State</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>SP:</b>dbo.del_stats_scan</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>Publication :</b>ADV_Table</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|<span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>Article :</b>dbo.Stats_State</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>Publication:</b>ADV_SP</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>Article :</b>dbo.del_stats_scan</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><b>Subscriber Database :</b> ADV_SUB</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>Table :</b>dbo.Stats_State</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><span class="Apple-tab-span" style="white-space: pre;"> </span>|_<b>SP:</b>dbo.del_stats_scan </span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span><br />
<b><span style="font-family: 'Trebuchet MS', sans-serif;">Jobs :</span></b><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">Log Reader Agents <span class="Apple-tab-span" style="white-space: pre;"> </span>:1</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">Snapshot Agent <span class="Apple-tab-span" style="white-space: pre;"> </span>:2</span><br />
<span style="font-family: 'Trebuchet MS', sans-serif;">Distribution Agent <span class="Apple-tab-span" style="white-space: pre;"> </span>:2 </span><br />
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span></div>
<div>
<b><span style="font-family: 'Trebuchet MS', sans-serif;">Replication is in synchronization :</span></b></div>
<div>
<b><span style="font-family: 'Trebuchet MS', sans-serif;">Queries fired on publisher DB :</span></b></div>
<div style="text-align: left;">
<div style="color: black; font-weight: bold;">
<span style="font-family: 'Trebuchet MS', sans-serif;">Query 1</span></div>
<div style="color: black;">
<span style="font-family: 'Trebuchet MS', sans-serif;">delete top (10) from dbo.stats_test</span></div>
<div style="color: black;">
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"><b>Results :</b></span></div>
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;">Data replicated only once; the second Distribution Agent did not do anything. The reason you see 2 transactions in the Log Reader is that both are the same image: there is only ONE Log Reader per database.</span></div>
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span></div>
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;">Fire these queries to find out what is being replicated :</span></div>
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;">select * from distribution.dbo.MSrepl_commands</span></div>
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;">select * from distribution.dbo.MSrepl_transactions</span></div>
</div>
<div style="color: black;">
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span></div>
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"></span><br />
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"><b>Query 2 : </b></span></div>
<span style="font-family: 'Trebuchet MS', sans-serif;">
</span><br />
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;">exec del_stats_scan</span></div>
<span style="font-family: 'Trebuchet MS', sans-serif;">
</span><br />
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"><b>Results :</b> Data replicated only once. Only the SP execution was replicated; the other Distribution Agent did not do anything. The reason you see 2 transactions in the Log Reader is that both are the same image: there is only ONE Log Reader per database.</span></div>
<span style="font-family: 'Trebuchet MS', sans-serif;">
</span><br />
<div>
<span style="font-family: 'Trebuchet MS', sans-serif;"><br /></span></div>
<span style="font-family: 'Trebuchet MS', sans-serif;">
<div>
Run these queries to find out what has been replicated:</div>
<div>
select COUNT(*) 'No. of rows in Repl_cmds' from distribution.dbo.MSrepl_commands</div>
<div>
select COUNT(*) 'No. of rows in Repl_Trans' from distribution.dbo.MSrepl_transactions</div>
<div>
<br /></div>
<div>
<b>Test 2:</b></div>
<div>
We have 1 publication on the Adventureworks database with 2 articles. One article publishes a table and the other publishes the stored procedure execution (by default, stored procedure execution is not enabled).</div>
<div>
The stored procedure Del_stats_scan deletes the top 10 rows from table dbo.Stats_State in the publisher database (Adventureworks).</div>
<div>
<br /></div>
<div>
<b>SQL Server version :</b> </div>
<div>
Microsoft SQL Server 2008 (SP1) - 10.0.2531.0 Evaluation Edition on Windows XP SP3</div>
<div>
<b>Publisher database :</b> </div>
<div>
Adventureworks</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_Table :dbo.Stats_State</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_SP:dbo.del_stats_scan</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_Publication :ADV_Table</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_Article :dbo.Stats_State</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_Article :dbo.del_stats_scan</div>
<div>
<br /></div>
<div>
Subscriber Database : ADV_SUB</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_Table :dbo.Stats_State</div>
<div>
<span class="Apple-tab-span" style="white-space: pre;"> </span>|_SP:dbo.del_stats_scan </div>
<div>
<br /></div>
<div>
Jobs :</div>
<div>
Log Reader Agents <span class="Apple-tab-span" style="white-space: pre;"> </span>:1</div>
<div>
Snapshot Agent <span class="Apple-tab-span" style="white-space: pre;"> </span>:1</div>
<div>
Distribution Agent <span class="Apple-tab-span" style="white-space: pre;"> </span>:1 </div>
<div>
<br /></div>
<div>
Repeat queries 1 and 2 from Test 1 and see the results.</div>
<div>
<br /></div>
<div>
<b>Conclusion :</b></div>
<div>
SQL Server 2008 replication (the Log Reader) is smart enough to replicate the data only once from the transaction log to the distributor; the distributor then distributes the command to the subscriber. There is a bug in SQL Server 2005 SP3 CU2 where the second test works fine but the first does not: it replicates twice if we execute the SP. You will have to find out which 2005 CU fixed this, or you might want to directly apply SP4.</div>
<div>
<br /></div>
<div>
Happy Learning .</div>
</span></div>
</div>
<br /></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-84119404592674835122011-07-22T13:49:00.000+05:302012-01-15T20:08:22.459+05:30What does resourcedb contain ?<div dir="ltr" style="text-align: left;" trbidi="on">
Many a time this question has been asked (in interviews, or just out of curiosity): what does the resource database contain, and why is it so important to SQL Server?<br />
<br />
Normally we can't see it and hence can't use it. However, there is a way to use the resource database. <b>But be careful: if you mess up anything, you might end up paying a heavy cost.</b><br />
<br />
Since we are discussing this, there is one more point I would like to touch on here. Starting with SQL 2005, there are no directly queryable system tables, only catalog views and DMVs. However, if you query sys.objects and filter on type = 'S', you will notice a lot of system tables listed in the output. So the system tables exist and we can see them, but if you try to query one you will get an annoying 208 error stating that the object does not exist, which is not correct. In this post we will see how to query the resource database, and in a similar manner the system tables of other system and user databases.<br />
<br />
Let us see how we can use the resource database and also query system tables. Start SQL Server with the -m switch (in single-user mode). There are 2 ways:<br />
<br />
1) Through services console (after adding <b>-m</b> do not click on <b>OK</b> but click on <b>start</b> )<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSnH0HcQcfJqK9kViGav2HdU06nFh-BJKQKMalnuARdTmh3ETbxFV70cmsuWPCBLgL-F8RMVnpbwsBTiOFWoQeKijuFleCqLF0z9rVonaeVZBSMDqmoHnR_faPWUvlhfVQA8EfE45zepW3/s1600/untitled.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSnH0HcQcfJqK9kViGav2HdU06nFh-BJKQKMalnuARdTmh3ETbxFV70cmsuWPCBLgL-F8RMVnpbwsBTiOFWoQeKijuFleCqLF0z9rVonaeVZBSMDqmoHnR_faPWUvlhfVQA8EfE45zepW3/s400/untitled.JPG" width="350" /></a></div>
<br />
2) Through DOS prompt<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYc9wXkPQ7b1MDBf4R_4kI1DDic0ExJiuwMgbxg645HqWhZzxFV_55XBLW7b7xCtlPRsInRUaBpi1mXWGalwMRLZC5qg1e8UqOaf7xo9qHbQ9jI5rvWyX4jkOn0-gzXnSiOSP5qlb7EIWc/s1600/untitled1.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYc9wXkPQ7b1MDBf4R_4kI1DDic0ExJiuwMgbxg645HqWhZzxFV_55XBLW7b7xCtlPRsInRUaBpi1mXWGalwMRLZC5qg1e8UqOaf7xo9qHbQ9jI5rvWyX4jkOn0-gzXnSiOSP5qlb7EIWc/s400/untitled1.JPG" width="400" /></a></div>
<br />
Once SQL Server has been started in single-user mode, we can make only one connection. We will connect to SQL Server using the DAC. The DAC option can only be used in the sqlcmd utility, not in OSQL or ISQL. Again, there are two ways to do this. But before attempting a DAC connection, make sure you have enabled the <b>remote admin connections</b> option via <b>sp_configure</b> (you should see a run value of 1):<br />
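Enabling the option is a one-liner; the second call is only to verify that run_value shows 1:

```sql
EXEC sp_configure 'remote admin connections', 1;
RECONFIGURE;
EXEC sp_configure 'remote admin connections';  -- check config_value / run_value
```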
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm8_vxM9ldxF_wsN6I09PCiq20JnofNOjrl7xvZqMMBHE2TQEBqxG20krJwheanTzXgf_J3i6CEiFmGrgOmZqW-zQyN0k16oI9HtSiOrMXdyFPFTUBTFnQwz2dsr7bYZ9gObiKw2N5w0zb/s1600/untitled2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="104" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgm8_vxM9ldxF_wsN6I09PCiq20JnofNOjrl7xvZqMMBHE2TQEBqxG20krJwheanTzXgf_J3i6CEiFmGrgOmZqW-zQyN0k16oI9HtSiOrMXdyFPFTUBTFnQwz2dsr7bYZ9gObiKw2N5w0zb/s400/untitled2.JPG" width="400" /></a></div>
<br />
1) Connecting to SQL Server with the DAC (using SQLCMD) using the -A option<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX0nj-q5x362E2QvWkt1TNI8ygkCevJef8_G5Q_icqKb98xC0RcTG1BcE1rEs4VmwvM8mCpCNdD4WSc5TQs4OsycnGqoUqU9RGzud9O1bL91PnNRnMNCwkG4VtPVgzb9GItP-MJ1-5hxy0/s1600/untitled3.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="59" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjX0nj-q5x362E2QvWkt1TNI8ygkCevJef8_G5Q_icqKb98xC0RcTG1BcE1rEs4VmwvM8mCpCNdD4WSc5TQs4OsycnGqoUqU9RGzud9O1bL91PnNRnMNCwkG4VtPVgzb9GItP-MJ1-5hxy0/s400/untitled3.JPG" width="400" /></a></div>
<br />
2) Connecting via Management Studio<br />
Open Management Studio; it will prompt you for the instance name. Just before the instance name, add <b>admin</b>: <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifVe9rXyeQkDweGxu8APzKrDryOfeNBL_yfPiJCB_59Tv6tkzTgSKEKtd-i5RbZMYPysF0catyaC7GrjdW8KfZlQ5nnR0UzsJXzOt20zIajKE5Qqu4gG5CFQyhj7Gu0p4fY11mYcvBKD0X/s1600/untitled4.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="303" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifVe9rXyeQkDweGxu8APzKrDryOfeNBL_yfPiJCB_59Tv6tkzTgSKEKtd-i5RbZMYPysF0catyaC7GrjdW8KfZlQ5nnR0UzsJXzOt20zIajKE5Qqu4gG5CFQyhj7Gu0p4fY11mYcvBKD0X/s400/untitled4.JPG" width="400" /></a></div>
<br />
You might or might not get this error:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht0DsfCNR8_46sfGkxSjuekYRQG3US0pY5WI2LBM2711NkJ9EYsgmdC6oAdSm2rCC67URPnxGh4hY_Nn97e6wvYH53W3u1l8_ilQVr18zuoB9i6xD6HdJO2V4bgA4iNpW2cBkJjtZ_pFHS/s1600/untitled5.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="102" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEht0DsfCNR8_46sfGkxSjuekYRQG3US0pY5WI2LBM2711NkJ9EYsgmdC6oAdSm2rCC67URPnxGh4hY_Nn97e6wvYH53W3u1l8_ilQVr18zuoB9i6xD6HdJO2V4bgA4iNpW2cBkJjtZ_pFHS/s400/untitled5.JPG" width="400" /></a></div>
<br />
If you get this error, click OK (the error window will go away) and then click <b>Cancel</b> instead of <b>Connect</b>.<br />
You will see a clean screen like the one below:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6KK-nfNDVF-Ftuc-YtxpKjBY2jzFHP03hwu-ZFy9RijEkYnesJ-P6ER6q3js-VSbrrfobqtUYPouAfDQ6Vjulb47XqPfgV5sNkJuz803VWKguzT_LET9u_cjY3B5an5KKFdNnlKbcivgA/s1600/untitled6.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="231" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6KK-nfNDVF-Ftuc-YtxpKjBY2jzFHP03hwu-ZFy9RijEkYnesJ-P6ER6q3js-VSbrrfobqtUYPouAfDQ6Vjulb47XqPfgV5sNkJuz803VWKguzT_LET9u_cjY3B5an5KKFdNnlKbcivgA/s400/untitled6.JPG" width="400" /></a></div>
<br />
<br />
Click on new query <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijVox-FpCoFjleUWzwmd-L4gRE8jeAifffEaFHCa9Xu5GUFpE_lgU2ewd1ARJ3h0FA2q8FCNAGWv2bMz_24FUhFnss48nY5Znxse14ZkrXFyDaZoLWmR_AHSPRwhFpe2YC9FiFUZvLXoH0/s1600/untitled7.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="26" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijVox-FpCoFjleUWzwmd-L4gRE8jeAifffEaFHCa9Xu5GUFpE_lgU2ewd1ARJ3h0FA2q8FCNAGWv2bMz_24FUhFnss48nY5Znxse14ZkrXFyDaZoLWmR_AHSPRwhFpe2YC9FiFUZvLXoH0/s400/untitled7.JPG" width="338" /></a></div>
<br />
You will again see the same connection popup :<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGxDab6HeWqEm4Qi82_y1dgpeXjcc31cE_Kw2qDjRbjJEzfdPuPmXKjafUOC0jovgssGbqHk2ZmaQPVvbS6gFEdbn2TaQ5RzG3YQlZJcOA5ZR-Epl5N2dgaCy16z6EdecvbK0lji2BX4UI/s1600/untitled8.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="303" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGxDab6HeWqEm4Qi82_y1dgpeXjcc31cE_Kw2qDjRbjJEzfdPuPmXKjafUOC0jovgssGbqHk2ZmaQPVvbS6gFEdbn2TaQ5RzG3YQlZJcOA5ZR-Epl5N2dgaCy16z6EdecvbK0lji2BX4UI/s400/untitled8.JPG" width="400" /></a></div>
<br />
This time, click Connect and it will work :). A new query window will open; although you will not see the databases in the left-hand pane, the connection is there and working.<br />
<br />
Run the query 'use mssqlsystemresource' and press F5. It will work:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYUVxE5fhR-LCe9n1sJQfdC252er6e-oNVgaVnsNfvFy_GddRNeOaSGAwj-i6NjkX6viKLZqzVH_hAoua2-VCMXL_-Aq_7s6gz37P5xHORY4RZuTKz_ajsxWE16GbpNyeSVe6GFQZN8v3k/s1600/untitled9.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="151" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYUVxE5fhR-LCe9n1sJQfdC252er6e-oNVgaVnsNfvFy_GddRNeOaSGAwj-i6NjkX6viKLZqzVH_hAoua2-VCMXL_-Aq_7s6gz37P5xHORY4RZuTKz_ajsxWE16GbpNyeSVe6GFQZN8v3k/s400/untitled9.JPG" width="400" /></a></div>
<br />
Also, if you query the sys.sysdbreg system table (an alternative to the sys.sysdatabases compatibility view) you will see the resource database:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1b6uOlaRBDBJEVkxW0SB8m-grYpdwb6y24mVM518HkFbFDjw0n7F7awnejMrzSdIfnmGK9qI9SWp1U4F1HrDcxTxb-LfOvyHPeR9FJXeth1GhjNJpHRGcX4Eq0sm7INlzwfKH2ZJzA9C3/s1600/untitled10.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1b6uOlaRBDBJEVkxW0SB8m-grYpdwb6y24mVM518HkFbFDjw0n7F7awnejMrzSdIfnmGK9qI9SWp1U4F1HrDcxTxb-LfOvyHPeR9FJXeth1GhjNJpHRGcX4Eq0sm7INlzwfKH2ZJzA9C3/s400/untitled10.JPG" width="250" /></a></div>
<br />
This database is currently in read-only mode (trust me :-) ). If you want to cross-check this, run <b>dbcc shrinkdatabase (mssqlsystemresource)</b> and you will see for yourself.<br />
You can set it to read_write mode, though, by running: <b>alter database mssqlsystemresource set read_write</b>. The fact that this database is read-only and that we cannot take a backup of it suggests that it does not hold user data that needs protecting. If you query its tables you will see that they store static information which the engine uses from time to time, much like we store values in a temp table or a variable. So, coming back to the original question: the resource database contains a lot of static information which the engine needs from time to time for its internal use.<br />
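If you only need to confirm which resource database build an instance is running, two documented SERVERPROPERTY values expose this without a DAC connection; a small sketch:<br />

```sql
-- Version and last-update time of the hidden resource database.
-- These properties exist in SQL Server 2005 and later.
SELECT SERVERPROPERTY('ResourceVersion')             AS resource_version,
       SERVERPROPERTY('ResourceLastUpdateDateTime')  AS resource_last_update;
```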
<br />
At the end you might ask, "<b>What is the need to touch the system tables in the database?</b>"<br />
<b>The answer is:</b> We normally do not need to (especially for the resource DB), but there are other databases (system as well as user) holding information we can use to resolve certain issues (by updating those tables as needed). And to resolve those issues, we need to log in this way.<br />
<br />
Hope you have found this interesting. But remember: BE VERY CAREFUL WHEN YOU PLAY AROUND WITH SYSTEM TABLES (as I said in the beginning).<br />
Happy Learning !!</div>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-8436612531636428647.post-53018416882616965722011-06-26T11:16:00.000+05:302011-06-26T11:16:02.340+05:30Replication :Archiving partitioned and non-Partitioned tables (Without removing the articles from the publications)Recently there was a request on the MSDN forums where the poster wanted to archive replicated partitioned tables in the publisher database. I think it would be good to share the solution with everyone here as well.<br />
In this post we will see :<br />
<b><br />
Part 1) archiving the replicated non-partitioned (normal) tables .<br />
Part 2) archiving the replicated partitioned tables .<br />
</b><br />
At the end, you will notice one nice-to-know feature of partitioned tables.<br />
<br />
<b>Part 1) Archiving the replicated non-partitioned (normal) tables .</b><br />
<b>Publisher :</b> DB2Migration<br />
<b>Subscriber :</b> DB2Migration_Sub<br />
<b>Replication Topology :</b> Transactional Replication <br />
<b>Articles :</b> dbo.Test<br />
<b>Other details :</b> Both tables have 10000 rows each after first synchronization.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge3s9uhxQkoxMcmAFv3WXW7KpkQ4KMrKrrQ4YJQdnw9wxbjIjY3h_7PNxzMcZsUt5ek89EjQofh5rCjdRj_wkS29T2Ag4Ae-Krd4zeajIOxBmHBIJXzEWfrRQhlYHHrHbblQ4l5JPEiUin/s1600/untitled.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="181" width="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEge3s9uhxQkoxMcmAFv3WXW7KpkQ4KMrKrrQ4YJQdnw9wxbjIjY3h_7PNxzMcZsUt5ek89EjQofh5rCjdRj_wkS29T2Ag4Ae-Krd4zeajIOxBmHBIJXzEWfrRQhlYHHrHbblQ4l5JPEiUin/s400/untitled.JPG" /></a></div><br />
Now we need to archive the test table in the publisher database but want to keep the subscriber untouched, i.e. the rows in the subscriber should not change. For this example we will delete all the rows of the publisher table.<br />
<br />
How should we do it ?<br />
If I delete any row on the publisher, the same delete will be replicated to the subscriber. One way is to stop the log reader agent and then delete the rows. After this, I can fire sp_repldone on the publisher and start the log reader agent again. Yes, this is perfectly achievable. Here we go ...<br />
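The sequence shown in the screenshots below can be sketched in T-SQL. This is a hedged sketch: the log reader agent job name is illustrative (agent jobs follow a REPL-LogReader-&lt;...&gt; naming pattern, but yours will differ), and it assumes a single publication on this database.<br />

```sql
-- Hedged sketch of the delete-then-sp_repldone approach.
-- WARNING: sp_repldone with @reset = 1 marks ALL pending transactions as
-- replicated; if there is more than one publication on this database you
-- will lose commands for the other subscriptions.
USE DB2Migration;  -- publisher database from this example
GO
-- 1) Stop the log reader agent job (job name is illustrative):
EXEC msdb.dbo.sp_stop_job @job_name = N'REPL-LogReader-DB2Migration';
GO
-- 2) Delete (or archive) the publisher rows; the subscriber stays
--    untouched because the log reader is not harvesting these records.
DELETE FROM dbo.Test;
GO
-- 3) Mark everything in the log as already replicated:
EXEC sp_repldone @xactid = NULL, @xact_seqno = NULL,
                 @numtrans = 0, @time = 0, @reset = 1;
GO
-- 4) Restart the log reader agent job:
EXEC msdb.dbo.sp_start_job @job_name = N'REPL-LogReader-DB2Migration';
```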
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiESTCQ-uetQQTIfLmY2fV1cUzlG51XwlzV0FLgA9ql9a_IxPmnwBudKzk-E4Gn3v_EEk6r9jT62cR5rvQJ4inNMS2SyHWFN6k5BTHKuvtlH1fSTLO4v8jJ-QzPD7LvxBxpDYrIMIVtEXb9/s1600/untitled1.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="191" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiESTCQ-uetQQTIfLmY2fV1cUzlG51XwlzV0FLgA9ql9a_IxPmnwBudKzk-E4Gn3v_EEk6r9jT62cR5rvQJ4inNMS2SyHWFN6k5BTHKuvtlH1fSTLO4v8jJ-QzPD7LvxBxpDYrIMIVtEXb9/s400/untitled1.JPG" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFxp5m686vxwfIeCkgTGGd3d9FweaEHOY8ktYWFTJyinjnqhAKeiVlTpQ-AIqfdzf1nHvrjh-jzyn58G2L-gNg9lKOkHDQQdD25meyg5XjCLeRWyZOMuqHaHvpGETUWt8Uf-ZQwo0g3TfU/s1600/untitled2.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="36" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFxp5m686vxwfIeCkgTGGd3d9FweaEHOY8ktYWFTJyinjnqhAKeiVlTpQ-AIqfdzf1nHvrjh-jzyn58G2L-gNg9lKOkHDQQdD25meyg5XjCLeRWyZOMuqHaHvpGETUWt8Uf-ZQwo0g3TfU/s400/untitled2.JPG" /></a></div><br />
<b>DBCC opentran will show :</b><br />
No active open transactions.<br />
DBCC execution completed. If DBCC printed error messages, contact your system administrator.<br />
<br />
We will now enable the log reader agent and see that there are no transactions to be replicated.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFE9nFp8G41P6XVeG6j1VpxSJVhHJeIR_bZqJq2wNHMWqchcY84plMg72B_065_WG_vEDP87MAvtoWc2P6RWM6y118MqLkMvDubD4FKZkciu-PMsf8TvDkfGiF78_jj6w9tHkdfl9nxRT8/s1600/untitled3.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="191" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFE9nFp8G41P6XVeG6j1VpxSJVhHJeIR_bZqJq2wNHMWqchcY84plMg72B_065_WG_vEDP87MAvtoWc2P6RWM6y118MqLkMvDubD4FKZkciu-PMsf8TvDkfGiF78_jj6w9tHkdfl9nxRT8/s400/untitled3.JPG" /></a></div><br />
SELECT COUNT(*) from both tables will show 0 (zero) and 10000 rows respectively.<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiecM3dSuWq5k4RGTjuxY8Gj01WM9OPbD58sgMFIkdYQ3rgkCv3q8NTCCvk4lejhYKADAKWw1gGkf5fCx6V8Fwe6R5XdwXaqnt9AcSy3ZAAaZBhMiB5nu98STbN1aA9t32TEsrqsV-SfLke/s1600/untitled4.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="196" width="386" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiecM3dSuWq5k4RGTjuxY8Gj01WM9OPbD58sgMFIkdYQ3rgkCv3q8NTCCvk4lejhYKADAKWw1gGkf5fCx6V8Fwe6R5XdwXaqnt9AcSy3ZAAaZBhMiB5nu98STbN1aA9t32TEsrqsV-SfLke/s400/untitled4.JPG" /></a></div><br />
After this we will insert 1000 rows in the publisher table (we need to be careful, as the tables have a primary key). As a result the subscriber now has 11000 rows and the publisher has 1000 rows. <i>This is going to be costly when there are millions of rows, because delete (or update, or insert) is a logged activity</i>. There is one more drawback, perhaps more critical: there is one log reader agent per database. So if there is more than one publication on the same database and we run sp_repldone, we will hurt the other subscriptions and publications. So we have to be careful. Another way is to truncate the table (after moving the data to an archive table), but replicated tables cannot be truncated (<b>why ???</b>..... simple: truncate is a minimally logged activity, and the log reader agent reads the log file to find the transactions marked for replication using sp_replcmds). So to truncate the table, you need to remove the article from the publication. If you want to do that, the steps are:<br />
<br />
*******TEST THIS BEFORE IMPLEMENTING IT IN PRODUCTION********<br />
1) Stop the log reader agent and distribution agent<br />
2) Drop the article(s) from the publication<br />
3) Archive the table to another table (this will be a logged activity) via BULK INSERT, bcp, or the Import/Export wizard<br />
4) Truncate the table<br />
5) Add the article again<br />
6) In the publication properties, set "Keep existing object unchanged" for the "Action if name is in use" option on all the articles. This is the most important step; please cross-check it a few times to make sure "Keep existing object unchanged" is set<br />
7) Generate the snapshot again<br />
8) Start the log reader agent and distribution agent and initiate the new snapshot<br />
*******TEST THIS BEFORE IMPLEMENTING IT IN PRODUCTION********<br />
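For reference, the drop/re-add steps can also be scripted instead of done through the GUI. A hedged sketch (the publication name is illustrative; @pre_creation_cmd = 'none' is the T-SQL equivalent of "Keep existing object unchanged"):<br />

```sql
-- Drop the subscription on the article, then the article itself:
EXEC sp_dropsubscription @publication = N'Test_Pub',  -- illustrative name
                         @article = N'Test', @subscriber = N'all';
EXEC sp_droparticle      @publication = N'Test_Pub', @article = N'Test';
GO
-- ... archive and TRUNCATE dbo.Test here ...
-- Re-add the article; 'none' leaves the existing subscriber object
-- unchanged when the new snapshot is applied:
EXEC sp_addarticle @publication = N'Test_Pub', @article = N'Test',
                   @source_object = N'Test', @source_owner = N'dbo',
                   @pre_creation_cmd = N'none';
```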
Trust me, you are done :). But don't you think it's lengthy and a bit risky? Now let's see something new ...<br />
<br />
<b>Part 2) Archiving the replicated partitioned tables .</b><br />
Let us first create two new databases, <b>followed by</b> partition functions, <b>followed by</b> partition schemes, <b>followed by</b> partitioned tables, <b>followed by</b> inserting data into the tables.<br />
<br />
<b><i>--creating database and filegroups</i></b><br />
create database test<br />
GO <br />
ALTER DATABASE test ADD FILEGROUP [second] <br />
GO <br />
ALTER DATABASE test ADD FILEGROUP [third] <br />
GO <br />
ALTER DATABASE test ADD FILEGROUP [forth] <br />
GO <br />
ALTER DATABASE test ADD FILEGROUP [fifth] <br />
GO<br />
<br />
<i><b>--Adding new files to the filegroups </b></i><br />
USE [master] <br />
GO <br />
ALTER DATABASE test ADD FILE ( NAME = N'test2', FILENAME = N'C:\Program Files\Microsoft SQL Server\test2.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [second] <br />
GO<br />
ALTER DATABASE test ADD FILE ( NAME = N'test3', FILENAME = N'C:\Program Files\Microsoft SQL Server\test3.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [third] <br />
GO <br />
ALTER DATABASE test ADD FILE ( NAME = N'test4', FILENAME = N'C:\Program Files\Microsoft SQL Server\test4.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [forth] <br />
GO <br />
ALTER DATABASE test ADD FILE ( NAME = N'test5', FILENAME = N'C:\Program Files\Microsoft SQL Server\test5.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [fifth] <br />
GO <br />
<br />
<i><b>--The following partition function will partition a table or index into four partitions.</b></i><br />
USE test<br />
GO <br />
CREATE PARTITION FUNCTION [PF_test](int) AS RANGE LEFT FOR VALUES (1,100,1000) <br />
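With RANGE LEFT, each boundary value (1, 100, 1000) belongs to the partition on its left, giving four partitions in total. The built-in $PARTITION function shows where a given value would land (a quick sketch):<br />

```sql
USE test;
GO
-- Partition 1: dummy <= 1, partition 2: 2..100,
-- partition 3: 101..1000, partition 4: > 1000.
SELECT $PARTITION.PF_test(1)    AS v1,    -- partition 1
       $PARTITION.PF_test(100)  AS v100,  -- partition 2
       $PARTITION.PF_test(500)  AS v500,  -- partition 3
       $PARTITION.PF_test(5000) AS v5000; -- partition 4
```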
<br />
<i><b>-- Creating partition scheme </b></i><br />
use test <br />
GO <br />
IF NOT EXISTS (SELECT * FROM sys.partition_schemes WHERE name = N'PS_test') <br />
create PARTITION SCHEME [PS_test] AS PARTITION [PF_test] TO ([second],[third],[forth],[fifth]) <br />
<i>--[Note if you want to have one filegroup for all the files then : create PARTITION SCHEME [PS_test] AS PARTITION [PF_test] All TO ([secondary]) ]<br />
</i><br />
<b><i>--creating table with constraint and assigning a partition scheme to it </i></b> <br />
create table test (dummy [int] primary key constraint test_c check ([dummy] > 0 and [dummy] <=20000)) on ps_test (dummy)<br />
<i><b>--inserting values </b></i><br />
declare @val int<br />
set @val=1000<br />
while (@val > 0)<br />
begin <br />
insert into test..test values (@val)<br />
set @val=@val-1<br />
end<br />
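As an aside, the same 1000 rows can be inserted in a single set-based statement, which is usually much faster than a row-by-row loop; a sketch (it assumes sys.all_objects contains at least 1000 rows, which holds on any normal instance):<br />

```sql
-- Set-based equivalent of the WHILE loop above.
INSERT INTO test..test (dummy)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;
```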
<br />
On Subscriber we will only create the same filegroups and add files to them :<br />
<i><b>--creating database and filegroups</b></i> <br />
create database test_sub<br />
GO <br />
ALTER DATABASE test_sub ADD FILEGROUP [second] <br />
GO <br />
ALTER DATABASE test_sub ADD FILEGROUP [third] <br />
GO <br />
ALTER DATABASE test_sub ADD FILEGROUP [forth] <br />
GO <br />
ALTER DATABASE test_sub ADD FILEGROUP [fifth] <br />
GO <br />
<br />
<i><b>--Adding new files to the filegroups</b></i> <br />
USE [master] <br />
GO <br />
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub2', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub2.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [second] <br />
GO<br />
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub3', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub3.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [third] <br />
GO <br />
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub4', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub4.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [forth] <br />
GO <br />
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub5', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub5.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [fifth] <br />
GO <br />
<br />
Once you are done, create the publication on database TEST and add the article TEST. Once that is done, the Test_Pub publication is ready to publish. After this we will create the subscription to this publication. Our subscriber database is test_sub. Once the initial snapshot is synchronized you will see the following values:<br />
<br />
<i>select OBJECT_ID('test..test')<br />
select OBJECT_ID('test_sub..test')<br />
select * from test.sys.partitions where object_id in (2105058535) order by partition_number <br />
select * from test_sub.sys.partitions where object_id in (133575514) order by partition_number <br />
</i><br />
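The object IDs above are specific to my machine; the same check can be written without hard-coding them by resolving the IDs inline:<br />

```sql
-- Resolve the object IDs inline instead of hard-coding them.
SELECT object_id, index_id, partition_number, rows
FROM test.sys.partitions
WHERE object_id = OBJECT_ID(N'test..test')
ORDER BY partition_number;

SELECT object_id, index_id, partition_number, rows
FROM test_sub.sys.partitions
WHERE object_id = OBJECT_ID(N'test_sub..test')
ORDER BY partition_number;
```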
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1t4iBnxFCynL0A3vmWOmss3GJNoygeuJtZO0c1vndNJU9YaSFF8A4OuBpub5VvZG08TNhn-xsc8Ho0A-d6xfZMIxep3qjffK2REcKTHwFgDDwlyN7rcWKmkfFaIBhnC7PmvI2EDt_DKxH/s1600/untitled5.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="165" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1t4iBnxFCynL0A3vmWOmss3GJNoygeuJtZO0c1vndNJU9YaSFF8A4OuBpub5VvZG08TNhn-xsc8Ho0A-d6xfZMIxep3qjffK2REcKTHwFgDDwlyN7rcWKmkfFaIBhnC7PmvI2EDt_DKxH/s400/untitled5.JPG" /></a></div><br />
So as of now everything is as per plan. The data is synchronized into the correct partitions. Now, if we need to archive the publisher table, we could try the same old approach that we used in Part 1. However, we will try something new here. That something new is <i><b>SWITCHING OF PARTITIONS</b></i> in the table. I will not explain what it means, because you will see it in a few seconds (or you can refer to BOL).<br />
<br />
<b>--Let's first create the archive table on the publisher. It is a replica of the original test table</b> <br />
create table test..test_archive (dummy [int] primary key constraint test_c_a check ([dummy] > 0 and [dummy] <=20000)) on ps_test (dummy)<br />
<b>--Switching the partitions from test to test_archive table on publication</b><br />
ALTER TABLE test..test SWITCH PARTITION 1 TO test_archive Partition 1;<br />
GO<br />
<br />
Msg 21867, Level 16, State 1, Procedure sp_MStran_altertable, Line 259<br />
ALTER TABLE SWITCH statement failed. The table '[dbo].[test]' belongs to a publication which does not allow switching of partitions<br />
<br />
Oops, what happened? Yes, that is true: we cannot switch the partitions of a replicated table unless .......... we explicitly allow partition switching for the publication:<br />
<i>sp_changepublication 'test_pub' ,@property='allow_partition_switch',@value='true'</i><br />
<br />
You will get this message<br />
<i>The publication has updated successfully.</i><br />
<br />
<b>--Switching the partitions from test to test_archive on the publisher. We have 4 partitions, so we switch each one in turn.</b><br />
ALTER TABLE test..test SWITCH PARTITION 1 TO test_archive Partition 1;<br />
GO<br />
ALTER TABLE test..test SWITCH PARTITION 2 TO test_archive Partition 2;<br />
GO<br />
ALTER TABLE test..test SWITCH PARTITION 3 TO test_archive Partition 3;<br />
GO<br />
ALTER TABLE test..test SWITCH PARTITION 4 TO test_archive Partition 4;<br />
GO<br />
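The result of the four switches can also be verified per partition from sys.partitions (a sketch; filtering index_id to 0 or 1 keeps just the heap or clustered index so each row is counted once):<br />

```sql
USE test;
GO
SELECT OBJECT_NAME(object_id) AS table_name, partition_number, rows
FROM sys.partitions
WHERE object_id IN (OBJECT_ID(N'dbo.test'), OBJECT_ID(N'dbo.test_archive'))
  AND index_id IN (0, 1)
ORDER BY table_name, partition_number;
```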
<br />
Now, just check the number of rows in the tables test and test_archive in the publisher database test, and in the test table in the subscriber database test_sub:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwRw5Kw3XsQT-cTwuAhFfTcf8iahw68AmUjjYf0bk3i3HtA21mYJGc3gXvW2lUaw2Q3Fvsl4m6YB-kvmqz5Hk7ktYnKlbTmF3XwNEjm2MjdIZweCLZQJSlyw1bjSGIp-QvkXkyd5Kw9rQo/s1600/untitled6.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="150" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhwRw5Kw3XsQT-cTwuAhFfTcf8iahw68AmUjjYf0bk3i3HtA21mYJGc3gXvW2lUaw2Q3Fvsl4m6YB-kvmqz5Hk7ktYnKlbTmF3XwNEjm2MjdIZweCLZQJSlyw1bjSGIp-QvkXkyd5Kw9rQo/s400/untitled6.JPG" /></a></div>That's the magic :). Did you also notice that we did not explicitly create any partitions for the test_archive table? Let's query sys.partitions and look at the partitions of the test_archive table.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqwX6OSwNIsueX46qoVQ5VnpL6TqVyqQM_fsXF7S71LVzfCe9Gbbo1gTGBlma_VTddpE1fXlH-CRVwIpzScMULmtAilJQwzWo-87Eu-N769N7S9hnnyE5bd0BYYn1xjwWDJsqIleJaNbdZ/s1600/untitled7.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="214" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqwX6OSwNIsueX46qoVQ5VnpL6TqVyqQM_fsXF7S71LVzfCe9Gbbo1gTGBlma_VTddpE1fXlH-CRVwIpzScMULmtAilJQwzWo-87Eu-N769N7S9hnnyE5bd0BYYn1xjwWDJsqIleJaNbdZ/s400/untitled7.JPG" /></a></div>That's the beauty. <i>You did not have to delete or truncate a single row, nor did you remove the article or stop any agent.</i> Now, if you add rows to table TEST they will be replicated to the subscriber as usual. Let's try this by inserting 1000 rows into table test in the publisher and then checking the subscriber table:<br />
<b>--inserting new values in test table ( in publisher database )</b><br />
declare @val int<br />
set @val=2000<br />
while (@val > 1000)<br />
begin <br />
insert into test..test values (@val)<br />
set @val=@val-1<br />
end<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrEJCg_YM1PhrFaqMJxG3TLNsSPkju9jwJkdNAy6ky0CgmTCuoJV9TDtp2zFhSuzLCRd0a4ycT5HoaU3I9CJkxAVXxm4tStAUenSDBC5Aj3EaDBnD9mT1wWa7yNQk_iJGk3GNriSS-72Oq/s1600/untitled8.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="155" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrEJCg_YM1PhrFaqMJxG3TLNsSPkju9jwJkdNAy6ky0CgmTCuoJV9TDtp2zFhSuzLCRd0a4ycT5HoaU3I9CJkxAVXxm4tStAUenSDBC5Aj3EaDBnD9mT1wWa7yNQk_iJGk3GNriSS-72Oq/s400/untitled8.JPG" /></a></div><br />
Suggestions are welcome, as we are here to help each other grow technically. Happy learning!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-26221470443856197002011-06-18T17:23:00.000+05:302011-06-18T17:23:53.392+05:30ORA-00942: table or view does not exist. The complete error is:<br />
select xactts, subguid from MSrepl7 where pubsrv = ? and pubdb = ? and indagent = 0 and subtype = 0 <br />
ORA-00942: table or view does not exist <br />
<br />
We were setting up heterogeneous replication between SQL Server and Oracle 9i. This was done successfully. But when we tried to synchronize the articles (actually we were replicating only a view) we got stuck at this error.<br />
<br />
It was clear that the error was coming from the subscriber. But we were not replicating any object named <b>MSrepl7</b>, and we did not know whether it was a table or a view. Since we were not replicating it, I was sure it was an object that replication itself creates. I found a KB article that talks about this table for DB2: http://support.microsoft.com/KB/313332 .<br />
<br />
Later I found from other Oracle subscribers that MSrepl7 is nothing but a replica of the msreplication_subscriptions table found on SQL Server subscribers. This table is looked up and matched against msrepl_transactions; the columns compared are transaction_timestamp in MSrepl7 and xact_seqno in msrepl_transactions.<br />
<br />
Moving forward, we wanted to find out why this table did not exist on the subscriber. I suspected it should be created during subscription setup, during reinitialization, or while synchronizing.<br />
<br />
<br />
To see if it was really being created, I enabled tracing on the distribution agent (since that is the job that was failing) by appending the following to the agent's run command:<br />
<b>-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel 2</b><br />
<br />
Ran the agent again, which failed with the same error. Checked the file:<br />
<br />
<SNIP> <br />
OLE DB Subscriber 'ERPDEV.WORLD': <b>create table MSrepl7</b> (pubsrv varchar2 (128), pubdb varchar2 (128), publcn varchar2 (128), indagent number (1, 0),subtype number (10, 0), dstagent varchar2 (100),timecol date,descr varchar2 (255), xactts raw (16), updmode number (3, 0), agentid raw (16), subguid raw (16), subid raw (16), immsync number (1, 0)) <br />
<br />
Connecting to Distributor 'MCMSMESVS1.distribution' <br />
[4/21/2010 1:32:46 AM]MCMSMESVS1.distribution: {call master..sp_MScheck_agent_instance(N'MCMSMESVS1-MES-ERPDEV.WORLD-1', 10)} <br />
OLE DB Subscriber 'ERPDEV.WORLD': <b>select xactts, subguid from MSrepl7 where pubsrv = ? and pubdb = ? and indagent = 0 and subtype = 0 <br />
Agent message code 20046. ORA-00942: table or view does not exist </b><br />
<br />
[4/21/2010 1:32:46 AM]MCMSMESVS1.distribution: {call sp_MSadd_distribution_history(1, 6, ?, ?, 0, 0, 0.00, 0x01, 1, ?, -1, 0x01, 0x01)} <br />
<b>Adding alert to msdb..sysreplicationalerts: ErrorId = 12, </b><br />
</SNIP> <br />
<br />
This means that the table is being created, but it is not there. Strange.<br />
<br />
To check what was happening, I connected to the Oracle server and tried to create a test table ....<br />
<br />
<SNIP> <br />
Connected to: <br />
Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production <br />
With the Partitioning, OLAP and Oracle Data Mining options <br />
JServer Release 9.2.0.8.0 - Production <br />
<br />
SQL> create table test(t int); <br />
create table test(t int) <br />
* <br />
ERROR at line 1: <br />
ORA-01031: insufficient privileges <br />
<br />
</SNIP><br />
<br />
It looks like, when synchronization starts, the agent fails to create the required objects on the Oracle side due to a permission issue. Sadly, SQL Server also does not raise an error saying the object could not be created because of permissions (perhaps there is no error handling for this case).<br />
<br />
So I requested the Oracle DBA in charge to grant the appropriate permissions to the login executing the distribution agent job, and to run the distribution agent job again.<br />
<br />
BINGO ........ the issue was resolved. Hope this post helps someone someday.<br />
<br />
Regards<br />
AbhayUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-49474689958995296002011-06-18T16:44:00.000+05:302011-06-18T16:44:29.876+05:30What should we do first : Try to find the solution or try to find the ProblemHad I been asked this question a few years ago , I would have said "I would search for a solution" .<br />
<br />
Most of us do this, i.e. we first try to find the solution. Sometimes we succeed, but most of the time we do not. After many unsuccessful attempts I realized that the path to a solution goes through another step first: finding the problem. I will not go deeper into the theory here.<br />
<br />
A couple of months back one of my colleagues came to me with a problem: "There is a job that fails every Monday." This job takes some values from somewhere and inserts them into SQL Server tables. The error was:<br />
<b>Msg 241, Level 16, State 1, Line 2</b> <br />
<b>Conversion failed when converting datetime from character string.</b><br />
<br />
Earlier, my colleague explained to the client that this is not a SQL issue and suggested that the client touch base with the application team. But the client was no fool. He said that there was some problem in SQL Server and he did not want to go to the DEV team without proof. He is not a techie, though.<br />
<br />
What should I do, Google it or Bing it :-) ? We did not do that.<br />
You can see this message in sys.messages: [select * from sys.messages where message_id=241]<br />
<br />
We decided to reproduce the issue, and within 15 minutes we proved that the format in which the date was entered at the application level must be incorrect, so the datetime datatype was not recognizing it.<br />
<br />
<b>Repro of the issue</b><br />
<br />
<b>Repro 1</b><br />
declare @date datetime ,@string char(100)<br />
select @date =getdate()<br />
set @string =@date<br />
<br />
Command(s) completed successfully.<br />
<br />
<b>Repro 2</b><br />
declare @date datetime ,@string char(100)<br />
select @date ='28/07/2010'<br />
set @string =@date<br />
<br />
<i>Msg 242, Level 16, State 3, Line 2<br />
The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.<br />
</i><br />
<br />
So we are close: this error is similar to 241, but not the same.<br />
<br />
<b>Repro 3</b><br />
declare @date datetime ,@string char(100)<br />
select @date ='2010/07/28'<br />
set @string =@date<br />
<br />
Command(s) completed successfully.<br />
<br />
<b>Repro 4</b><br />
declare @date datetime ,@string char(100)<br />
select @date =NULL <i><-- assuming someone might be putting NULL in the date, and since NULL can be anything, it might not be a CHAR and we would get the error.</i><br />
set @string =@date<br />
select @string<br />
Command(s) completed successfully. <i><-- We did not get the error.</i><br />
<b>Repro 5</b><br />
<br />
declare @date datetime ,@string char(100)<br />
select @date ='NULL'<br />
set @string =@date<br />
select @string<br />
<br />
Msg 241, Level 16, State 1, Line 2<br />
Conversion failed when converting datetime from character string.<br />
<i>Here we got the error.</i><br />
<br />
<b>Repro 6</b><br />
declare @date datetime ,@string char(100)<br />
select @date ='NULL'<br />
<br />
<i>This is more clear </i><br />
Msg 241, Level 16, State 1, Line 2<br />
Conversion failed when converting datetime from character string.<br />
<br />
<b>So the reason could be :</b><br />
The application adds single quotes to every entry. For example, NULL will be converted to 'NULL' and 2010/07/28 will be converted to '2010/07/28'. In this case 'NULL' raises error 241, while the date string is perfectly valid and does not throw an error when inserted into the table (inside SQL Server).<br />
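If the application cannot be fixed immediately, the literal string 'NULL' can be neutralized on the SQL side before conversion; a hedged sketch of one defensive option:<br />

```sql
-- Turn the literal string 'NULL' back into a real NULL before converting.
DECLARE @input char(100);
SET @input = 'NULL';   -- what the application actually sends
SELECT CONVERT(datetime, NULLIF(RTRIM(@input), 'NULL')) AS parsed_date;
-- Yields NULL instead of raising error 241.
```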
<br />
Conclusion :<br />
Always try to find the reason behind the error/issue first rather than jumping at solutions here and there. It might take time, but you will learn more.<br />
<br />
Regards <br />
AbhayUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-53992945885614341512011-06-18T16:30:00.001+05:302011-06-18T16:31:55.412+05:30Immediate Sync option in Transactional Replication : Good or Bad ..Around 4 months back we faced a latency issue in replication. We used tracer tokens and found that the distribution agent was lagging behind. Before I go forward, let me explain that the distribution agent has 2 threads: the reader thread reads rows from the MSrepl_transactions table in the distribution database (this happens in parallel with the log reader agent pumping rows into msrepl_transactions), and the writer thread applies those commands to the subscriber.<br />
<br />
To find out where we were getting delayed, we configured verbose logging at level 2 in the distribution agent job (http://support.microsoft.com/kb/312292). In the output we saw a large time gap between the sp_MSget_repl_commands command and the next command. This command reads the distribution database and populates rows into in-memory tables; those rows are then read by the writer thread and applied to the subscriber database.<br />
<br />
In the verbose log you can also check the statistics, which are written every 5 minutes (available starting with the following builds):<br />
•Cumulative Update 12 for SQL Server 2005 Service Pack 2 <br />
•Cumulative Update 5 for SQL Server 2008 <br />
•Cumulative Update 2 for SQL Server 2008 Service Pack 1<br />
<br />
you will see data between these lines <br />
<br />
*************STATISTICS SINCE AGENT STARTED *******************<br />
[data]<br />
***************************************************************************<br />
<br />
It was clear that the reader thread was taking a lot of time in retrieving the rows as compared to the writer thread writing those commands to the subscriber database .<br />
<br />
The next stage was to check the msrepl_transactions and msrepl_commands tables. We checked the number of rows in those tables and found that there were millions of rows which had already been replicated but were still present in those tables. This was strange. We checked the cleanup job and found that it was running fine.<br />
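A quick way to gauge how much the distribution database is retaining (a sketch; NOLOCK keeps the check itself from blocking the replication agents):<br />

```sql
USE distribution;
GO
SELECT COUNT(*) AS retained_transactions
FROM dbo.MSrepl_transactions WITH (NOLOCK);

SELECT COUNT(*) AS retained_commands
FROM dbo.MSrepl_commands WITH (NOLOCK);
```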
<br />
So what is the issue? Why is the job not removing the replicated rows??<br />
Upon digging deeper, we found that the DBA had selected the <b>"Create a snapshot immediately and keep the snapshot available to initialize subscriptions"</b> option when he configured the replication. As a result, replication is supposed to keep all the transactions cached in the distribution database for the entire retention period. You will also see all the snapshot files in the snapshot folder on the distributor.<br />
<br />
<b>Why it happens :</b> <br />
Every new subscriber added within the subscription retention period needs the initial snapshot, and then the data from the log reader is applied over it. Normally the snapshot grows stale as the database changes over time, so a fresh snapshot would be generated for each new subscriber. With this option set, the same old snapshot is applied first and then all the remaining log reader entries retained in the distribution database are replayed on top of it. That is why all the old transactions are kept until the subscription expires.<br />
<br />
<b>Why is it configured ?</b><br />
This option is configured in environments where subscriptions need to be created quite often and the snapshot grows considerably in size over time.<br />
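You can check whether a publication was created with this option directly from the publication database; a sketch (the syspublications table lives in each published database, and anonymous subscriptions require immediate_sync, so the two flags usually need to be turned off together):<br />

```sql
-- Run in the published (publisher-side) database.
SELECT name, immediate_sync, allow_anonymous
FROM dbo.syspublications;
```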
<br />
<b>How is it configured ?</b><br />
<b>Command line :</b> via sp_addpublication with @immediate_sync = 'true' (check BOL)<br />
<b>GUI :</b>
<div class="separator" style="clear: both; text-align: center;"><br />
<br />
<a href="http://4.bp.blogspot.com/-tJdnuh5vJDM/TfyEcwJuXVI/AAAAAAAADoQ/zhEuCOndSCY/s1600/immediate_sync.JPG" imageanchor="1" style="clear:left; float:left;margin-right:1em; margin-bottom:1em"><img border="0" height="291" width="320" src="http://4.bp.blogspot.com/-tJdnuh5vJDM/TfyEcwJuXVI/AAAAAAAADoQ/zhEuCOndSCY/s320/immediate_sync.JPG" /></a></div><br />
<b>Drawback :</b> <br />
Due to this option, the size of MSrepl_transactions and MSrepl_commands grows well beyond normal, which slows down synchronization and clogs up the system.<br />
<br />
<b>How to disable it ?</b><br />
-- Run at the publisher on the publication database. If anonymous subscriptions are allowed, set allow_anonymous to 'false' first.<br />
EXEC sp_changepublication <br />
@publication = 'your_publication_name',<br />
@property = 'immediate_sync', <br />
@value = 'false' <br />
GO <br />
<br />
Happy Learning !!<br />
<br />
<b>Address translation: How Virtual Memory is mapped to Physical Memory</b> (2011-06-18)<br />
<b>Introduction</b><br />
We all know that data retrieval is fast if the data pages are found in RAM. If the data pages are not in RAM, they are fetched into RAM from the disk, which causes a physical IO. The page remains in RAM until it is again kicked out to disk.<br />
But processes and threads do not access the physical memory (RAM) directly. Instead, RAM is accessed indirectly through virtual memory, or virtual address space (VAS), pointers. On an x86 operating system the number of such pointers in virtual memory that can point to physical memory is 4,294,967,296 (2^32), which equals 4GB. Out of these 4GB of VAS pointers, 2GB worth of pointers are located in the kernel address space and the remaining 2GB in the user address space. It is this 2GB of user address space which processes and threads use and map to RAM. The other 2GB of kernel address space is also mapped to RAM, for the OS routines and APIs. So, normally, on a 32-bit Windows OS, SQL Server will use 2GB of RAM (an ~1.66GB buffer pool region and a 384MB MemToLeave region).<br />
But what is the need for virtual memory when we have physical memory, the real memory? Let me correlate this to a smart bank. The bank started with $5000. A customer deposited $1000, and the bank will return $1100 after a year. After a month, another customer deposited $2000 for a year, and the bank will return $2200 after that year. So the bank now holds $8000 for around two years. Then someone took a loan of $3000 for a year, and the bank will get $4000 back after a year. If, in between, the earlier two clients want to withdraw their deposits before time, they can pay the penalty, and the bank has sufficient money to give back from the initial investment. In reality, banks and moneylenders keep revolving money which they might not even have.<br />
<br />
I hope you have some idea now. The OS also works like this: it assures every process 4GB of memory. Right now there are 119 processes running on my laptop. If we go by this fact, the OS is promising 476GB of memory to the processes, but I have only 2GB of RAM on this laptop. That is where virtual memory comes into the picture. This 476GB is actually virtual address space and nothing else; it does not physically exist. The OS memory manager maps this virtual memory to physical memory (RAM). During this process, the page file on disk is also used if a thread needs more physical memory than is available.<br />
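The arithmetic behind these numbers is simple to check; a quick sketch (the 119-process count is just the example figure from above):

```python
GB = 1024 ** 3

# A 32-bit pointer can address 2^32 bytes = 4 GB of virtual address space.
vas_per_process = 2 ** 32
assert vas_per_process == 4 * GB

# Default split on 32-bit Windows: 2 GB user space, 2 GB kernel space.
user_space = kernel_space = vas_per_process // 2

# 119 processes, each promised its own 4 GB of virtual address space.
processes = 119
promised = processes * vas_per_process
print(promised // GB)  # -> 476
```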
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBil-roV3QaQOUpa74DRBJJiBESRRdkKakk_TrsamdGnVNg-NrUc-C0v4IfuJOl8S8GGx0WhY0ZE0eTbblrssedaJpDQq_zy4YjJ-KdTiShVttpC46ZI0wL1Wmgc7FyZNUfF8mpg65JIHB/s1600/add_tr.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBil-roV3QaQOUpa74DRBJJiBESRRdkKakk_TrsamdGnVNg-NrUc-C0v4IfuJOl8S8GGx0WhY0ZE0eTbblrssedaJpDQq_zy4YjJ-KdTiShVttpC46ZI0wL1Wmgc7FyZNUfF8mpg65JIHB/s400/add_tr.JPG" /></a></div><br />
Let’s skip discussing AWE, /3GB, /USERVA and PAE for now, as it would divert us from the topic, which is to understand how virtual addresses get translated to physical addresses in RAM.<br />
<br />
<b>Address translation</b> is the process of translating the virtual memory to physical memory.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg4AGwMMu9ZqjlVTCTJ4b45Intsyhbq6_rlDsQNRWW4tACykAJdpnWfvBN4-m3MSiRIe47rf-dclX86Ms2F5JgdZ0GtRARDPKZ9iz1jKPZYeJL-qwwMh_bX1pLuRin9r4VClNb9KyIDdaK/s1600/add_tr1.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="376" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgg4AGwMMu9ZqjlVTCTJ4b45Intsyhbq6_rlDsQNRWW4tACykAJdpnWfvBN4-m3MSiRIe47rf-dclX86Ms2F5JgdZ0GtRARDPKZ9iz1jKPZYeJL-qwwMh_bX1pLuRin9r4VClNb9KyIDdaK/s400/add_tr1.JPG" /></a></div><br />
<br />
Every Virtual Address has 3 components:<br />
<br />
The <b>Page Directory Index :</b> For each process, the OS memory manager creates a page directory and uses it to map all the page tables for the process. The address of this directory is stored in a structure called the KProcess block (Kernel Process block); to keep this subject less complex, I will not explain what the KProcess block is. The CPU keeps track of the page directory via a register called CR3, or Control Register 3, whose value is also saved in the KProcess block. So the CPU’s MMU knows where the page directory is located with the help of this register (MMU: <b>http://en.wikipedia.org/wiki/Memory_management_unit</b>). The first 10 bits of the virtual address hold the page directory index value (there are many page directory entries). This tells Windows which page table to use to locate the physical memory associated with the address.<br />
<br />
The <b>Page Table Index</b> : The second 10 bits of a 32-bit virtual address provide an index into this table and indicate which page table entry (PTE) contains the address of the page in physical memory to which the virtual address is mapped.<br />
<br />
<b>The Byte Index:</b> the last 12 bits of a 32-bit virtual address contain the byte offset on the physical memory page to which the virtual address refers. The system page size determines the number of bits required to store the offset. Since the system page size on x86 processors is 4K, 12 bits are required<br />
to store a page offset (4,096 = 2^12).<br />
<br />
<b>Let’s summarize it now:</b><br />
1. The CPU’s Memory Management Unit locates the page directory for the process using the special register mentioned above.<br />
2. The page directory index (from the first 10 bits of the virtual address) is used to locate the (P)age(D)irectory(E)ntry that identifies the page table needed to map the virtual address to a physical one.<br />
3. The page table index (from the second 10 bits of the virtual address) is used to locate the PTE that maps the physical location of the virtual memory page referenced by the address.<br />
4. The PTE is used to locate the physical page. If the virtual page is mapped to a page that is already in physical memory, the PTE will contain the page frame number (PFN) of the page in physical memory<br />
that contains the data in question. If the page is not in physical memory, the MMU raises a page fault, and the Windows page fault–handling code attempts to locate the page in the system paging file. If the page can be located, it is loaded into physical memory, and the PTE is updated to reflect its location. If it cannot be located and the translation is a user mode translation, an access violation occurs because the virtual address references an invalid physical address. If the page cannot be located and the translation is occurring in kernel mode, a bug check (also called a blue screen) occurs.<br />
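The 10/10/12-bit split described in the steps above can be sketched in a few lines of Python (a toy illustration of x86 non-PAE address decomposition, not actual MMU code):

```python
PDI_BITS, PTI_BITS, OFFSET_BITS = 10, 10, 12  # x86, non-PAE layout

def split_va(va):
    """Split a 32-bit virtual address into page directory index,
    page table index, and byte offset."""
    offset = va & ((1 << OFFSET_BITS) - 1)               # low 12 bits
    pti = (va >> OFFSET_BITS) & ((1 << PTI_BITS) - 1)    # next 10 bits
    pdi = va >> (OFFSET_BITS + PTI_BITS)                 # top 10 bits
    return pdi, pti, offset

pdi, pti, offset = split_va(0x12345678)
print(pdi, pti, offset)  # -> 72 837 1656
```

Note how the three indexes together cover exactly 10 + 10 + 12 = 32 bits, and the 12-bit offset matches the 4K (2^12) page size mentioned above.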
<br />
<b>How the address translation happens with PAE in place :</b><br />
Everything is the same as above, except that:<br />
1) There is a new table above the PDEs and PTEs: the Page Directory Pointer Table.<br />
2) The PTEs and PDEs are 64 bits wide, as compared to 32 bits wide when PAE is not enabled.<br />
<br />
<b>How to resolve setup errors when registry keys are missing (Failed to install and configure assemblies C:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog)</b> (2011-06-18)<br />
Hi Guys,<br />
A couple of weeks back we had a setup issue where the SP3 setup was failing on Notification Services and the client components. I am sharing my experience with you because it took us a lot of time to figure out that a few registry keys were missing. This explanation has three parts:<br />
<br />
<b>First Part :</b> Solution of the issue.<br />
<b>Second Part :</b> Reproducing this issue and finding the solution in a better way, so that the same process can be followed for other similar setup issues as well.<br />
<b>Third Part :</b> Two bugs (unfortunately, I could not reproduce them on my machine, but they reproduced repeatedly on the client's machine; I will try again on my machine and file them later).<br />
<br />
<b>First Part : Issue and its solution .</b><br />
<b>Issue</b><br />
As per Security Bulletin MS09-004, we raised a change to patch the DEV server (and later the Prod servers) to SQL Server 2005 SP3. Since the setup was failing on the DEV server, we could not initiate it on Prod until the DEV issue was resolved; till then, Prod was under potential threat of SQL injection. The issue was that we were not able to upgrade SQL Server 2005 RTM (32 bit) to SQL Server 2005 SP3 (32 bit) on Windows Server 2008 (64 bit).<br />
<br />
<b>Error Messages</b><br />
<b>Setup Errors :</b><br />
Failed to install and configure assemblies C:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog.<br />
Error: 2148734209<br />
Error message: Unknown error 0x80131501<br />
Error descrition: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)<br />
<br />
<b>Errors in Application log :</b><br />
Unable to get the file name for the OLE Transactions Proxy DLL. Error Specifics: hr = 0x8004d027, <br />
d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:176, CmdLine: C:\Windows\syswow64\MsiExec.exe -Embedding 0E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484<br />
<br />
Failed to read the name of the default transaction manager from the registry. Error Specifics: hr = 0x00000002, <br />
d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:382, CmdLine: C:\Windows\syswow64\MsiExec.exe -Embedding 0E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484<br />
<br />
<b>Errors in SQL Server :</b> <br />
QueryInterface failed for "ITransactionDispenser": 0x8004d027(XACT_E_UNABLE_TO_READ_DTC_CONFIG).<br />
<br />
<b>Client Environment Details</b><br />
SQL Server 2005 RTM 32 bit Standard Edition (WOW mode)<br />
Windows 2008 64 bit Standard <br />
<br />
<b>Troubleshooting done</b><br />
-> The setup was failing only for Notification Services and the client tools. All other components were successfully upgraded to SP3.<br />
-> Found that both Notification Services and the client tools were failing because of the same error :<br />
<br />
---------------------------------------------------------------------------Product : Client Components<br />
Product Version (Previous): 1399<br />
Product Version (Final) : <br />
Status : Failure<br />
Log File : C:\Program Files (x86)\Microsoft SQL Server\90\Setup Bootstrap\LOG\Hotfix\SQLTools9_Hotfix_KB955706_sqlrun_tools.msp.log<br />
Error Number : 29549<br />
Error Description : MSP Error: 29549 Failed to install and configure assemblies c:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog. Error: -2146233087<br />
Error message: Unknown error 0x80131501<br />
Error description: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)<br />
---------------------------------------------------------------------------<br />
<br />
---------------------------------------------------------------------------Product : Notification Services<br />
Product Version (Previous): 1399<br />
Product Version (Final) : <br />
Status : Failure<br />
Log File : C:\Program Files (x86)\Microsoft SQL Server\90\Setup Bootstrap\LOG\Hotfix\SQLTools9_Hotfix_KB955706_sqlrun_tools.msp.log<br />
Error Number : 29549<br />
Error Description : MSP Error: 29549 Failed to install and configure assemblies c:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog. Error: -2146233087<br />
Error message: Unknown error 0x80131501<br />
Error description: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)<br />
---------------------------------------------------------------------------<br />
-> The detailed MSP logs showed that the error comes when the setup tries to register the Notification Services assembly in the COM+ catalog using RegSvcs:<br />
<b><snip></b><br />
Error: 2148734209<br />
Error message: Unknown error 0x80131501<br />
Error descrition: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)<br />
Error Code: -2146233087<br />
Windows Error Text: Source File Name: sqlca\sqlassembly.cpp<br />
Compiler Timestamp: Sat Oct 25 08:47:00 2008<br />
Function Name: Do_sqlAssemblyRegSvcs<br />
Source Line Number: 155<br />
<b></SNIP></b><br />
<br />
-> So the issue was with registering the Notification Services DLL file, which is common to both NS and the client components. As you can see, RegSvcs was being called from inside the function Do_sqlAssemblyRegSvcs.<br />
<br />
-> We then tried to manually register the microsoft.sqlserver.notificationservices.dll through command prompt :<br />
%windir%\Microsoft.NET\Framework64\v2.0.50727\RegSvcs.exe /fc "C:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll" <br />
<br />
-> This did not work, and we were getting some other error.<br />
<br />
-> Since the COM+ catalog was also showing up in the error, I ran the SQL Server RTM setup and found that the SCC (System Configuration Check) was failing on the COM+ catalog requirement.<br />
-> So I checked Component Services (DCOMCNFG) and found no issues there; everything was working fine. We also noticed that there were no errors related to COM components in the Application logs except the error related to MSDTC, and the 32-bit COM components were also running fine. This was strange, but it made me believe that this error might be misleading.<br />
<br />
-> We checked the SQL Server errorlogs and found the same entry related to MSDTC, but in a slightly different form :<br />
QueryInterface failed for "ITransactionDispenser": 0x8004d027XACT_E_UNABLE_TO_READ_DTC_CONFIG)<br />
<br />
-> We also saw some errors related to MSDTC : <br />
Unable to get the file name for the OLE Transactions Proxy DLL. <br />
Error Specifics: hr = <br />
0x8004d027,d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:176, CmdLine: C:\Windows\syswow64\MsiExec.exe <br />
-Embedding -E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484<br />
<br />
Failed to read the name of the default transaction manager from the registry. Error Specifics: hr = 0x00000002, <br />
d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:382, <br />
CmdLine: C:\Windows\syswow64\MsiExec.exe <br />
-Embedding 0E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484<br />
<br />
-> Now the picture was clear that MSDTC had issues for sure. The COM+ catalog warning was also related to the MSDTC issue (I had resolved the same issue a few years back: <b>http://ms-abhay.blogspot.com/2009/10/msdtc-was-unable-to-read-its.html</b>, but that was on Windows Server 2003).<br />
<br />
-> The first error code is the same 0x8004d027, but the second error code, 0x00000002, which comes before 0x8004d027, tells us that some registry key is missing (there might be more). Error 2 means "The system cannot find the file specified".<br />
<br />
-> Since we were not sure of which key was missing we decided to uninstall and reinstall MSDTC . This will automatically recreate all the missing registry keys.<br />
<br />
-> But there was one more twist. On Windows Server 2008 we cannot simply uninstall MSDTC by using the command "msdtc -uninstall". We have to remove the MSDTC server role, reboot the server, and then re-add MSDTC in the server roles.<br />
<br />
-> We tried that, but even after removing MSDTC from the server roles, it was still showing as running in the Services console. We tried a few times without success.<br />
<br />
-> In between, we also tried to repair .NET Framework 3.5 SP1, which did not help.<br />
<br />
-> I also went through this article, which talks about making sure that all the MSDTC-related keys are present in the registry hive: <br />
http://msdn.microsoft.com/en-us/library/dd300421(v=WS.10).aspx (This was the first step towards the solution, although it did not work by itself.)<br />
<br />
-> However, the keys mentioned in this article were present in both WOW mode and normal mode in the registry.<br />
<br />
-> Then, on my machine, I tried to find the registry keys with the value "OLETransactionManagers". The first hit was : HKEY_CLASSES_ROOT\OLETransactionManagers<br />
<br />
-> Since SQL Server was running in WOW mode on the client's server, we searched for the HKEY_CLASSES_ROOT\Wow6432Node\OLETransactionManagers key and found it on the first attempt.<br />
<br />
-> However, on my laptop this key had some values, while the corresponding key on the client's server (mentioned above) had none.<br />
<br />
-> We found that the 64-bit MSDTC registry keys were there, but the 32-bit (WOW mode) registry keys related to MSDTC were missing.<br />
<br />
-> We created these keys in WOW mode <br />
<br />
<b> HKEY_CLASSES_ROOT\Wow6432Node\OLETransactionManagers <br />
String value name = DefaultTM and value data=MSDTC<br />
<br />
HKEY_CLASSES_ROOT\Wow6432Node\OLETransactionManagers\MSDTC <br />
String value name=DLL and value data=MSDTCPRX.DLL<br />
</b><br />
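For reference, the same keys can be expressed as a .reg file (a sketch of what we created, with the values exactly as listed above):

```reg
Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\Wow6432Node\OLETransactionManagers]
"DefaultTM"="MSDTC"

[HKEY_CLASSES_ROOT\Wow6432Node\OLETransactionManagers\MSDTC]
"DLL"="MSDTCPRX.DLL"
```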
<br />
-> We restarted MSDTC and SQL Server, checked the SQL Server errorlogs, and found that the MSDTC-related error was no longer showing; the SQL Server SCC check was also no longer showing the error related to the COM+ catalog.<br />
<br />
-> This gave us hope that the setup would run successfully, and it did (",).<br />
<br />
<b>Second Part: Reproducing the error </b><br />
I reproduced this on my machine by removing all the keys from HKEY_CLASSES_ROOT\OLETransactionManagers<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPDK8z9TGQlNNn86LJ66phi5Dv_-ypGcmoIFFIIh2iQI_2jr2Wn32vHdR1ifJxlJk9vIn0KgzHd3xY6WWwBC8piNEbZIGrA4tIYU0vTqIVSqq06hwHOd34ZuRGRaDt3z1B6WD1ilE0m5N7/s1600/1.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="106" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPDK8z9TGQlNNn86LJ66phi5Dv_-ypGcmoIFFIIh2iQI_2jr2Wn32vHdR1ifJxlJk9vIn0KgzHd3xY6WWwBC8piNEbZIGrA4tIYU0vTqIVSqq06hwHOd34ZuRGRaDt3z1B6WD1ilE0m5N7/s400/1.JPG" /></a></div><br />
When I ran the setup it gave me the same warnings on SCC :<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgntE7Q8XDgCawEs1hj2frGe3QHcmnme8g8ekQgNpa8ceow6UBbvNis4-OeWdERkxlDPPpwjZTdaS5XVjMsupX3ZtdGQcW7DkJcRVqwZS5tlDPmiO6I-gAUC2862dwE3HWKHbBWyzbqtu9S/s1600/2.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="346" width="379" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgntE7Q8XDgCawEs1hj2frGe3QHcmnme8g8ekQgNpa8ceow6UBbvNis4-OeWdERkxlDPPpwjZTdaS5XVjMsupX3ZtdGQcW7DkJcRVqwZS5tlDPmiO6I-gAUC2862dwE3HWKHbBWyzbqtu9S/s400/2.JPG" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgesruXyR_mwCXITXrmpqSLysgm2E8L0z0NJZp-0k_HSYplA5PTf-B953aNS1a5CYMsAPX8dJ2CntGGNuR0pN0UXrRIOfUZuLWbxcMhshBh4IbA8t2lBv-8jttTE_cm-xlt56uPxD2-KGCP/s1600/2_1.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="205" width="365" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgesruXyR_mwCXITXrmpqSLysgm2E8L0z0NJZp-0k_HSYplA5PTf-B953aNS1a5CYMsAPX8dJ2CntGGNuR0pN0UXrRIOfUZuLWbxcMhshBh4IbA8t2lBv-8jttTE_cm-xlt56uPxD2-KGCP/s400/2_1.JPG" /></a></div><br />
I will still go ahead with the setup and select only the client components to install :<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7jhICNdZLoQ6d655-h0kv8XfyEdYhmd8iE0UujyTy_lELlbVjpQcVbj_0Talczdv8zsMkdTsa_FvRKC7azIJQPB6ynEhNtz_P-F9MWyJgmUL9mlJEr-Wo8EICsnVxP1SEYlqhX-GPvLZg/s1600/3.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="370" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7jhICNdZLoQ6d655-h0kv8XfyEdYhmd8iE0UujyTy_lELlbVjpQcVbj_0Talczdv8zsMkdTsa_FvRKC7azIJQPB6ynEhNtz_P-F9MWyJgmUL9mlJEr-Wo8EICsnVxP1SEYlqhX-GPvLZg/s400/3.JPG" /></a></div><br />
The setup will encounter this error:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcOe78gAbecVZ-DFJgq9ruGz3dLyNL6oUe3Mj8K7_PobGxTK4fSSqJJ_ja2btS0I3ZGjMPI_gxOdx4w0mWkv0E8mIXj7pyjBKwWOlOGMZOlHsx5AghlXWXCpZ1y3Qh5AEzGw4Sn2YpAsEl/s1600/4.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="126" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcOe78gAbecVZ-DFJgq9ruGz3dLyNL6oUe3Mj8K7_PobGxTK4fSSqJJ_ja2btS0I3ZGjMPI_gxOdx4w0mWkv0E8mIXj7pyjBKwWOlOGMZOlHsx5AghlXWXCpZ1y3Qh5AEzGw4Sn2YpAsEl/s400/4.JPG" /></a></div><br />
Let’s check the Application logs:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJZYLfytpfNGhqxKnOitN_LRSv_bWgNt7aaRTnOMoeQW2-OXxw_OBby9pUj6FXr3gCNiK82TueXG0mMrlXDyPIJVJYVHEkTKMtMsTfdi-o8iYKgcLF8JGBudfYIWkeabvJg-wWKhMaM8r-/s1600/5.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="354" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJZYLfytpfNGhqxKnOitN_LRSv_bWgNt7aaRTnOMoeQW2-OXxw_OBby9pUj6FXr3gCNiK82TueXG0mMrlXDyPIJVJYVHEkTKMtMsTfdi-o8iYKgcLF8JGBudfYIWkeabvJg-wWKhMaM8r-/s400/5.JPG" /></a></div><br />
At this stage, let me first introduce a tool that can show you how to find a missing registry key, or whether there are any permission-related issues. I have been using this tool for a long time now: Procmon from Sysinternals (<b>http://technet.microsoft.com/en-us/sysinternals/bb896645</b>). We will use it to find the missing registry keys. However, this tool captures a lot of information about all processes, so we need to PAUSE it as soon as we launch it.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh72A-Ri6vssLLGkfFLZZZIjSFxHUg36ty9w-ZwMm6mHZW_1kA3geJSaBcPMPt4ldPMCZwUbwlrOIdkQQek-XrmQ4ua01AaNMYXE8A6pwPhNod8dWNdss0eS9GhRPSev4aysivnNGlgC4cS/s1600/6.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="81" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh72A-Ri6vssLLGkfFLZZZIjSFxHUg36ty9w-ZwMm6mHZW_1kA3geJSaBcPMPt4ldPMCZwUbwlrOIdkQQek-XrmQ4ua01AaNMYXE8A6pwPhNod8dWNdss0eS9GhRPSev4aysivnNGlgC4cS/s400/6.JPG" /></a></div><br />
Now we will find the process IDs of MSIEXEC.exe from Task Manager since, as per the Application logs, it is the msiexec command that is failing:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjETLmIN4KwGJaI3xH7ciwXXuFyA1jiOYqh9LAkbp1NLEjdYKty6a5Q7esTkR-9R1oRy_gVAi9txiLNQkg3z4-CZbbzcMCKu84qBLcp2eNnismPGWhm9Ly_dUhPhRdaiya6I4ay_zEi7s9Z/s1600/7.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjETLmIN4KwGJaI3xH7ciwXXuFyA1jiOYqh9LAkbp1NLEjdYKty6a5Q7esTkR-9R1oRy_gVAi9txiLNQkg3z4-CZbbzcMCKu84qBLcp2eNnismPGWhm9Ly_dUhPhRdaiya6I4ay_zEi7s9Z/s400/7.JPG" /></a></div><br />
I will now filter it on those 3 PIDs<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYTRLVW8ZRxz2K4q9R9PYBTapm7zfQTSoCB3DksRRmry1C4VXOAtLDG9QO9jabL8FdC_XfwITe3a_IlXPU_jztdAK7EJT3Jzc6_HxRmYtEwgyCE8ozW1FUa5gq6ZyD7DrvKXKPoeCNO6bc/s1600/8.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="73" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYTRLVW8ZRxz2K4q9R9PYBTapm7zfQTSoCB3DksRRmry1C4VXOAtLDG9QO9jabL8FdC_XfwITe3a_IlXPU_jztdAK7EJT3Jzc6_HxRmYtEwgyCE8ozW1FUa5gq6ZyD7DrvKXKPoeCNO6bc/s400/8.JPG" /></a></div><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQanBfdljaG59hWkem2CNkDZ2xVTRc7r8xtHqtWSx3RNHHtbioQAcyuhTXbMwyom7XVfn8VuvQOgiN1cCQvKGDDxFCHadla1Se5dVzAmvgUgo9V4P95pC-mCErXsGuVtZsMS2O9kGV2Ljj/s1600/9.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="251" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjQanBfdljaG59hWkem2CNkDZ2xVTRc7r8xtHqtWSx3RNHHtbioQAcyuhTXbMwyom7XVfn8VuvQOgiN1cCQvKGDDxFCHadla1Se5dVzAmvgUgo9V4P95pC-mCErXsGuVtZsMS2O9kGV2Ljj/s400/9.JPG" /></a></div><br />
Once it’s done, uncheck the PAUSE icon (the magnifying glass) and click the Retry option in the setup window. Quickly after that, PAUSE Procmon again by clicking the magnifying glass icon, and notice the output. You will see a lot of different keys there; we now need to filter out the unnecessary entries:<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj89uV0IZaKI2Txaz_B_kmIlwpGajbyDlctFPIrpRq9rnP13XwKXo0XgZGBBDjeMKPyRLHGbZIJxUFH9trwiSDNGQpIX73-YnM3Ed7q6n0VVKWtmzsRWxaMAJKYylJpBaCHhQSdgzvZm38h/s1600/10.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="251" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj89uV0IZaKI2Txaz_B_kmIlwpGajbyDlctFPIrpRq9rnP13XwKXo0XgZGBBDjeMKPyRLHGbZIJxUFH9trwiSDNGQpIX73-YnM3Ed7q6n0VVKWtmzsRWxaMAJKYylJpBaCHhQSdgzvZm38h/s400/10.JPG" /></a></div><br />
Right click on SUCCESS and select:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGb7t9NuVGxBKCEtvXCM27gAGNAKinYgh4yk2zS7KJnQifMb1t5ktZWMD4E0-oXHmGrbWgNhCnJC_XJtWX8Poh-sNXibNlM5WGBEJ9KTUAIJ-j2X6bf8wObVZs0oZwlGjy7M12TzyJGdHo/s1600/11.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="208" width="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGb7t9NuVGxBKCEtvXCM27gAGNAKinYgh4yk2zS7KJnQifMb1t5ktZWMD4E0-oXHmGrbWgNhCnJC_XJtWX8Poh-sNXibNlM5WGBEJ9KTUAIJ-j2X6bf8wObVZs0oZwlGjy7M12TzyJGdHo/s400/11.JPG" /></a></div><br />
Right click on RegOpenKey and select:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw_lCLz_Y2SjgHx0It47VVKnISc3F8Sjoe_lRWli_fmGj3bAAPrqylJkTr9yD3wk9Uhhy4o-v2XfPzUgsZoB_JHmkvsE7A_r8XaL48-nEZ3ifsNvc8m317cQSTJId_EViMgGrXx_MF7fFV/s1600/12.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="208" width="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgw_lCLz_Y2SjgHx0It47VVKnISc3F8Sjoe_lRWli_fmGj3bAAPrqylJkTr9yD3wk9Uhhy4o-v2XfPzUgsZoB_JHmkvsE7A_r8XaL48-nEZ3ifsNvc8m317cQSTJId_EViMgGrXx_MF7fFV/s400/12.JPG" /></a></div><br />
You are now left with only 6 keys to look at (you can follow the same steps for other missing registry key issues), and all of these keys are RegQueryValue operations:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9Lr8Gzx7MxB1NG20FVScohZ74txoUWE3xJv2i_lC3gXXQPn3ExT6RJi-E207uIErePwZWrTVw7ysx0QxCbUrj_RSW5A3FeuxWDVHqvz_j3hPcGfgtkme3lpjPliz1g7hyoF-KzFlrbJIx/s1600/13.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="45" width="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9Lr8Gzx7MxB1NG20FVScohZ74txoUWE3xJv2i_lC3gXXQPn3ExT6RJi-E207uIErePwZWrTVw7ysx0QxCbUrj_RSW5A3FeuxWDVHqvz_j3hPcGfgtkme3lpjPliz1g7hyoF-KzFlrbJIx/s400/13.JPG" /></a></div><br />
Now, notice the error copied above from the Application log:<br />
*******<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj97k7r2qZyBezlv_fa0DCGP4aFup5NsKCBzwiNprsoZGd7Tdnbm4tX_mpTvQezF5kUdD-akHUTqqglANoPSdPz2H9SckeMSY29nha5x6IiIetyMtJ5EKvgHy5z1a2j-cuopLYGT4IzLb_n/s1600/15.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="31" width="307" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj97k7r2qZyBezlv_fa0DCGP4aFup5NsKCBzwiNprsoZGd7Tdnbm4tX_mpTvQezF5kUdD-akHUTqqglANoPSdPz2H9SckeMSY29nha5x6IiIetyMtJ5EKvgHy5z1a2j-cuopLYGT4IzLb_n/s400/15.JPG" /></a></div><br />
<br />
Now, double-click on each of the 6 keys (starting from the bottom-most) and select the PROCESS tab. You will notice the same command-line string as shown in the Application logs:<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQk0zhUOBkVxJj34Q6et55e5t-61rW3tFiU58fpr8chLhPGFh9Rh6Q1w_BSxbjyYwOkMBBKgmQ2jRz7OXp13bIGaZ65i2KnCFwtpculXvt4xzIzO_C2w2Rk4PN0kWpbgOSEjVn8EtHUhH/s1600/14.JPG" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFQk0zhUOBkVxJj34Q6et55e5t-61rW3tFiU58fpr8chLhPGFh9Rh6Q1w_BSxbjyYwOkMBBKgmQ2jRz7OXp13bIGaZ65i2KnCFwtpculXvt4xzIzO_C2w2Rk4PN0kWpbgOSEjVn8EtHUhH/s400/14.JPG" /></a></div><br />
So the first key that is missing is <b>HKCR\OLETransactionManagers\DefaultTM</b>. The reason it is not showing all the missing keys is that it fails on the first key itself. If you create the first key and then click the Retry button in the setup, Procmon will show you the next missing key.<br />
<br />
But what should its value be? We can check on other servers (preferably of the same version). Create the missing keys and click the Retry button, and the setup will be successful (in our case it was SP3 and not the initial setup, but the resolution is the same).<br />
<br />
<b>Third part : The Bugs</b><br />
1) When the SP3 setup fails, it should ideally roll everything back to normal, so NS and the client tools should keep working. But on the client's server we saw that Management Studio stopped working and threw an error (I don't have the screenshot now; I will try to reproduce it). The only solution is to uninstall the client tools and install them again. During that setup you will again get this error; click Ignore and the setup will complete.<br />
<br />
2) On my laptop, I saw that it also corrupts other components of SQL Server 2008, such as Books Online. Again, I do not have that proof now but will reproduce it. The solution is to uninstall the client tools and reinstall the SQL Server 2008 tools.<br />
<br />
<br />
Happy Learning !!Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-8436612531636428647.post-17843839266192612782011-06-18T14:04:00.000+05:302011-06-18T14:04:25.311+05:30Understanding how SQL Server behaves in a non-preemptive mode while still running on the OS which is preemptiveToday, I was asked how SQL Server behaves in a non-preemptive way while still running on an OS that is preemptive. Even though I explained it theoretically, I was asked whether there is a way to see it practically. It wasn't difficult, and it really proved what I had explained. I also felt that adding practice to theory leaves a much deeper impact.<br />
<br />
<b>NOTE: I am doing all the testing on SQL Server 2008 RTM EVAL, because what I am trying to show works only on SQL Server 2008 and above. On 2005 you will not see what is described here, so I would request you to use SQL 2008 or later.</b><br />
<br />
<b> Windows Scheduling (Preemptive) :</b> <br />
Starting from Windows NT 3.1 and its successors (2000, XP, 2003, etc.), Windows scheduling has been priority driven, i.e. preemptive. Every thread has a priority associated with it, and based on this priority the threads get a time slice (quantum) to run on the CPU. So even if a thread of lower priority is running, the moment a thread of higher priority comes up, the low-priority thread is preempted (interrupted) and the higher-priority thread is scheduled to run on the CPU. However, the scheduler is smart enough to keep the preempted thread near the top of the waiting threads by adjusting its priority (let's not go too deep into this at this point).<br />
<br />
Prior to this, OS scheduling was non-preemptive, i.e. cooperative. Remember the days when Windows 98 used to hang and we had to reboot quite often to get rid of it? Cooperative scheduling works well only if all threads leave the CPU after some time and give other threads a chance (including kernel-mode threads, which are more important and should get a chance to run whenever required). But that does not always happen: some badly behaved application threads don't yield and hence block other threads.<br />
<br />
<b>SQL Server Scheduling (non-preemptive) :</b> <br />
SQL Server has its own scheduling mechanism and does not follow OS scheduling (which looks strange, since it runs on a preemptive OS). It is called UMS (User Mode Scheduler) in SQL Server 2000 and SOS (SQL OS) in 2005 and above. BUT:<br />
<b>1) Why SQL Server does not hang just like windows 98 use to ?</b><br />
<b>Answer:</b> SQL Server does not hang because its threads yield voluntarily. If a thread does not yield within 60 seconds (unlike a faulty application whose threads never yield), SQL Server raises a non-yielding scheduler error and produces a mini dump with the stack information of all the threads in it.<br />
<br />
<b>2) How SQL Server manages to schedule in the non-preemptive way ?</b><br />
<b>Answer: </b>Windows will not schedule a thread that is sitting in an infinite wait; it simply ignores it. SQL Server (actually UMS/SOS) takes advantage of this and cleverly puts all the threads it does not want to schedule into an infinite sleep by calling the WaitForSingleObject function with an infinite timeout. When SQL Server wants a thread to run, it simply signals it; the thread comes out of its sleep, and it is then Windows that schedules it. It is important to know that UMS schedules only ONE SQL Server thread per CPU. However, there is an exception: there are moments where, to complete a task, a thread leaves the SQL Server scheduler and switches to preemptive scheduling — for example, when xp_cmdshell opens notepad, when an extended stored procedure deals with the file system (like reading a file), or when a linked server query runs. In that situation you will see more than one thread per CPU in runnable status, because one thread is scheduled via UMS/SOS and the other directly via OS scheduling.<br />
<br />
Let me show you a demo, since my laptop has only one dual-core processor (it is SQL Server 2008 RTM):<br />
<b>Let's first run a simple query and find the runnable and sleeping threads: </b> <i>Select Status ,* from sysprocesses where status not in ('background')</i><br />
<br />
You will notice that all the SPIDs show a status of <b>sleeping</b> and only one SPID shows a status of <b>runnable</b>. Its wait type will be <b>PREEMPTIVE_OS_WAITFORSINGLEOBJECT</b>. Notice that only the runnable SPID has a KPID associated with it; this KPID is nothing but the worker thread associated with the SPID. You can run the query a few times, but the output will not change except for the KPID, which means one thread yields to another after a context switch. The reason we see the runnable state rather than running is that by the time we get the query output, the thread has already gone back to the runnable state. You might also see other runnable or suspended SPIDs, but that is because they are running in preemptive mode.<br />
<br />
Now lets open another Query window and execute the same command there .<br />
<b>Select Status ,* from sysprocesses where status not in ('background')</b>. This time it is SPID 53 (the current SPID on my machine) that shows the runnable state, while SPID 52 (the previous SPID) is now sleeping.<br />
<br />
Let's do one more experiment. Open a new query window (SPID 51 in my case) and run select @@servicename around 1000 times. Come back to the SPID 53 window and check whether the runnable state shows for SPID 53 or SPID 51. You will notice that SPID 51 is doing its task. But why does it show as sleeping while its CPU value is still increasing? The reason is that when we run the query via SPID 53, during (and only during) that time SPID 51 shows as sleeping because SPID 53 needs to run, so the thread related to SPID 51 yields voluntarily. When that query finishes, SPID 51 picks up again, but we can't see that since we have only one processor :) ...<br />
<br />
Anyway, let me show you a small test where a SQL Server thread goes preemptive. We have two query windows: one with the sysprocesses query (SPID 53) and one calling xp_readerrorlog 100 times (SPID 51). I further modified my sysprocesses query by filtering out sleeping SPIDs.<br />
<br />
<b>Select Status ,* from sysprocesses where status not in ('background','sleeping')</b>. Let's run the query through SPID 51 and then through SPID 53. Notice that you now have two runnable SPIDs. That's because SPID 51 is scheduled by the OS and not by SQL OS\UMS.<br />
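The same observation can be cross-checked through the SOS DMVs available in SQL Server 2005 and above. The sketch below joins sys.dm_os_tasks to sys.dm_os_workers; the is_preemptive column tells you directly which workers have left SOS scheduling (the session_id &gt; 50 filter is just a rough way to exclude system sessions):

```sql
-- Map user sessions to their workers and see which ones are
-- currently running preemptively (scheduled by Windows, not by SOS).
SELECT t.session_id,
       t.scheduler_id,
       w.state,
       w.is_preemptive,
       w.last_wait_type
FROM sys.dm_os_tasks AS t
JOIN sys.dm_os_workers AS w
    ON w.task_address = t.task_address
WHERE t.session_id > 50;   -- rough filter for user sessions
```

While the xp_readerrorlog loop is running, you may see its worker reported with is_preemptive = 1, matching the two runnable SPIDs seen in sysprocesses.<br />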
<br />
Have a nice day and Happy Learning !!!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-72195381443842267412011-04-27T13:23:00.000+05:302011-04-27T13:23:12.516+05:30Finding optimal number of CPUs for a given long running CPU intensive queries (except OLAP queries)Hi Guys ,<br />
<br />
Hope this article will help you in some or the other way one day :) .....<br />
<br />
<b>Introduction:</b><br />
This small article applies to finding the optimal number of CPUs for long-running, CPU-intensive queries/workloads that don't frequently wait for other resources; it does not apply if your queries/workloads often wait for resources (like I/Os, locks, latches, etc.) without consuming CPU in a stretch. It can also provide information on uneven CPU load across NUMA nodes and uneven CPU load within the same NUMA node (the load_factor effect).<br />
It is recommended to analyze Windows Performance Monitor counters for monitoring CPU pressure; processor utilization greater than 75% to 80% indicates CPU pressure. Using Windows Performance Monitor should be the first step, and the procedure suggested in this article should be considered an additional step.<br />
Further, it is very important to find ways to optimize the queries/workload by tuning the database schema before attempting to add CPUs.<br />
<br />
<b>Description:</b><br />
When a customer asks you: "I am running a resource-consuming SQL job and it takes x amount of time. How can I reduce the time so the SQL job completes sooner? Can I add more CPUs? If yes, how many?"<br />
When you see CPU pressure, there are two options: you can either upgrade to faster CPUs or add additional CPUs [assuming the queries are well tuned and normalized]. Upgrading to faster CPUs will always help. Adding CPUs may not help the SQL job run faster unless that job can take advantage of them [read about Max Degree of Parallelism in BOL]. If the customer already has the fastest CPUs available on the market, they have to wait for the next release of faster CPUs. One more choice would be to add CPUs and see if it helps; the procedure below will help you identify whether this is the case.<br />
This method calculates the total user waits for CPU during the SQL workload and suggests additional CPUs if necessary. If CPU usage is at 100% but no task waited for CPU during the workload, then adding CPUs will not help; this is the basis of the calculation.<br />
Current recommendations on this topic calculate the 'signal wait time' to 'wait time' ratio to indicate CPU pressure, but that ratio cannot easily tell you how many additional CPUs are necessary.<br />
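For reference, that commonly cited check looks roughly like the sketch below. Signal wait time is the time a task spent on the runnable queue after its awaited resource became available, so a high percentage hints at CPU pressure, but it does not translate into a CPU count:

```sql
-- Quick probe: fraction of all wait time spent waiting for a CPU
-- after the awaited resource was already available.
SELECT 100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
       AS signal_wait_pct
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;
```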
<br />
<b>Procedure:</b><br />
When concurrent users apply a simultaneous CPU-intensive workload, there can be CPU pressure. We can conclude there is CPU pressure when, at any given moment during this period, at least one user task is waiting for the CPU resource.<br />
In this case one can run the query below to find out how many CPUs, on average, would help the workload scale better. It is more informative to collect this information in short intervals (many samples) rather than just once, so you can see during which part of the workload the CPU pressure was highest; a single sample yields only the average additional CPUs necessary over the entire workload duration.<br />
1. Reset Wait Stats<br />
dbcc sqlperf('sys.dm_os_wait_stats', clear)<br />
2. Apply the workload (you can find a sample workload query at the end of this article; execute it simultaneously in many sessions to simulate concurrent user tasks).<br />
3. Run the query below to find the additional CPUs necessary. It is important to run it right after the workload completes to get reliable information.<br />
<br />
<i>select round(((convert(float, ws.wait_time_ms) / ws.waiting_tasks_count) / (convert(float, si.os_quantum) / si.cpu_ticks_in_ms) * cpu_count), 2) as Additional_CPUs_Necessary,<br />
round((((convert(float, ws.wait_time_ms) / ws.waiting_tasks_count) / (convert(float, si.os_quantum) / si.cpu_ticks_in_ms) * cpu_count) / hyperthread_ratio), 2) as Additional_Sockets_Necessary<br />
from sys.dm_os_wait_stats ws cross apply sys.dm_os_sys_info si where ws.wait_type = 'SOS_SCHEDULER_YIELD' <br />
</i><br />
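If you want interval samples rather than a single average, one minimal sketch is to snapshot the SOS_SCHEDULER_YIELD row periodically into a temporary table and difference consecutive rows afterwards. The table name, sample count, and interval below are made up for illustration:

```sql
-- Hypothetical sampler: run in its own session while the workload
-- is being applied; each row is a cumulative snapshot.
CREATE TABLE #cpu_wait_samples
(
    sample_time         datetime DEFAULT GETDATE(),
    waiting_tasks_count bigint,
    wait_time_ms        bigint
);

DECLARE @i int;
SET @i = 0;
WHILE @i < 20                      -- 20 samples, 30 seconds apart
BEGIN
    INSERT INTO #cpu_wait_samples (waiting_tasks_count, wait_time_ms)
    SELECT waiting_tasks_count, wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = 'SOS_SCHEDULER_YIELD';

    WAITFOR DELAY '00:00:30';
    SET @i = @i + 1;
END
```

Subtracting each row from the next gives per-interval waits, which you can feed into the same formula as the one-shot query above.<br />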
<b>Example:</b><br />
When you have 2 CPUs and run the sample workload with just 1 or 2 concurrent sessions, you will see no recommendation for additional CPUs, unless there is an unbalanced user-task distribution across the CPUs. On the other hand, if you run the workload with 4 concurrent sessions, the query suggests adding 2 additional CPUs; if you run it with 6 concurrent sessions, it suggests adding 4 additional CPUs.<br />
If each workload runs in parallel (MAXDOP not 1), you will also notice an additional-CPU suggestion, and you need to be careful in this case. For example, with 2 CPUs, when you run the workload in parallel (MAXDOP 0/2) with 2 concurrent sessions, you will notice the suggestion to add 2 additional CPUs. This just indicates that the workload is more scalable with more CPUs; parallel query execution, as you can imagine, can consume as many CPUs as you have, and more!<br />
The results are not reliable when other applications are running on the system, and they might be incorrect on a system with hyper-threading enabled.<br />
<br />
<b>Explanation:</b><br />
When more user tasks concurrently need CPU than there are CPUs, the excess user tasks wait for CPU (there are exceptions when the workload is not evenly distributed across CPUs). In this case each user task uses its quantum and then goes into a wait state, waiting for CPU with the wait type SOS_SCHEDULER_YIELD (sys.dm_exec_requests doesn't show this wait type, probably by design, to avoid showing user tasks as waiting when they are merely waiting for CPU; sys.dm_os_wait_stats, however, does include these waits), until all other runnable user tasks have used their quantum. If one measures how many tasks went into this wait state, and for how long, while the workload was applied, it is possible to calculate the CPUs necessary to scale the workload better.<br />
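To make the calculation concrete, here is a worked example with made-up numbers: assume 4 CPUs, a 4 ms quantum (os_quantum / cpu_ticks_in_ms), and wait stats showing 10,000 SOS_SCHEDULER_YIELD waits totalling 40,000 ms. The average wait per yield is 40,000 / 10,000 = 4 ms, i.e. one full quantum, which means that on average one extra runnable task was queued behind each CPU:

```sql
-- (avg wait per yield / quantum in ms) * cpu_count
-- = (40000.0 / 10000) / 4.0 * 4 = 4 additional CPUs suggested
SELECT (40000.0 / 10000) / 4.0 * 4 AS Additional_CPUs_Necessary;
```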
runnable_task_count from sys.dm_os_schedulers is also an indication of CPU pressure, but it is just a probe; one cannot reasonably conclude from it the number of CPUs necessary for a given workload.<br />
<br />
<b>Exception:</b><br />
There is an exception (for OLTP-like workloads) where a user task doesn't consume all of its quantum in a stretch (it goes into some other wait state before the quantum expires, waiting for I/Os, locks, latches, etc.) but continues to run in a loop, using CPU without ever using its full quantum (you know what a quantum is... right :D). The method mentioned here cannot calculate the necessary additional CPUs in this case. The most common example is short transactions that use part of their quantum, start WRITELOG waits, and continue in a loop; inserts using implicit transactions in a loop are a typical example.<br />
<br />
<br />
<b>Sample Workload:</b><br />
Create the table below before running the queries that generate the CPU-intensive workload.<br />
<i>Serial Workload: </i><br />
select max(t1.c2 + t2.c2) from tab7 t1 cross join tab7 t2 option (maxdop 1)<br />
<i>Parallel Workload: </i><br />
select max(t1.c2 + t2.c2) from tab7 t1 cross join tab7 t2<br />
<i>Table:</i><br />
create table tab7 (c1 int primary key clustered, c2 int, c3 char(2000))<br />
go<br />
begin tran<br />
declare @i int<br />
set @i = 1<br />
while @i <= 5000<br />
begin<br />
insert into tab7 values (@i, @i, 'a')<br />
set @i = @i + 1<br />
end<br />
commit tran<br />
go<br />
<br />
<br />
<br />
Happy Learning !!!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-69687887435293289032010-11-09T22:19:00.000+05:302010-11-09T22:19:50.227+05:30SQL Server Setup has encountered the following error:File format is not validToday we faced an issue where SQL Server 2008 R2 setup was failing at the very beginning. This issue can also be reproduced on 2008 and on 2005 (in a slightly different way). Please find the RCA below:<br />
<br />
Version : SQL Server 2008 R2 <br />
OS : Win Server 2008<br />
<br />
Error : <br />
TITLE: SQL Server Setup failure.<br />
------------------------------<br />
SQL Server Setup has encountered the following error:<br />
File format is not valid..<br />
------------------------------<br />
BUTTONS:<br />
OK<br />
------------------------------<br />
<br />
<b>Resolution :</b> <br />
It's very clear that there is a file that does not have a valid format and that SQL Server cannot read, so we need to find which file it is. First, open the setup logs. In SQL 2008 a folder is created with the timestamp and all the logs are created inside it; on my machine it was C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\20101109_201205.<br />
<br />
I first opened Detail_ComponentUpdate.txt <br />
<br />
2010-11-09 20:13:14 Slp: Running Action: GatherUserSettings<br />
2010-11-09 20:13:21 Slp: -- PidPublicConfigObject : ValidateSettings is normalizing input pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : NormalizePid is normalizing input pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : NormalizePid found a pid containing dashes, assuming pid is normalized, output pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPublicConfigObject : ValidateSettings proceeding with normalized pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : Initialize is initializing using input pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : NormalizePid is normalizing input pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : NormalizePid found a pid containing dashes, assuming pid is normalized, output pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : Initialize proceeding with normalized pid=[PID value hidden]<br />
2010-11-09 20:13:21 Slp: -- PidPrivateConfigObject : Initialize called ValidatePid, output is pid=[PID value hidden] validateSuccess=True output editionId=EVAL(0x2467BCA1)<br />
2010-11-09 20:13:21 Slp: -- PidPublicConfigObject : ValidateSettings initialized private object, result is initializeResult=Success<br />
<b><br />
2010-11-09 20:13:22 Slp: Detected localization resources folder: 1033<br />
2010-11-09 20:13:22 Slp: License file: C:\Documents and Settings\Abhay\Desktop\2008 R2 X64\2008_R2_x86\x86\1033\License_EVAL.rtf<br />
2010-11-09 20:13:22 Slp: Error: Action "GatherUserSettings" threw an exception during execution.<br />
</b><br />
2010-11-09 20:13:22 Slp: Microsoft.SqlServer.Setup.Chainer.Workflow.ActionExecutionException: Thread was being aborted. ---> System.Threading.ThreadAbortException: Thread was being aborted.<br />
2010-11-09 20:13:22 Slp: at System.Threading.WaitHandle.WaitOneNative(SafeWaitHandle waitHandle, UInt32 millisecondsTimeout, Boolean hasThreadAffinity, Boolean exitContext)<br />
2010-11-09 20:13:22 Slp: at System.Threading.WaitHandle.WaitOne(Int64 timeout, Boolean exitContext)<br />
2010-11-09 20:13:22 Slp: at System.Threading.WaitHandle.WaitOne(Int32 millisecondsTimeout, Boolean exitContext)<br />
2010-11-09 20:13:22 Slp: at System.Threading.WaitHandle.WaitOne()<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Configuration.UIExtension.Request.Wait()<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Configuration.UIExtension.UserInterfaceProxy.SubmitAndWait(Request request)<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Configuration.UIExtension.UserInterfaceProxy.NavigateToWaypoint(String moniker)<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Configuration.UIExtension.UserInterfaceService.Waypoint(String moniker)<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Configuration.UIExtension.WaypointAction.ExecuteAction(String actionId)<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Chainer.Infrastructure.Action.Execute(String actionId, TextWriter errorStream)<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Setup.Chainer.Workflow.ActionInvocation.InvokeAction(WorkflowObject metabase, TextWriter statusStream)<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Setup.Chainer.Workflow.PendingActions.InvokeActions(WorkflowObject metaDb, TextWriter loggingStream)<br />
2010-11-09 20:13:22 Slp: --- End of inner exception stack trace ---<br />
2010-11-09 20:13:22 Slp: at Microsoft.SqlServer.Setup.Chainer.Workflow.PendingActions.InvokeActions(WorkflowObject metaDb, TextWriter loggingStream)<br />
2010-11-09 20:13:25 Slp: Received request to add the following file to Watson reporting: C:\Documents and Settings\Abhay\Local Settings\Temp\tmp11E.tmp<br />
2010-11-09 20:13:25 Slp: The following is an exception stack listing the exceptions in outermost to innermost order<br />
2010-11-09 20:13:25 Slp: Inner exceptions are being indented<br />
2010-11-09 20:13:25 Slp: <br />
2010-11-09 20:13:25 Slp: Exception type: System.ArgumentException<br />
2010-11-09 20:13:25 Slp: Message: <br />
2010-11-09 20:13:25 Slp: File format is not valid.<br />
2010-11-09 20:13:25 Slp: Stack: <br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.RichTextBox.StreamIn(Stream data, Int32 flags)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.RichTextBox.LoadFile(Stream data, RichTextBoxStreamType fileType)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.RichTextBox.LoadFile(String path, RichTextBoxStreamType fileType)<br />
2010-11-09 20:13:25 Slp: at Microsoft.SqlServer.Configuration.InstallWizard.EULAPidView.UpdateLicenseText(String filepath)<br />
2010-11-09 20:13:25 Slp: at Microsoft.SqlServer.Configuration.InstallWizard.EULAPidController.LoadData()<br />
2010-11-09 20:13:25 Slp: at Microsoft.SqlServer.Configuration.InstallWizardFramework.InstallWizardPageHost.PageEntered(PageChangeReason reason)<br />
2010-11-09 20:13:25 Slp: at Microsoft.SqlServer.Configuration.WizardFramework.UIHost.set_SelectedPageIndex(Int32 value)<br />
2010-11-09 20:13:25 Slp: at Microsoft.SqlServer.Configuration.WizardFramework.UIHost.GoNext()<br />
2010-11-09 20:13:25 Slp: at Microsoft.SqlServer.Configuration.WizardFramework.NavigationButtons.nextButton_Click(Object sender, EventArgs e)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Control.OnClick(EventArgs e)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Button.OnClick(EventArgs e)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Control.WndProc(Message& m)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.ButtonBase.WndProc(Message& m)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Button.WndProc(Message& m)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)<br />
2010-11-09 20:13:25 Slp: at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)<br />
2010-11-09 20:28:33 Slp: Sco: Attempting to write hklm registry key SOFTWARE\Microsoft\Microsoft SQL Server to file C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\20101109_201205\Registry_SOFTWARE_Microsoft_Microsoft SQL Server.reg_<br />
2010-11-09 20:28:33 Slp: Sco: Attempting to write hklm registry key SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall to file C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\20101109_201205\Registry_SOFTWARE_Microsoft_Windows_CurrentVersion_Uninstall.reg_<br />
2010-11-09 20:28:33 Slp: Sco: Attempting to write hklm registry key SOFTWARE\Microsoft\MSSQLServer to file C:\Program Files\Microsoft SQL Server\100\Setup Bootstrap\Log\20101109_201205\Registry_SOFTWARE_Microsoft_MSSQLServer.reg_<br />
<b>2010-11-09 20:28:36 Slp: File format is not valid.</b><br />
2010-11-09 20:28:36 Slp: Watson bucket for exception based failure has been created<br />
2010-11-09 20:28:36 Slp: Sco: Attempting to create base registry key HKEY_LOCAL_MACHINE, machine <br />
2010-11-09 20:28:36 Slp: Sco: Attempting to open registry subkey Software\Microsoft\PCHealth\ErrorReporting\DW\Installed<br />
2010-11-09 20:28:36 Slp: Sco: Attempting to get registry value DW0200<br />
2010-11-09 20:29:01 Slp: Submitted 1 of 1 failures to the Watson data repository<br />
<br />
Assuming that there is a file with an incorrect format, I took a chance and opened the file mentioned in the error above: C:\Documents and Settings\Abhay\Desktop\2008 R2 X64\2008_R2_x86\x86\1033\License_EVAL.rtf. Since this is an RTF file, we can open it in WordPad.<br />
<br />
When opened, I found it unreadable. Initially I thought it was supposed to be like that, as something in it might be encrypted.<br />
However, other license files in the same folder were perfectly readable.<br />
<br />
This made me curious, and I checked the same file on my machine, as I also had the same EVAL setup. I was able to read it word for word, so it was clear that the file on the server was corrupt. We swapped the file between my machine and the server, and the setup moved forward :).<br />
<br />
Hope this helps .Unknownnoreply@blogger.com6tag:blogger.com,1999:blog-8436612531636428647.post-46777819083527673882010-11-09T22:12:00.000+05:302010-11-09T22:12:33.903+05:30Error 1706. An installation package for the product Microsoft SQL Server 2005 cannot be found. Try the installation again using a valid copy of the installation package 'SqlRun_SQL.msi'.adding a new post after a good gap ...<br />
Recently we faced an issue where we lost the physical files of the master database (master.mdf and mastlog.ldf). We had the backup files, but we could not use them unless SQL Server was up and running, so we had no choice but to rebuild master.<br />
<br />
<i>We tried the step below via DOS prompt :</i><br />
<b>C:\Documents and Settings\Abhay\Desktop\softwares\SQLEVAL_2005\Servers>start /wait setup.exe /qn INSTANCENAME=CORRUPT REINSTALL=SQL_Engine REBUILDDATABASE=1 SAPWD=XXXXX<br />
</b><br />
This had always worked for me and is also mentioned in BOL. However, this time it did not work. The errors in the log file were (<b>to find the error, search for Return Value 3</b>):<br />
<br />
Error 1706. An installation package for the product Microsoft SQL Server 2005 cannot be found. Try the installation again using a valid copy of the installation package 'SqlRun_SQL.msi'.<br />
MSI (s) (2C:08) [14:07:26:265]: User policy value 'DisableRollback' is 0<br />
MSI (s) (2C:08) [14:07:26:265]: Machine policy value 'DisableRollback' is 0<br />
<b>Action ended 14:07:26: InstallFinalize. Return value 3.</b><br />
<br />
The setup files were valid and had been used many times in the past. I ran SqlRun_SQL.msi manually and it ran fine. I also tried two different setup copies and got the same error.<br />
Also:<br />
<br />
MSI (s) (2C:08) [14:07:26:281]: File: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\modellog.ldf; To be installed; Won't patch; No existing file<br />
MSI (s) (2C:08) [14:07:26:281]: Executing op: FileCopy(SourceName=C:\Config.Msi\3bf6ceb.rbf,,DestName=C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\model.mdf,Attributes=32800,FileSize=1245184,PerTick=0,,VerifyMedia=0,ElevateFlags=3,,,,,,,InstallMode=4194308,,,,,,,)<br />
MSI (s) (2C:08) [14:07:26:281]: File: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\model.mdf; To be installed; Won't patch; <b>No existing file<br />
</b>MSI (s) (2C:08) [14:07:26:296]: Executing op: FileCopy(SourceName=C:\Config.Msi\3bf6cea.rbf,,DestName=C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\msdblog.ldf,Attributes=32800,FileSize=786432,PerTick=0,,VerifyMedia=0,ElevateFlags=3,,,,,,,InstallMode=4194308,,,,,,,)<br />
MSI (s) (2C:08) [14:07:26:296]: File: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\msdblog.ldf; To be installed; Won't patch; <b>No existing file</b> <br />
MSI (s) (2C:08) [14:07:26:296]: Executing op: FileCopy(SourceName=C:\Config.Msi\3bf6ce9.rbf,,DestName=C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\msdbdata.mdf,Attributes=32800,FileSize=12255232,PerTick=0,,VerifyMedia=0,ElevateFlags=3,,,,,,,InstallMode=4194308,,,,,,,)<br />
MSI (s) (2C:08) [14:07:26:296]: File: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\msdbdata.mdf; To be installed; Won't patch; <b>No existing file</b><br />
MSI (s) (2C:08) [14:07:26:296]: Executing op: FileCopy(SourceName=C:\Config.Msi\3bf6ce8.rbf,,DestName=C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\mastlog.ldf,Attributes=32800,FileSize=853016576,PerTick=0,,VerifyMedia=0,ElevateFlags=3,,,,,,,InstallMode=4194308,,,,,,,)<br />
MSI (s) (2C:08) [14:07:26:296]: File: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\mastlog.ldf; To be installed; Won't patch; <b>No existing file</b><br />
MSI (s) (2C:08) [14:07:26:312]: Executing op: FileCopy(SourceName=C:\Config.Msi\3bf6ce7.rbf,,DestName=C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\master.mdf,Attributes=32800,FileSize=92602368,PerTick=0,,VerifyMedia=0,ElevateFlags=3,,,,,,,InstallMode=4194308,,,,,,,)<br />
MSI (s) (2C:08) [14:07:26:312]: File: C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\master.mdf; To be installed; Won't patch; <b>No existing file</b><br />
<br />
This was strange, as these files did exist ...<br />
<br />
<b>I ran the same command using the /qb option and got the same error, but in the form of a pop-up box.<br />
</b><br />
I then checked and found that there is an option called <b>REINSTALLMODE</b>. <b>REINSTALLMODE</b> is used to repair installed components. The supported values are:<br />
O – Reinstall if file is missing, or an older version is present.<br />
M – Rewrite machine specific reg keys under HKLM<br />
U – Rewrite user specific reg keys under HKCU<br />
S – Reinstall all shortcuts<br />
<br />
The option O looked appropriate, but I used all of them, i.e.:<br />
<b>C:\Documents and Settings\Abhay\Desktop\softwares\SQLEVAL_2005\Servers>start /wait setup.exe /qn INSTANCENAME=CORRUPT REINSTALL=SQL_Engine REINSTALLMODE=OMUS REBUILDDATABASE=1 SAPWD=XXXXX<br />
</b><br />
This resolved the issue on my laptop, but not on the client's server.<br />
Finally, I found that there is one more option, which is not documented: <b>V</b>.<br />
The issue was that the setup had been copied from a different server, and the original media location from which the RTM bits were installed was recorded in a cache file. That was the reason we were getting the error about the installation package not being found. To resolve it, we had to use the option V to re-cache the media from the new location.<br />
<br />
<b>C:\Documents and Settings\Abhay\Desktop\softwares\SQLEVAL_2005\Servers>start /wait setup.exe /qn INSTANCENAME=CORRUPT REINSTALL=SQL_Engine REINSTALLMODE=V REBUILDDATABASE=1 SAPWD=XXXXX<br />
</b><br />
<br />
This ran like a knife through butter.<br />
<br />
Hope it will help you in future ...Unknownnoreply@blogger.com5tag:blogger.com,1999:blog-8436612531636428647.post-53429607483931126892010-09-29T17:35:00.000+05:302010-09-29T17:35:01.887+05:30Error: 26049, Severity: 16, State: 1 :Server local connection provider failed to listen on [ \\.\pipe\SQLLocal\XXXXX ]. Error: 0x5There might be many reasons and many solutions for this kind of error, but let me explain my situation :) ... For testing, I installed a new default instance on one of the test servers. The setup was successful. However, later one of the other named instances did not come up after a restart.<br />
<br />
The errors were :<br />
<br />
2010-09-30 03:17:31.65 Server Error: 26049, Severity: 16, State: 1.<br />
2010-09-30 03:17:31.65 Server Server local connection provider failed to listen on [ \\.\pipe\SQLLocal\XXXXX ]. Error: 0x5<br />
2010-09-30 03:17:31.65 Server Error: 17182, Severity: 16, State: 1.<br />
2010-09-30 03:17:31.65 Server TDSSNIClient initialization failed with error 0x5, status code 0x40.<br />
2010-09-30 03:17:31.65 Server Error: 17182, Severity: 16, State: 1.<br />
2010-09-30 03:17:31.65 Server TDSSNIClient initialization failed with error 0x5, status code 0x1.<br />
2010-09-30 03:17:31.65 Server Error: 17826, Severity: 18, State: 3.<br />
2010-09-30 03:17:31.65 Server Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.<br />
2010-09-30 03:17:31.65 Server Error: 17120, Severity: 16, State: 1.<br />
2010-09-30 03:17:31.65 Server SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.<br />
<br />
The issue is quite simple, unlike what it looks like. I tried everything, like changing the named pipe, etc.<br />
Assuming that 0x5 is the OS error code for Access Denied, I granted the domain ID permissions on the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\MSSQLServer.<br />
This resolved the issue .<br />
<br />
status code 0x40 means that there is an issue with Shared memory listener<br />
status code 0x50 means that there is an issue with Named pipe listener<br />
status code 0x0A means that there is an issue with TCP/IP listener<br />
<br />
Please go through this MSDN blog (which has one more link in it).<br />
http://blogs.msdn.com/b/sql_protocols/archive/2006/03/09/546655.aspx?wa=wsignin1.0<br />
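When grepping errorlogs for this failure, the status-code-to-listener mapping above can be kept handy as a small lookup. Below is a hypothetical Python helper (the function and dictionary names are mine, not part of any Microsoft tooling), sketching how you might translate the status code from a log line:

```python
import re

# Hypothetical helper: map TDSSNIClient status codes to the listener they
# refer to, per the table above. Not an official API -- just a convenience.
LISTENER_BY_STATUS = {
    0x40: "Shared memory listener",
    0x50: "Named pipe listener",
    0x0A: "TCP/IP listener",
}

def listener_from_log_line(line):
    """Pull 'status code 0xNN' out of an errorlog line and name the listener."""
    match = re.search(r"status code (0x[0-9A-Fa-f]+)", line)
    if not match:
        return None
    return LISTENER_BY_STATUS.get(int(match.group(1), 16), "Unknown listener")

print(listener_from_log_line(
    "TDSSNIClient initialization failed with error 0x5, status code 0x40."))
# prints: Shared memory listener
```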
<br />
Happy Learning !!!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-36959648258207630022010-09-28T16:28:00.000+05:302010-09-28T16:28:39.859+05:30A simple VB script to retain Errorlogs worth 90 days (or as you like)So far, the approach I have heard of is retaining X number of errorlogs, which is widely used (so I am not writing that script here) ... But one of our clients asked us to retain errorlogs worth only 90 days. The client was not ready to recycle the errorlogs and wanted us to keep the count at the default ...<br />
<br />
Finally , we could come out with a simple VB script that can do it .The script code is mentioned below .<br />
<br />
<b>Code :<br />
</b><br />
<br />
sFolder = "C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG"<br />
iMaxAge = 90<br />
Set oFSO = CreateObject("Scripting.FileSystemObject")<br />
If oFSO.FolderExists(sFolder) Then<br />
For Each oFile In oFSO.GetFolder(sFolder).Files<br />
If DateDiff("d", oFile.DateLastModified, Now) > iMaxAge And (oFile.Name = "ERRORLOG.1" Or oFile.Name = "ERRORLOG.2" Or oFile.Name = "ERRORLOG.3" Or oFile.Name = "ERRORLOG.4" Or oFile.Name = "ERRORLOG.5" Or oFile.Name = "ERRORLOG.6") Then<br />
WScript.Echo "Deleting " & oFile.Name<br />
oFile.Delete<br />
End If<br />
Next<br />
End If<br />
<br />
You will need to create a scheduled task (or a SQL Server job using xp_cmdshell) to run it at a specific time. Once it kicks off, if any of the files mentioned in the code (note: the current ERRORLOG will never be touched) were last modified more than 90 days before the day you execute the script, it will delete those files ... For example, if I have the 7 files below :<br />
<br />
File Timestamp<br />
Errorlog 12/9/2009<br />
Errorlog.1 12/8/2009<br />
Errorlog.2 12/7/2009<br />
Errorlog.3 12/6/2009<br />
Errorlog.4 12/5/2009<br />
Errorlog.5 12/4/2009<br />
Errorlog.6 12/3/2009<br />
<br />
The files deleted will be : Errorlog.3, Errorlog.4, Errorlog.5 and Errorlog.6 (assuming the script runs on a day when only those four files are more than 90 days old).<br />
<br />
You need to change the path in the sFolder variable to match your instance ..<br />
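If you prefer something other than VBScript, the same DateDiff logic ports directly. Here is a minimal Python sketch of the same 90-day cleanup (the path, threshold, and file names mirror the script above; the function name is my own):

```python
import os
import time

# Path and threshold mirror the VBScript above; adjust for your instance.
LOG_DIR = r"C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG"
MAX_AGE_DAYS = 90
# Only the archived logs are candidates; the active ERRORLOG is never touched.
TARGETS = {"ERRORLOG.%d" % i for i in range(1, 7)}

def old_archived_logs(folder, max_age_days, now=None):
    """Return names of archived errorlogs last modified more than max_age_days ago."""
    now = now if now is not None else time.time()
    cutoff_seconds = max_age_days * 86400
    stale = []
    for name in os.listdir(folder):
        if name.upper() in TARGETS:
            mtime = os.path.getmtime(os.path.join(folder, name))
            if now - mtime > cutoff_seconds:
                stale.append(name)
    return sorted(stale)

if __name__ == "__main__" and os.path.isdir(LOG_DIR):
    for name in old_archived_logs(LOG_DIR, MAX_AGE_DAYS):
        print("Deleting", name)
        os.remove(os.path.join(LOG_DIR, name))
```

Like the VBScript, this would need to be scheduled (Task Scheduler or a SQL Agent job) to run periodically.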
Happy learning ...<br />
<br />
AbhayUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-61648185170069743072010-09-28T16:23:00.000+05:302010-09-28T16:23:59.386+05:30checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Native Client][SQL Server]Login failed for user 'XXXXX\clusterservicea little Background :<br />
As per our security guidelines, the BUILTIN\Administrators login should be removed from all SQL Server instances. It was implemented on all the SQL Server instances, including those on MSCS (Windows Cluster).<br />
After that, the nodes were rebooted due to patching requirements. The nodes came up, but SQL Server did not :D ...<br />
<br />
Error in cluster logs (you will not find it in SQL Server logs) :<br />
<br />
ERR SQL Server <SQL Server>: [sqsrvres] checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Native Client][SQL Server]Login failed for user 'XXXXX\clusterservice'.<br />
ERR SQL Server <SQL Server>: [sqsrvres] ODBC sqldriverconnect failed <br />
<br />
The error was clear. The cluster service account XXXXX\clusterservice had no login of its own; it had been logging in to SQL Server via a LOGIN it was a member of ... and that login is BUILTIN\Administrators.<br />
But why does it need to log in to SQL Server at all? Because it needs to run the IsAlive check to make sure that SQL Server is up and running. It also runs the LooksAlive check (a function), but that one does not need to query SQL Server. The IsAlive check runs select @@servername and waits for the return message through the ODBC client (in our case, SQL Server Native Client). Thus the IsAlive check was not able to create a trusted connection, and we lost access to the virtual server.<br />
<br />
So, in a SQL Server 2005/2008 failover cluster installation, the cluster service account relies on membership in the BUILTIN\Administrators group to log on to SQL Server 2005/2008 to run the IsAlive check. If you remove the BUILTIN\Administrators group from a failover cluster, you must explicitly grant the MSCS service account permission to log on to the SQL Server 2005 failover cluster.<br />
<br />
The SQL Server 2005 resource starts an instance of the Sqlcmd.exe utility under the security context of the MSCS service account. Then, the SQL Server 2005 resource runs a SQL script over a dedicated administrator connection (DAC) that samples various dynamic management views (DMVs). Because a DAC connection is used to collect some diagnostic data, the clustering service account must be provisioned in the SYSADMIN fixed server role. If someone later objects that the clustering service account cannot be provisioned in the SYSADMIN fixed server role, we can try creating a login for the cluster service account that is not given the SYSADMIN fixed server role. I have not tested it yet, so I cannot confirm whether this will work or not ...<br />
<br />
Commands :<br />
CREATE LOGIN [&lt;Domain Name&gt;\&lt;MSCS Service Account&gt;] FROM WINDOWS WITH DEFAULT_DATABASE=[master]<br />
EXEC master.dbo.sp_addsrvrolemember @loginame = N'&lt;Domain Name&gt;\&lt;MSCS Service Account&gt;', @rolename = N'sysadmin'<br />
<br />
Happy learning .....<br />
Regards<br />
AbhayUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-38030090410317623612010-09-20T20:37:00.000+05:302010-09-20T20:37:28.138+05:30Msg 22004, Level 16, State 1, Line 0 :Failed to open loopback connection. Please see event log for more information.Failed to open loopback connection. Please see event log for more information.I am back with one more solution :). The problem was simple but the error got me thinking .. I was trying to run xp_readerrorlog on a small file, but my SPID hung .. after some time I got this error :<br />
<br />
<b>SQL Server error in QA :</b><br />
Msg 22004, Level 16, State 1, Line 0<br />
Failed to open loopback connection. Please see event log for more information.<br />
Msg 22004, Level 16, State 1, Line 0<br />
error log location not found<br />
<br />
I read somewhere that this error comes when SQL Server Agent fails to come up. Yes, my agent was down. But I failed to understand the relation between running xp_readerrorlog and SQL Agent not running. Still, I tried to start the agent and got the error ... So something here was related to SQL Agent, and the working theory was that xp_readerrorlog cannot run successfully while SQL Agent is down (I will prove this wrong later).<br />
<br />
<b>I checked the application logs immediately and got these errors :<br />
</b><br />
Event Type: Error<br />
Event Source: MSSQLSERVER<br />
Event Category: (2)<br />
Event ID: 17052<br />
Date: 09/20/2010<br />
Time: 18:24:22<br />
User: N/A<br />
Computer: abchaudh<br />
Description:<br />
Severity: 16 Error:10061, OS: 10061 [Microsoft][SQL Native Client]TCP Provider: No connection could be made because the target machine actively refused it.<br />
<br />
<br />
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.<br />
Data:<br />
0000: 4d 27 00 00 0a 00 00 00 M'......<br />
0008: 12 00 00 00 61 00 62 00 ....a.b.<br />
0010: 63 00 68 00 61 00 75 00 c.h.a.u.<br />
0018: 64 00 68 00 00 00 0e 00 d.h.....<br />
0020: 00 00 6d 00 61 00 73 00 ..m.a.s.<br />
0028: 74 00 65 00 72 00 00 00 t.e.r...<br />
<br />
<br />
Event Type: Error<br />
Event Source: SQLAgent$CORRUPT<br />
Event Category: Service Control <br />
Event ID: 103<br />
Date: 09/20/2010<br />
Time: 18:24:45<br />
User: N/A<br />
Computer: abchaudh<br />
Description:<br />
SQLServerAgent could not be started (reason: Unable to connect to server 'abchaudh\CORRUPT'; SQLServerAgent cannot start).<br />
<br />
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.<br />
<br />
Event Type: Information<br />
Event Source: SQLAgent$CORRUPT<br />
Event Category: Service Control <br />
Event ID: 102<br />
Date: 09/20/2010<br />
Time: 18:24:52<br />
User: N/A<br />
Computer: abchaudh<br />
Description:<br />
SQLServerAgent service successfully stopped.<br />
<br />
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.<br />
<br />
I also checked the SQL Agent logs :<br />
<br />
<b>SQL Agent logs :<br />
</b><br />
2010-09-20 18:25:36 - ! [298] SQLServer Error: 10061, TCP Provider: No connection could be made because the target machine actively refused it. [SQLSTATE 08001] <br />
2010-09-20 18:25:36 - ! [165] ODBC Error: 0, Login timeout expired [SQLSTATE HYT00] <br />
2010-09-20 18:25:36 - ! [298] SQLServer Error: 10061, An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections. [SQLSTATE 08001] <br />
2010-09-20 18:25:36 - ! [000] Unable to connect to server 'abchaudh\CORRUPT'; SQLServerAgent cannot start<br />
2010-09-20 18:25:42 - ! [298] SQLServer Error: 10061, TCP Provider: No connection could be made because the target machine actively refused it. [SQLSTATE 08001] <br />
2010-09-20 18:25:42 - ! [165] ODBC Error: 0, Login timeout expired [SQLSTATE HYT00] <br />
2010-09-20 18:25:42 - ! [298] SQLServer Error: 10061, An error has occurred while establishing a connection to the server. When connecting to SQL Server 2005, this failure may be caused by the fact that under the default settings SQL Server does not allow remote connections. [SQLSTATE 08001] <br />
2010-09-20 18:25:42 - ! [382] Logon to server 'abchaudh\CORRUPT' failed (DisableAgentXPs)<br />
2010-09-20 18:25:43 - ? [098] SQLServerAgent terminated (normally)<br />
<br />
So, now I have 2 issues : SQL Agent is not running and xp_readerrorlog is timing out.<br />
<br />
If you go through one of my older posts on "target machine actively refused it", you will find some background information .<br />
<br />
So I opened CLICONFG and found 3 incorrect aliases which were not pointing to the right port ..<br />
<br />
I removed them and SQL Agent came online ... xp_readerrorlog also started working ...<br />
<br />
I stopped SQL Agent and still everything was working ...<br />
So the issue was that xp_readerrorlog tries to connect to SQL Server but gets stuck due to an alias pointing to an incorrect port.<br />
<br />
But this does not affect the SQL Server service. To check why it does not affect the SQL Server service, I disabled the shared memory protocol and BANG ... the SQL Server connection failed ...<br />
<br />
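The "actively refused it" error (10061, WSAECONNREFUSED) simply means a TCP connection attempt reached the machine but nothing was listening on that port, which is exactly what a stale alias pointing at the wrong port produces. You can reproduce the condition outside SQL Server with a quick sketch like the one below (Python; `probe` is my own name, not a SQL Server tool):

```python
import socket

def probe(host, port, timeout=2.0):
    """Attempt a TCP connect and report 'open' or 'refused' (the 10061 case)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # Nothing is listening on that port -- the same condition the SQL
        # client stack reports as error 10061 on Windows.
        return "refused"
    finally:
        s.close()
```

Pointing probe() at the port an alias claims, instead of the port the instance actually listens on, returns "refused" -- which is what the agent and xp_readerrorlog were running into here.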
Happy learning ....Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-8436612531636428647.post-57984235528327976962010-08-25T14:48:00.000+05:302010-08-25T14:55:37.368+05:30Could not load the DLL xpstar90.dll, or one of the DLLs it references. Reason: 126(The specified module could not be found.)Hi Team ,<br />This issue was faced by someone outside IBM but my main intention is to explain the benefit of another nice tool : Dependency Walker (http://www.dependencywalker.com/)<br /><br /><strong>Issue :</strong><br />SQL Server Agent failed to come up after the service account password was reset at AD level .<br /><br /><strong>Error(s) :</strong><br />In the event log you will see these errors in sequence :<br /><br /><strong>Description:</strong><br />Could not load the DLL xpstar90.dll, or one of the DLLs it references. Reason: 126(The specified module could not be found.).<br /><br /><strong>Description:</strong><br />Failed to retrieve SQLPath for syssubsystems population.<br /><br /><strong>Description:</strong><br />SQLServerAgent could not be started (reason: Failed to load any subsystems. 
Check errorlog for details.).<br /><br />The first error is the main one; the rest follow from it and we need not think about them .<br /><br /><strong>Troubleshooting and Resolution :</strong><br />The error clearly says that there is a problem either with xpstar90.dll or with one of the DLLs that it references .<br />This file is located in the instance's Binn folder. I first tried to re-register xpstar90.dll by using regsvr32 xpstar90.dll and got this message :<br /><br /><strong>xpstar90.dll was loaded, but the DllRegisterServer entry point was not found.</strong><br /><br />I have heard that sometimes there is a different way of registering some DLLs, so from this error alone I did not conclude that this file is corrupt.<br />I was also thinking that there might be some other DLL that this DLL refers to, which got corrupted.<br /><br />I decided to see the dependency tree of xpstar90.dll in Dependency Walker. I opened C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn\xpstar90.dll in it and got this output.<br /><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhesgqcZf7y5E8Abvc-NwDOYMtnHCWBkRY_twsi4PkDTSN3JXZf3TR6KiHjvNGRLvrjPSuHhYnofiAGLf7R39nOjpu3AHI3L3cQvWeGW4aBsHTD3KNGylVvVWhpeff9HFpCwj2IxrFnRxKg/s1600/untitled.bmp"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 225px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhesgqcZf7y5E8Abvc-NwDOYMtnHCWBkRY_twsi4PkDTSN3JXZf3TR6KiHjvNGRLvrjPSuHhYnofiAGLf7R39nOjpu3AHI3L3cQvWeGW4aBsHTD3KNGylVvVWhpeff9HFpCwj2IxrFnRxKg/s320/untitled.bmp" border="0" alt=""id="BLOGGER_PHOTO_ID_5509275789513953634" /></a><br /><br />So, in this case xpstar90.dll itself was corrupt. I found its version (2005.90.4035.0) and replaced it with the same file from another instance .<br />SQL Agent came online .<br />In case it does not, then we need to uninstall Native Client from Add/Remove Programs and
reinstall it .<br /><br />Happy learning<br />AbhayUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-1992029524986645062010-08-02T16:40:00.001+05:302010-08-02T16:56:35.479+05:30Error 1117 :The request could not be performed because of an I/O device error.Our backups were failing under these conditions :<br /><br />Scenario 1: The system databases plus a few user databases are on local disk &amp; a few user databases are on LUNs.<br /><br />Scenario 2: The system &amp; user databases are completely on LUNs<br /><br />The backups would run for a good amount of time but then used to fail with Error 1117. I know that taking backups over the network is not supported, but I was breaking my head on this error (1117) to know the reason behind it. After going through a few tests on my machine using external HDDs, my understanding of this error is :<br /><br /><br />-&gt; Error 1117 is ERROR_IO_DEVICE. That's fine. But I was curious about the situations under which this error might occur and the exact meaning of this error. Does ERROR_IO_DEVICE mean that the hardware is corrupt ?
I found that this error occurs under the situations below, and then found the reasons behind those situations as well :<br /><br />STATUS_FT_MISSING_MEMBER <br />ERROR_IO_DEVICE<br /><br />An attempt was made to explicitly access the secondary copy of information via a device control to the fault tolerance driver and the secondary copy is not present in the system.<br /><br /><br />STATUS_FT_ORPHANING <br />ERROR_IO_DEVICE<br />{FT Orphaning} A disk that is part of a fault-tolerant volume can no longer be accessed.<br /><br /><br />STATUS_DATA_OVERRUN <br />ERROR_IO_DEVICE<br />{Data Overrun} A data overrun error occurred.<br /><br />STATUS_DATA_LATE_ERROR <br />ERROR_IO_DEVICE<br />{Data Late} A data late error occurred.<br /><br /><br />STATUS_IO_DEVICE_ERROR <br />ERROR_IO_DEVICE<br />The I/O device reported an I/O error.<br /><br />STATUS_DEVICE_PROTOCOL_ERROR <br />ERROR_IO_DEVICE<br />A protocol error was detected between the driver and the device.<br /><br /><br />STATUS_DRIVER_INTERNAL_ERROR <br />ERROR_IO_DEVICE<br />An error was detected between two drivers or within an I/O driver.<br /><br /><br />So this mapping says that the error will be thrown if any of these conditions is met. In my situation we were falling into STATUS_DATA_LATE_ERROR, since we were also getting these entries in the SQL Server errorlogs : "x I/O requests are pending for more than 15 secs ............filename.mdf"<br /><br />If you are running backup jobs you might also get the error -1073548784 .<br />This is a common error and may come when the query you are running remotely is incorrect, or the table you are trying to drop does not exist. Try to export a table that already exists in another DB and you will recreate this OLEDB error, so we need not worry about finding the message identifier for this number .<br /><br /><br />Action plan :<br />-----------------<br />--Try to take a backup of another database that is located remotely and is of about the same size .
I mean around 20GB.<br /><br />--Run chkdsk on this drive, or ask someone to do that, and see if any consistency errors come up .<br /><br />--Create a similar database on another external drive like this one and take the backup .<br /><br /><br />Conclusion :<br />---------------<br />I am quite certain that the issue is with the drive and/or the network. The 15-sec I/O delay messages in the errorlogs also suggest the same. But as you can see, this error also comes when data gets late in reaching the destination (STATUS_DATA_LATE_ERROR), so I suspect that the network might also be a bit slow and contributing to the backup failure .<br /><br />Now the ball is in your court how you explain this to the client :) .<br /><br />Happy LearningUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-59335779572570224262010-07-27T15:56:00.000+05:302010-07-29T14:16:57.511+05:30Msg 8914, Level 16, State 1, Line 1 -&gt; Incorrect PFS free space information for pageMsg 8914, Level 16, State 1, Line 1<br />Incorrect PFS free space information for page (1:61991) in object ID 1993058136, index ID 1, partition ID 72057594955366400, alloc unit ID 71906736119218176 (type LOB data).
Expected value 0_PCT_FULL, actual value 100_PCT_FULL.<br /><br /><br />This was the error we were getting on the Docs table of one of the SharePoint databases. The compatibility level was 80 and the build was 1399 (2005 RTM).<br /><br />I tried a lot of things on it, like :<br /> -&gt; I rebuilt the clustered index with and without the LOB_COMPACTION option .<br /> -&gt; DBCC PAGE showed its fill factor as 100<br /> -&gt; Changed the fill factor to 100 explicitly<br /> -&gt; Ran dbcc updateusage<br /> -&gt; Changed the compatibility level to 90<br /> -&gt; Changed the fill factor to 99, 50, etc. <br />Nothing helped. The profiler did not show much (my intention was to know what checkdb is doing internally).<br /><br />Finally, I took a backup of the database and restored it as a test database. It did not give any errors, which means that this is not actually corruption .<br /><br />On the restored database I ran DBCC CHECKDB with repair_allow_data_loss .<br />It fixed the issue without harming the data. Finally, I ran the same on the SharePoint database and it resolved the issue .<br /><br />Hope this gives you the confidence to run repair_allow_data_loss for this issue .<br />But remember : almost every time you run it with repair_allow_data_loss, you will end up losing data. So be careful .<br /><br />This situation was AN EXCEPTION, and you can safely use this option of checkdb.<br /><br />Root cause :<br />Microsoft says that <br />the engine (just like the OS does when giving pages to processes) pre-allocates a set of data pages (say X) to the SPID which needs them and marks them as 100% full in the PFS, assuming that those pages will eventually get filled very soon. It does this to avoid frequently updating the PFS page, improving performance. But later, when the SPID completes its work in fewer pages (say X-Y), the remaining pages are released. However, the remaining pages should be marked again as empty (0_PCT_FULL), which it does not do, and hence DBCC CHECKDB reports those errors (SQL 2000
silently use to fix it ).Repair_allow_data_loss will fix it with no data loss actually.<br /><br />Regards<br />AbhayUnknownnoreply@blogger.com3tag:blogger.com,1999:blog-8436612531636428647.post-11749568054271421192010-07-26T17:28:00.000+05:302010-07-26T17:30:08.260+05:30Finding the last date when the LOG/FULL/DIFF/FILEGROUP backup was taken for all the databasesHi Guys , <br />While creating a few scripts , a requirement came where I had to find the last backups (all types) taken for all the databases (except tempdb) .<br />Please find the script below .Hope it helps you in your daily activities .If you want to automate it for all the instances in your environment , please let me know and I can send you some more files. <br /><br /><br /><br />/*<br />Script : Last_bckp.sql<br />Author : Abhay Chaudhary, <br />Date : 26th JUL, 2010<br />Purpose : Collecting SQL Server 2000/2005/2008 last backup taken information.<br />Requirements : Do a CTRL+F and change the <DBNAME> to the DB where you want to <br /> create the object.<br />Suggestions : hi_abhay78@yahoo.co.in<br />Version : 1.0<br />*/<br /><br /><br />USE <DBNAME><br />set nocount on<br />if not exists (select * from <dbname>..sysobjects where name ='bckp_types' and type ='S')<br />begin <br />create table <dbname>..bckp_types (num int identity(1,1),type varchar(1),bkp_name varchar(20))<br />insert into <dbname>..bckp_types (type,bkp_name) values ('D','Full backup')<br />insert into <dbname>..bckp_types (type,bkp_name) values ('L','Log Backup')<br />insert into <dbname>..bckp_types (type,bkp_name) values ('F','Filegroup backup')<br />insert into <dbname>..bckp_types (type,bkp_name) values ('I','Differential backup')<br />end <br />go<br /><br />Declare @loop int<br />select @loop= max(num) from bckp_types<br />While (@loop !=0)<br />begin <br />Select 'last ' + bkp_name +' taken details.' 
from bckp_types where num=@loop<br />declare @bk_type varchar(1)<br />select @bk_type = type from bckp_types where num=@loop<br /><br />SELECT s.name 'database Name',<br /> b.backup_finish_date 'last backup date',<br /> bmf.physical_device_name 'location of backup'<br /> FROM master..sysdatabases s LEFT OUTER JOIN msdb..backupset b ON s.name = b.database_name<br /> INNER JOIN msdb..backupmediafamily bmf ON b.media_set_id = bmf.media_set_id<br /> WHERE s.name <> 'tempdb'<br /> AND b.backup_finish_date = (SELECT MAX(backup_finish_date)<br /> FROM msdb..backupset<br /> WHERE database_name = b.database_name<br /> AND type = @bk_type) <br /> ORDER BY s.name<br /><br />set @loop=@loop-1<br />end<br />go<br />Drop table <dbname>..bckp_types<br /><br /><br />Happy Learning ...<br />AbhayUnknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-58928657801067062192010-07-16T16:39:00.000+05:302010-07-16T16:51:11.390+05:30SQLServer Error: 848, SQL Network Interfaces: The system detected a possible attempt to compromise security.We faced a strange but simple issue yesterday and as usual I would like to share it with you .<br /><br /><br /><br />Situation :<br />-------------<br />SQL server 2005 SP2<br />Windows Server 2003 SP2<br />Cluster : Yes 2 node A-P cluster<br /><br />Service account of SQL Server Agent service and SQL Server service were same .SQL Server is Clustered .<br /><br />While SQL Sevrer as well as agent were running fine the account under both the services are running ,got locked(we came to know this later as a rootcause of this issue).Still ,everything was fine and there was no issue since the account got locked after SQL Server and agent were started.<br /><br />Then we found that all the jobs that were scheduled stopped working .In the job history we found that there is no JOB HISTORY created since the jobs stopped working .But there was not a single failure of the jobs .<br /><br />Which means that the jobs were not scheduled by the 
Job scheduler &gt;&gt; to the threads &gt;&gt; to the SPIDs. So, we manually executed the jobs and all of them completed successfully. But again, there was no history being created and those jobs were not doing anything. For example, the backup job was running successfully when we ran it explicitly, but no backups were taken .<br /><br />To drill down further, we ran the commands under the jobs in QA and those were running fine. We created new jobs and there was no change at all in the situation .<br /><br />Then we checked the SQL Agent logs and found this :<br /><br />[298] SQLServer Error: 848, SQL Network Interfaces: The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you. [SQLSTATE HY000] <br />[298] SQLServer Error: 848, Cannot generate SSPI context [SQLSTATE HY000] <br />[382] Logon to server '(local)' failed (ConnAttemptCachableOp)<br /><br />This was strange to us: why was no connectivity error displayed when we explicitly executed the job, which completed successfully while doing nothing?<br />But since it was a connectivity error from the agent, we decided to run the jobs after logging on to the server using the account under which SQL Server and the agent run.<br /><br />We then found out that the account under which SQL Server and the agent were still running was locked.<br /><br />Once the account got unlocked in AD, the jobs ACTUALLY started working . 
<br /><br />To me it looks like a bug in design and i have logged it on the CONNECT :<br />https://connect.microsoft.com/SQLServer/feedback/details/575388/strange-behaviour-in-sql-agent-job-on-cluster-where-the-job-runs-but-does-not-do-anything<br /><br />hope it helos you to resolve your issue .Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-32141235324951039582010-06-19T20:32:00.000+05:302010-06-19T20:45:36.401+05:30Data auditing in SQL ServerThere are 2 ways we can audit the SQL Server events to the tracefile (people call it audit log file).<br /><br />- setting sp_configure parameter 'c2 audit mode' to 1.This will automatically capture all the audit events for all the databases and all the columns . You cannot modify it .Even if you try to , it will <br /> not take the changes made manually .<br /><br />- Creating our own trace for selected events and columns .Please check BOL for it .<br /><br />In case you want to go through the second option and that is to create our own trace please see the demo below:<br /><br />Step 1 <br />In this step we are creating test_yasir trace in C: drive.Then we are setting the Events and columns adn settin gthem to ON .I have choosen a few events and columns .<br /><br />declare @TraceIdOut int<br />exec sp_trace_create @TraceIdOut OUTPUT,6, N'c:\test_Yasir'<br />PRINT @TraceIdOut<br /><br />declare @On bit<br />SET @On = 1<br />exec sp_trace_setevent @TraceIdOut, 14, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 9, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 9, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 6, @On<br />exec 
sp_trace_setevent @TraceIdOut, 20, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 9, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 10, @On<br /><br /><br />exec sp_trace_setevent @TraceIdOut, 104, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 35, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 8, @On<br /><br />exec sp_trace_setevent @TraceIdOut, 107, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 35, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 7, @On<br />exec 
sp_trace_setevent @TraceIdOut, 107, 8, @On<br /><br />exec sp_trace_setevent @TraceIdOut, 106, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 35, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 8, @On<br /><br /><br />exec sp_trace_setevent @TraceIdOut, 105, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 35, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 8, @On<br /><br /><br />Step 2 :<br /><br />In this step we wil apply the filter since you said you need to audit only a user 
database.We will achieve it using sp_trace_setfilter<br /><br />sp_trace_setfilter 3,3,0,0,1<br /><br />In this example I have set the filter on databaseid (to 1 which is master) in traceid 3 <br /><br />Step 3:<br />In this step we will first confirm if our trace is showing up in the metadata.Do a select * from sys.traces and check the trace you created as well its trace id .<br />Then start the trace (which is 1) using sp_trace_setstatus<br /><br />example :<br />sp_trace_setstatus 3,1<br /><br />Here traceid is 3 and staus is 1 <br />Further , if you want to add or remove any event use sp_trace_setevent after stopping the trace using sp_trace_setstatus<br /><br /><br />But in this method there is a problem .The problem is that , if you restart the instance the trace metadata will be washed from the sys.traces DMV.<br />So you will have to manually run it again .Further the physical trace file (log file) still exist.So you will get the error while creaing the trace .To over come this :<br /><br />1) I have added the datetime in the file name .So it will create a unique file each minute.<br />2) I have encapsulated the query into an SP and pinned it to SQL Server startup.<br /><br />So now <br /><br />step 1 would be <br /><br />create proc audit_trace as<br />declare @TraceIdOut int<br />Declare @D1 nvarchar(30)<br />Declare @D2 nvarchar(30)<br />Declare @D3 nvarchar(30)<br />Declare @D4 nvarchar(30)<br />Declare @D5 nvarchar(30)<br />Declare @trace_name nvarchar(256)<br /><br />SELECT @D1=DATENAME(Day, GETDATE())<br />SELECT @D2=DATENAME(month, GETDATE())<br />SELECT @D3=DATENAME(year, GETDATE())<br />SELECT @D4=DATENAME(hour, GETDATE())<br />SELECT @D5=DATENAME(minute, GETDATE())<br /><br />set @trace_name='c:\trace_'+@d1+'_'+@d2+'_'+@d3+'_'+@d4+'_'+@d5+'_'<br />print @trace_name<br /><br />--set @trace_name = 'c:\trace_'+@trace_date+'.trc'<br />--print @trace_name<br />exec sp_trace_create @TraceIdOut OUTPUT,6, @trace_name<br />PRINT @TraceIdOut<br /><br 
/>declare @On bit<br />SET @On = 1<br />-- 14 = Audit Login, 15 = Audit Logout, 20 = Audit Login Failed<br />exec sp_trace_setevent @TraceIdOut, 14, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 9, @On<br />exec sp_trace_setevent @TraceIdOut, 14, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 9, @On<br />exec sp_trace_setevent @TraceIdOut, 15, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 9, @On<br />exec sp_trace_setevent @TraceIdOut, 20, 10, @On<br /><br /><br />-- 104 = Audit Addlogin Event<br />exec sp_trace_setevent @TraceIdOut, 104, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 104, 35, @On<br /><br />-- 107 = Audit Login Change Password Event<br />exec sp_trace_setevent @TraceIdOut, 107, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 7, @On<br />exec sp_trace_setevent @TraceIdOut,
107, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 107, 35, @On<br /><br />-- 106 = Audit Login Change Property Event<br />exec sp_trace_setevent @TraceIdOut, 106, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 10, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 106, 35, @On<br /><br /><br />-- 105 = Audit Login GDR Event (grant/deny/revoke server access)<br />exec sp_trace_setevent @TraceIdOut, 105, 1, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 3, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 6, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 7, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 8, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 10,
@On<br />exec sp_trace_setevent @TraceIdOut, 105, 11, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 14, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 22, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 26, @On<br />exec sp_trace_setevent @TraceIdOut, 105, 35, @On<br />GO<br /><br />/*adding the SP to execute at SQL Server startup (run this in a separate batch, after the proc is created) */<br />exec sp_procoption N'audit_trace', 'startup', 'on'<br /><br />Steps 2 and 3 will be the same as mentioned at the beginning.<br /><br />Disadvantage<br />-------------<br />Simple: it is resource consuming. Do not add a lot of columns to the trace; capture specifically what you want to audit.<br />It entirely depends on which columns you are auditing.<br />You need to keep the instance in a testing phase and monitor the resource consumption caused by tracing.<br />Make clear to the client that we need fast disks, more/faster CPUs, good I/O processing capabilities and enough RAM in case they need to do extensive auditing (if there are performance issues).Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-8436612531636428647.post-14608789419781482572010-06-19T20:21:00.000+05:302010-06-19T20:24:11.274+05:30Using WMI and SQL Agent to fire low memory threshold alert ...This will work perfectly. The only thing I wanted to add to the table was that, when the alert fires, it should also fill a column with Available MBytes so that you know how much memory was available. But after trying it for 2 days, I realized that the class through which I am checking the other class (Perfmon >> Memory >> Available MBytes) does not have a column for this. I am using the "__InstanceModificationEvent" class. Maybe it is because of this that the alert fires but the job fails when it inserts the AvailableMBytes value, because this column is not in the
__InstanceModificationEvent class; the error number also suggests that.<br /><br />By the way, as written this one will only alert if your available memory is > 256 every 10 seconds; this is because I wanted to test it. You need to modify it to < 256 and every 300 seconds, so that you get the alert every 5 minutes or whatever interval you decide.<br /><br /><br />/*******************************************************************************************<br />* This script will create an Alert to Monitor Physical RAM reaching a low threshold.<br />* The alert will run a job and the job will enter data in a table.<br />*******************************************************************************************/<br /><br />/* Step 1: creating the table to capture the Event information */<br /><br />USE Master<br />GO<br /><br />IF EXISTS (SELECT * FROM dbo.sysobjects WHERE id = OBJECT_ID(N'[dbo].[memory]') AND OBJECTPROPERTY(id, N'IsUserTable') = 1)<br />DROP TABLE [dbo].[memory]<br />GO<br /><br />CREATE TABLE [dbo].[memory] (<br />[PostTime] [datetime] NOT NULL default (getdate()) ,<br />[computerName] sql_variant Not Null ,<br />[RecordID] [int] IDENTITY (1,1) NOT FOR REPLICATION NOT NULL,<br />[Flag] [int] NOT NULL CONSTRAINT [DF_MEMORY_Flag] DEFAULT ((0))<br />) ON [PRIMARY]<br />GO<br /><br />CREATE INDEX [Memory_IDX01] ON [dbo].[memory]([recordid]) WITH FILLFACTOR = 100 ON [PRIMARY]<br />GO<br /><br />/*Step 2 : Creating the Job that will enter values into the table we created above*/<br />/*Service account and sql operator option are optional*/<br /><br />USE [msdb]<br />GO<br /><br />IF EXISTS (SELECT job_id FROM msdb.dbo.sysjobs_view WHERE name = N'Capture Memory Event')<br />EXEC msdb.dbo.sp_delete_job @job_name = N'Capture Memory Event', @delete_unused_schedule=1<br /><br />GO<br />--DECLARE @ServiceAccount varchar(128)<br />--SET @ServiceAccount = N'<job_owner_account>'<br />--DECLARE @SQLOperator varchar(128)<br />--SET @SQLOperator = N'<sql_agent_operator>'<br /><br />BEGIN
TRANSACTION<br />DECLARE @ReturnCode INT<br />SELECT @ReturnCode = 0<br /><br />IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)<br />BEGIN<br />EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'<br />IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback<br />END<br /><br />DECLARE @jobId BINARY(16)<br />EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'Capture Memory Event', <br />@enabled=1, <br />@notify_level_eventlog=2, <br />@notify_level_email=3, <br />@notify_level_netsend=0, <br />@notify_level_page=0, <br />@delete_level=0, <br />@description=N'Job for responding to memory events', <br />@category_name=N'[Uncategorized (Local)]', <br />--@owner_login_name=@ServiceAccount, <br />--@notify_email_operator_name=@SQLOperator, <br />@job_id = @jobId OUTPUT<br /><br />IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback<br /><br />/*Step 3: Add the job step that inserts data into the memory table*/<br /><br /><br />EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Insert data into memory table', <br />@step_id=1, <br />@cmdexec_success_code=0, <br />@on_success_action=1, <br />@on_success_step_id=0, <br />@on_fail_action=2, <br />@on_fail_step_id=0, <br />@retry_attempts=0, <br />@retry_interval=0, <br />@os_run_priority=0, @subsystem=N'TSQL', <br />@command=N'<br />declare @server sql_variant<br />select @server = serverproperty (''machinename'')<br /><br />INSERT INTO memory (<br />PostTime, <br />Computername<br />)<br /><br />VALUES (<br />GETDATE(), <br />@server)<br />', <br />@database_name=N'master', <br />@flags=0<br /><br />IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback<br />EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1<br />IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback<br />EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'<br
/>IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback<br />COMMIT TRANSACTION<br /><br />GOTO EndSave<br /><br />QuitWithRollback:<br />IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION<br />EndSave:<br />GO<br /><br />/*Creating the alert and associating it with the Job to be fired */<br /><br />USE [msdb]<br />GO<br /><br />IF EXISTS (SELECT name FROM msdb.dbo.sysalerts WHERE name = N'Respond to memory_event')<br />EXEC msdb.dbo.sp_delete_alert @name=N'Respond to memory_event'<br /><br />GO<br /><br />DECLARE @server_namespace varchar(255)<br />SET @server_namespace = N'\\.\root\Cimv2\'<br /><br />EXEC msdb.dbo.sp_add_alert @name=N'Respond to memory_event', <br /> @message_id=0, <br /> @severity=0, <br /> @enabled=1, <br /> @delay_between_responses=0, <br /> @include_event_description_in=0, <br /> @category_name=N'[Uncategorized]', <br /> @wmi_namespace=N'\\.\root\Cimv2', <br /> @wmi_query=N'SELECT * FROM __InstanceModificationEvent WITHIN 10 WHERE TargetInstance ISA ''Win32_PerfFormattedData_PerfOS_Memory'' AND TargetInstance.AvailableBytes > 256', <br /> @job_name='Capture Memory Event' ;<br /><br />--EXEC msdb.dbo.sp_add_notification @alert_name=N'Respond to memory_event', @operator_name=N'Test', @notification_method = 1<br />--GO<br /><br />--/* Step 5: Create a stored proc for sending the captured information as a .CSV file */<br /> <br />--Create proc [dbo].[Deadlock_rpt] <br />--as<br />--DECLARE @SQL varchar(2000)<br />--DECLARE @date varchar (2000)<br />--DECLARE @File varchar(1000)<br />--select @date= convert(date,GETDATE())<br />--SET @SQL = 'select * from [Create_user] where flag = 0'<br />--SET @File = '[Create_user] report'+@date+'.csv'<br /> <br />--EXECUTE msdb.dbo.sp_send_dbmail<br />--@profile_name = 'test',<br />--@recipients = 'your email.com',<br />--@subject = 'low memory threshold reached...',<br />--@body = '***URGENT***Attached please find the low memory threshold report',<br />--@query =@SQL ,<br
/>--@attach_query_result_as_file = 1,<br />--@query_attachment_filename = @file,<br />--@query_result_header = 1,<br />--@query_result_separator = ' ',<br />--@query_result_no_padding = 1,<br />--@query_result_width = 32767 <br /><br /><br />--/* Step 6: Changing the flag to 1 so that next time this information is not sent*/ <br />--update dbo.[Create_user] set flag = 1 where flag = 0<br />--go
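<br /><br />A quick way to verify the setup end to end (a sketch; the alert, job, and table names below are the ones created by the script above) is to check the alert's occurrence counters in msdb and the rows the job inserted:<br /><br />

```sql
-- Has the WMI alert 'Respond to memory_event' fired, and how often?
SELECT name, last_occurrence_date, last_occurrence_time, occurrence_count
FROM msdb.dbo.sysalerts
WHERE name = N'Respond to memory_event';

-- Latest rows written by the 'Capture Memory Event' job into the table from Step 1.
SELECT TOP (10) PostTime, computerName, Flag
FROM master.dbo.memory
ORDER BY RecordID DESC;
```

<br />If the alert shows occurrences but the table stays empty, look at the job history (msdb.dbo.sysjobhistory) for the insert step's error, as with the AvailableMBytes issue described above.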