Search This Blog

Friday, July 22, 2011

What does resourcedb contain ?

Many a times this question has been asked (either in interviews or just out of curiosity) that what does resourcedb contain and why is it so important to SQL Server ?

Normally we can't see it and hence can't use it .However, there is a way you can use resource database .But be careful. If you mess up anything you might end up paying a heavy cost.

Since we are discussing about this ,there is one more point that I would like to touch here .Starting from SQL 2005 there are no system tables but DMVs for us to query .However, if you query sys.objects and filter it on type ='S' ,you will notice a lot of system tables listed in the output .So there are system tables and we can see them .But if you try to query them ,you will get an annoying 208 error stating that the object does not exist which is not correct.So in this post we will see how to query resource database and also in a similar manner other system and user database .

Let us see how can we use resource database and also query system tables.Start SQL Server with -m switch ( in single user mode ) . There are 2 ways :

1) Through services console (after adding -m do not click on OK but click on start )

2) Through DOS prompt

Once SQL Server has been started in single user mode , we can make only one connection . We will connect to SQL Server using DAC .DAC option can only be used in sqlcmd utility and not in OSQL or ISQL .Again there are two ways to do this .But before trying to attempt for DAC connection make sure you have enabled remote admin connection option via sp_configure (you can see the run value of 1):

1) connecting to SQL Server with DAC (using SQLCMD) using -A option

2)Connecting via MGT Studio
Open Management studio , It will prompt you to enter the instance name . Just before the instance name add admin:

you might or might not get this error :

If you get this error then click OK ( this error window will go away ) and then instead of Connect click on cancel .
you will see a clean screen like below :

Click on new query

You will again see the same connection popup :

This time click on connect and it will work :) .A new query window will open even though you will not see the databases in the left hand side pane but the connection is there and working ...

Run the query 'use mssqlsystemresource' and press F5 :) ...It will work :

Also if you query sys.sysdbreg sytem table (an alternative for sys.sysdatabases DMV )you will see the resource database :

This database is currently in read-only mode (trust me :-) ) . If you want to cross check this run dbcc shrinkdatabase (mssqlsystemresource) and you will get to know .
You can set it to read_write mode though by running : alter database mssqlsystemresource set read_write.Now the very fact that this table is Read-Only and that we cannot take the backup of this database ,proves that this database might not contain very important information ...if you query the tables you will see that certain static information is stored which the engine might be using from time to time .Something like we store the values in a temp table or a variable .So coming back to the original question , resource database contain a lot of static information which the engine might need from time to time for its internal use .

At the end you might ask "Whats the need to touch the system tables in the database ?"
The answer is : We normally do not need to do this (especially Resource DB ) but there are other databases (system as well as user ) having some information we can use to resolve some issues (by updating those tables as needed ). And to resolve those issues , we need to login in this way ....

Hope you have found it interesting .But remember , BE VERY CAREFUL WHEN YOU TRY TO PLAY AROUND WITH SYSTEM TABLES (as i said in the beginning).
Happy Learning !!

Sunday, June 26, 2011

Replication :Archiving partitioned and non-Partitioned tables (Without removing the articles from the publications)

Recently there was a request on the MSDN forums where the poster wanted to archive the replicated partitioned tables in the publisher database . I think it would be good to share the solution with everyone in this forum as well .
In this post we will see :

Part 1) archiving the replicated non-partitioned (normal) tables .
Part 2) archiving the replicated partitioned tables .

At the end ,you will notice one nice to know feature of partitioned tables .

Part 1) Archiving the replicated non-partitioned (normal) tables .
Publisher : DB2Migration
Subscriber : DB2Migration_Sub
Replication Topology : Transactional Replication
Articles : dbo.Test
Other details : Both tables have 10000 rows each after first synchronization.

Now we need to archive the test table in publisher database but want to keep the Subscriber untouched i.e. the rows in subscriber should not change .For this example we will delete all the rows of the publisher table .

How should we do it ?
If I delete any row on publisher , the same will be replicated to subscriber . One way might be to stop the log reader agent and then delete the rows .After this , I can fire sp_repldone on the publisher and start the logreader agent .yes this is perfectly achievable . Here we go ...

DBCC opentran will show :
No active open transactions.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

We will enable the LogReader agent now and will see that there are no transactions to be replicated.

Select count * from both the tables will show 0( zero ) and 10000 rows respectively .

After this we will insert 1000 rows in publisher table ( we need to be careful as the tables have primary key).As a result the subscriber now, has 11000 rows and publisher has 1000 rows .This is going to be costly when there will be millions of rows because delete (or update or insert )is a logged activity.There is one more drawback and perhaps more critical.The log reader agent is one per database .So if there are more than one publications on the same database and we run sp_repldone , then we will hurt other subscriptions and publications .So we have to be careful. Other way is to truncate the table (after moving the data to an archived table) but replicated tables cannot be truncated (why ???.....simple , truncate is a non-logged activity and log-reader agent reads the log file to find the transactions marked for replication using sp_replcmds).So to truncate the table ,you need to remove the article from the publication .If you want to do that the steps are :

1) Stop the log reader agent and distribution agent
2) Drop the article(s) from the publication
3) Archive the table to another table ( this will be a logged activity ) by Bulk insert or BCP or import export wizard
4) Truncate the table 5) Add the article again
6) Change the properties of all the articles in the publication properties to "Keep existing object unchanged" for option action "if name is in use " .This is the most important step and please cross check it a few times to make sure that "Keep existing object unchanged " is set
6) generate the snapshot again ....
7) start the log reader agent and distribution agent and initiate the new snapshot ....
Trust me .you are done :) ..But don't you think its lengthy and a bit risky ..Now lets see something new ...

Part 2) Archiving the replicated partitioned tables .
Let us first create 2 new databases followed by creating partition functions followed by partition schemes followed by partitioned tables followed by inserting data in the tables .

--creating database and filegroups
create database test

--Adding new files to the filegroups
USE [master]
ALTER DATABASE test ADD FILE ( NAME = N'test2', FILENAME = N'C:\Program Files\Microsoft SQL Server\test2.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [second]
ALTER DATABASE test ADD FILE ( NAME = N'test3', FILENAME = N'C:\Program Files\Microsoft SQL Server\test3.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [third]
ALTER DATABASE test ADD FILE ( NAME = N'test4', FILENAME = N'C:\Program Files\Microsoft SQL Server\test4.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [forth]
ALTER DATABASE test ADD FILE ( NAME = N'test5', FILENAME = N'C:\Program Files\Microsoft SQL Server\test5.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [fifth]

--The following partition function will partition a table or index into four partitions.
USE test

-- Creating partition scheme
use test
IF NOT EXISTS (SELECT * FROM sys.partition_schemes WHERE name = N'PS_test')
create PARTITION SCHEME [PS_test] AS PARTITION [PF_test] TO ([second],[third],[forth],[fifth])
--[Note if you want to have one filegroup for all the files then : create PARTITION SCHEME [PS_test] AS PARTITION [PF_test] All TO ([secondary]) ]

--creating table with constraint and assigning a partition scheme to it
create table test (dummy [int] primary key constraint test_c check ([dummy] > 0 and [dummy] <=20000)) on ps_test (dummy) --inserting values
declare @val int
set @val=1000
while (@val > 0)
insert into test..test values (@val)
set @val=@val-1

On Subscriber we will only create the same filegroups and add files to them :
--creating database and filegroups
create database test_sub

--Adding new files to the filegroups
USE [master]
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub2', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub2.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [second]
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub3', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub3.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [third]
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub4', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub4.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [forth]
ALTER DATABASE test_sub ADD FILE ( NAME = N'test_sub5', FILENAME = N'C:\Program Files\Microsoft SQL Server\test_sub_sub5.ndf' , SIZE = 2048KB , FILEGROWTH = 1024KB ) TO FILEGROUP [fifth]

Once you are done create the publication on database TEST and the add the article TEST.Once Its done, the Test_Pub Publication is ready to publish.After this we will create the Subscription on this publication .Our subscriber database is test_sub.Once the initial snapshot is synchronized you will see the following values :

select OBJECT_ID('test..test')
select OBJECT_ID('test_sub..test')
select * from test.sys.partitions where object_id in (2105058535) order by partition_number
select * from test_sub.sys.partitions where object_id in (133575514) order by partition_number

So as of now everything is as per plan . The data is synchronized in the correct partitions .Now if we need to archive the publisher table we can try the same old approach that we used in stage 1 .However , we will try something new here .That something new is SWITCHING OF PARTITIONS in the table .I will not explain what does it means because you will see it in few seconds(or you can refer BOL) .

--Lets first create the archive table on publisher .Its the replica of the original test table
create table test..test_archive (dummy [int] primary key constraint test_c_a check ([dummy] > 0 and [dummy] <=20000)) on ps_test (dummy) --Switching the partitions from test to test_archive table on publication
ALTER TABLE test..test SWITCH PARTITION 1 TO test_archive Partition 1;

Msg 21867, Level 16, State 1, Procedure sp_MStran_altertable, Line 259
ALTER TABLE SWITCH statement failed. The table '[dbo].[test]' belongs to a publication which does not allow switching of partitions

Oopsss , What happened...Yes , that is true , we cannot switch the partitions for the replicated table , unless ..........we explicitly allow partition switching for the publisher :
sp_changepublication 'test_pub' ,@property='allow_partition_switch',@value='true'

You will get this message
The publication has updated successfully.

--Switching the partitions from test to test_archive table on publication. This is for partition number 1 . We have 4 partitions.
ALTER TABLE test..test SWITCH PARTITION 1 TO test_archive Partition 1;
ALTER TABLE test..test SWITCH PARTITION 2 TO test_archive Partition 2;
ALTER TABLE test..test SWITCH PARTITION 3 TO test_archive Partition 3;
ALTER TABLE test..test SWITCH PARTITION 4 TO test_archive Partition 4;

Now , just check the number of rows in the tables test and test_archive on publisher test and the test table on subscriber test_sub:

Thats the magic :)..Did you also notice that we did not create any partitions for the test_archive table .Lets query the syspartition DMV and notice the partitions in test_archive table .

Thats the beauty .You did not have to delete or truncate a single row.Nor did you remove the article or stopped any agent . Now if you add rows to table TEST it will be replicated to subscriber as usual .lets try this by inserting 1000 rows in table test in the publisher and then check the subscriber table
--inserting new values in test table ( in publisher database )
declare @val int
set @val=2000
while (@val > 1000)
insert into test..test values (@val)
set @val=@val-1

Suggestions are welcome as we are here to help each other technicaly grow.Happly learning

Saturday, June 18, 2011

ORA-00942: table or view does not exist .

The complete error is :
select xactts, subguid from MSrepl7 where pubsrv = ? and pubdb = ? and indagent = 0 and subtype = 0
ORA-00942: table or view does not exist

We were setting up heterogeneous replicaton between SQL and Oracle 9i.This was done successfully.But when we tried synchronize the Articles (actually we were replicating only a view)we got stuck at this error.

It was clear that the error is coming from the subscriber .But we were not replicating the object : MSrepl7 and we dont know whether this is a table or view .Since we were not replicating it , I was sure that this is a table or view that replication is creating .I saw a KB article which talks about this table for DB2 : .

Later I found from other oracle subscribers that MSrepl7 is nothing but a replica of msreplication_subscriptions table in SQL Server subscribers. This table is looked up and matched with msrepl_transactions .The columns that are compared are transaction_timestamp in MSrepl7 with xact_seqno in msrepl_transactions table .

moving forward we wanted to find why this table does not exist on the subscriber . I suspected that it should be created while Subscription setup or while we reinitialize subscription or while synchronizing .

To see if its really being created I enabled tracing on Distribution agent since its this job which is failing .
-Output C:\Temp\OUTPUTFILE.txt -Outputverboselevel 2.

Ran the Agent again which failed with same error .checked the file:

OLE DB Subscriber 'ERPDEV.WORLD': create table MSrepl7 (pubsrv varchar2 (128), pubdb varchar2 (128), publcn varchar2 (128), indagent number (1, 0),subtype number (10, 0), dstagent varchar2 (100),timecol date,descr varchar2 (255), xactts raw (16), updmode number (3, 0), agentid raw (16), subguid raw (16), subid raw (16), immsync number (1, 0))

Connecting to Distributor 'MCMSMESVS1.distribution'
[4/21/2010 1:32:46 AM]MCMSMESVS1.distribution: {call master..sp_MScheck_agent_instance(N'MCMSMESVS1-MES-ERPDEV.WORLD-1', 10)}
OLE DB Subscriber 'ERPDEV.WORLD': select xactts, subguid from MSrepl7 where pubsrv = ? and pubdb = ? and indagent = 0 and subtype = 0
Agent message code 20046. ORA-00942: table or view does not exist

[4/21/2010 1:32:46 AM]MCMSMESVS1.distribution: {call sp_MSadd_distribution_history(1, 6, ?, ?, 0, 0, 0.00, 0x01, 1, ?, -1, 0x01, 0x01)}
Adding alert to msdb..sysreplicationalerts: ErrorId = 12,

This means that the table is being created.But its not there .Strange.

To check whats happening ,I connected to Oracle server and tried to create a test table ....

Connected to:
Oracle9i Enterprise Edition Release - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release - Production

SQL> create table test(t int);
create table test(t int)
ERROR at line 1:
ORA-01031: insufficient privileges

Looks like when we start synchronizing it does not create sufficient objects at Oracle side due to permission issue .Sadly , SQL also does not throw any errors that we cannot create the object due to permission issue (might not have put the Try catch for this error).

So,I requested the Oracle DBA in charge of this to give the appropriate permissions to the login that is executing the distribution agent job and execute the Distributor job again .

BINGO ........the issue was resolved ...hope this blog helps someone someday ..


What should we do first : Try to find the solution or try to find the Problem

Had I been asked this question a few years ago , I would have said "I would search for a solution" .

Most of us do this i.e.we first try to find the solution .Sometimes we succeed but most of the times we do not.After many unsuccessful attempts I realized that the step to find the solution goes through another step first, and that is Finding the problem .Not going deep in to it .

A couple of months back one of my collegue came to me with a problem " There is a job that fails on every Monday" .This job takes some values from somewhere and inserts it in SQL Server tables.The error was :
Msg 241, Level 16, State 1, Line 2
Conversion failed when converting datetime from character string.

Earlier ,my collegue explained the client that this is not a SQL issue and suggested the poor client to touch base with the application team .But the client was not a fool .He said that there is some problem in SQL Server and he don't want to go to DEV without proof.He is not a techie though.

What Should I do , Google it or BING it :-) .we did not do that .
you can see this message in sysmessages . [select * from sys.messages where message_id=241]

We decided to reproduce the issue and with in 15 mins , we proved that the format in which Date is entered at the application level should be incorrect and datetime datatype is not recognizing it .

Repro of the issue

Repro 1
declare @date datetime ,@string char(100)
select @date =getdate()
set @string =@date

Command(s) completed successfully.

Repro 2
declare @date datetime ,@string char(100)
select @date ='28/07/2010'
set @string =@date

Msg 242, Level 16, State 3, Line 2
The conversion of a char data type to a datetime data type resulted in an out-of-range datetime value.

So we are near as this error is almost similar to 241 , but not the same.

Repro 3
declare @date datetime ,@string char(100)
select @date ='2010/07/28'
set @string =@date

Command(s) completed successfully.

Repro 4
declare @date datetime ,@string char(100)
select @date =NULL <-- assuming someone might be putting NULL in date and since NULL can be anything, it might not be a CHAR and we will get the error.
set @string =@date
select @string
Command(s) completed successfully. <-- We did not Repro 5

declare @date datetime ,@string char(100)
select @date ='NULL'
set @string =@date
select @string

Msg 241, Level 16, State 1, Line 2
Conversion failed when converting datetime from character string.
here we got the error .

Repro 6
declare @date datetime ,@string char(100)
select @date ='NULL'

This is more clear
Msg 241, Level 16, State 1, Line 2
Conversion failed when converting datetime from character string.

So the reason could be :
The application adds single quotes to any entry .For example NULL will be converted to 'NULL' and 2010/07/28 will be converted to '2010/07/28' . In this case 'NULL' will give 241 but the date will be absolutely correct and will not throw error when inserted in the table (inside SQL Server).

Conclusion :
Always try to look for the reason behind the error/issue first rather jumping for solutions here and there .It might take time but you will learn more .


Immediate Sync option in Transactional Replication : Good or Bad ..

Around 4 months back we faced a latency issue in replication. We used tracer tokens and found that the distribution agent was lagging behind .Before I go forward, let me explain you that the distribution agent has 2 threads .Reader thread reads the value from the MSrepl_transactions table (this activity happens in parallel with the log reader agent where the rows are pumped in to the msrepl_transactions table) from distributor database and the writer thread apply those commands to the subscriber .

To find out where we are getting delayed we configured the verbose logging with level 2 in the distribution agent job ( In the output we saw that there is a good time gap after the sp_MSget_repl_commands command is fired and the next command .Now this command is fired to read the distribution database and populates the rows in the memory tables .These rows are then read by the writer thread and inserted into the subscriber database.

In the verbose log you can also check the statistics which is fired every 5 mins (starting SQL Server 2005 ):
•Cumulative Update 12 for SQL Server 2005 Service Pack 2
•Cumulative Update 5 for SQL Server 2008
•Cumulative Update 2 for SQL Server 2008 Service Pack 1

you will see data between these lines

*************STATISTICS SINCE AGENT STARTED *******************

It was clear that the reader thread was taking a lot of time in retrieving the rows as compared to the writer thread writing those commands to the subscriber database .

The next stage is to check the msrepl_transactions and msrepl_commands tables .We checked the number of rows in those tables and found that there a millions of rows which are replicated but still showing up in the in those tables . This was strange . We checked the cleanup job and found that the job was running fine .

So whats the issue ? Why the job is not removing the replicated rows ??
upon digging deep , we found that the DBA selected the "Create a snapshot immediately and keep the snapshot available to initialize subscriptions" option when he configured the replication .As a result Replication is suppose to keep all the transactions cached in the Distribution database for the entire Retention Period.You will also see all the snapshot files in the snapshot folder on the distributor .

Why it happens :
Every new subscriber that is added with in the subscription retention period needs the initial snapshot and then the data from the logreader is applied over the snapshot that is accumulated in the distribution database. But the Snapshot gets OLD too as the database image changes from time to time .Therefore every time a new snapshot is needed first. Because of this option set, the same old snapshot is applied first and then all the remaining LogReader entries from the distribution database .That is the reason all the old transactions are kept till the subscription expires.

Why is it configured ?
This command is configured in the environment where there is a need to create the subscription quite often and also if the snapshot increases in size quite considerab;y over the time .

How is it configured ?
Commandline :
via sp_addpublication <--Check BOL GUI :

Drawback :
Due to this option the size of Msrepl_transactions and Msrepl_commands increases ( which is more than normal) which slows up the synchronizations and clogs up the system.

How to disable it ?
EXEC sp_changepublication
@publication = 'your_publication_name',
@property = 'immediate_sync',
@value = 'false'

Happy Learning !!!!!!!

Address translation: How Virtual Memory is mapped to Physical Memory

We all know that data retrieval will be fast if the data pages are found in RAM .If the data pages are not in RAM, they are fetched into RAM from the disk. This causes a physical IO .The page remains in the RAM until it’s again kicked off to Disk.
But the process and threads do not access the Physical memory (RAM) directly .Instead the RAM is accessed indirectly through Virtual Memory or Virtual address space (VAS) pointers. On a x86 operating system the number of such pointers in virtual memory that can point to physical memory is 4,294,967,296 (2^32) .This is equal to 4 GB .Out of this 4GB VAS pointers, 2GB worth of pointers are located in the Kernel address space and remaining 2GB in the User Address Space .It’s this 2GB of user address space which is used by the Processes and threads for their use and to map it to RAM. Other 2GB Kernel Address space is also mapped to RAM for the OS routines and APIs .So, normally on a 32 bit windows OS, SQL Server will use 2GB RAM (1.66GB Buffer Pool region and 384MB Mem2Leave region) .
But what’s the need for Virtual memory when we have Physical memory and it is the real memory .Let me correlate this to a smart Bank. The bank started with $5000.A customer deposited $1000 and the bank will return $1100 after a year .after 1 month, Another customer deposited $2000 for 1 year and the Bank will return $2200 after a year. So the bank has now $8000 for around 2 years .Then someone took a loan of $3000 for 1 year and the bank will get $4000 after a year .In between if the earlier 2 clients want to withdraw their deposit before time, they can pay the penalty and Bank has sufficient money to give back from the initial investment .In reality the banks or the moneylenders keep revolving the money which they might not even have.

I hope you have some idea now. The OS also works like this .It assures every process 4GB of memory .Right now there are 119 processes running on my laptop .If we go by this fact then the OS is ensuring 476 GB RAM to the processes .But I have only 2GB RAM on this laptop. That’s where the Virtual memory comes into picture .This 476 GB is actually a virtual memory address space and nothing else ; which does physically not exist .OS memory managers maps this virtual memory to Physical memory (RAM) .During this process the Page file on the disk is also used if a thread needs more physical memory than available .

Let’s skip discussing about AWE, /3GB, /USERVA and PAE for now as it will divert us from the topic which is to know how the Virtual addresses get translated to Physical addresses in RAM.

Address translation is the process of translating the virtual memory to physical memory.

Every Virtual Address has 3 components:

The Page Directory Index : For each process the OS memory manager creates this directory to use it to map all the page tables for the process.The address of this directory in stored somewhere in the address space called as KProcess Block (Kernal Process Block). To keep this subject less complex I will not explain what Kprocess Block is .The CPU keeps track of this Page Directory Index via a register called as CR3 or Control Register 3 .This Register also resides in KProcess Block .So the CPU’s MMU knows where the Page Directory Index is located with the help of This register (MMU: ttp:// ).So the first 10 bits of the address space pointer has Page Directory Index value (there are a lot of page directory entries).This tells Windows which page table to use to locate the physical memory
associated with the address.

The Page Table Index : The second 10 bits of a 32-bit virtual address provide an index into this table and indicate which page table entry (PTE) contains the address of the page in physical memory to which the virtual address is mapped.

The Byte Index: the last 12 bits of a 32-bit virtual address contain the byte offset on the physical memory page to which the virtual address refers. The system page size determines the number of bits required to store the offset. Since the system page size on x86 processors is 4K, 12 bits are required
to store a page offset (4,096 = 2^12).

Let’s summarize it now:
1. The CPU’s Memory Management Unit locates the page directory for the process using the special register mentioned above.
2. The page directory index (from the first 10 bits of the virtual address) is used to locate the (P)age(D)irectory(E)ntry that identifies the page table needed to map the virtual address to a physical one.
3. The page table index (from the second 10 bits of the virtual address) is used to locate the PTE that maps the physical location of the virtual memory page referenced by the address.
4. The PTE is used to locate the physical page. If the virtual page is mapped to a page that is already in physical memory, the PTE will contain the page frame number (PFN) of the page in physical memory
that contains the data in question. If the page is not in physical memory, the MMU raises a page fault, and the Windows page fault–handling code attempts to locate the page in the system paging file. If the page can be located, it is loaded into physical memory, and the PTE is updated to reflect its location. If it cannot be located and the translation is a user mode translation, an access violation occurs because the virtual address references an invalid physical address. If the page cannot be located and the translation is occurring in kernel mode, a bug check (also called a blue screen) occurs.

How the Address translation happens with PAE in place:
Everything is same as above except that:
1)There is a new table which is above PDEs and PTEs .Its Page Directory Pointer Table.
2)The PTEs and PDEs are 64 bit wide as compared to 32 bit wide when PAE is not enabled.

How to resolve setup errors when registry keys are missing (Failed to install and configure assemblies C:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog)

Hi Guys ,
A couple of weeks back we had a setup issue where the SP3 setup was failing on Notification services and Client components .I am sharing my experience with you because it took us a lot of time to figure out that a few registry keys were missing .This explanation has 3 parts viz.

First Part : Solution of the issue .
Second Part : Reproducing this issue and finding a solution in a better way so that the same type of process can be followed for other similar setup issues as well .
Third Part : 2 bugs (Unfortunately , I could not reproduce it on my machine, but repeatedly reproduced it on clients machine .Will still try on my machine and file them later.

First Part : Issue and its solution .
As per the Security Bulitin MS09-004 we rasied a change to patch the DEV (and later would be Prod Servers)server to SQL Sevrer 2005 SP3 .Since we were failing on the DEV Server , we were not able to initiate the setup on Prod until the setup on Dev is resolved . Till then the Prod was under potential threat of SQL injection. The issue was that we were not able to upgrade SQL Server 2005 RTM (32 bit) to SQL Server 2005 SP3 (32 bit) on Windows Server 2008 (64 bit) .

Error Messages
Setup Errors :
Failed to install and configure assemblies C:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog.
Error: 2148734209
Error message: Unknown error 0x80131501
Error descrition: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)

Errors in Application log :
Unable to get the file name for the OLE Transactions Proxy DLL. Error Specifics: hr = 0x8004d027,
d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:176, CmdLine: C:\Windows\syswow64\MsiExec.exe -Embedding 0E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484

Failed to read the name of the default transaction manager from the registry. Error Specifics: hr = 0x00000002,
d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:382, CmdLine: C:\Windows\syswow64\MsiExec.exe -Embedding 0E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484

Errors in SQL Server :
QueryInterface failed for "ITransactionDispenser": 0x8004d027(XACT_E_UNABLE_TO_READ_DTC_CONFIG).

Client Environment Details
SQL Server 2005 RTM 32 bit Standard Edition (WOW mode)
Windows 2008 64 bit Standard

Troubleshooting done
-> The setup was failing only for Notification services and Client tools .Rest all components were successfully upgraded to SP3 .
-> Found the Both Notification services and Client tools were failing because of same error :

---------------------------------------------------------------------------Product : Client Components
Product Version (Previous): 1399
Product Version (Final) :
Status : Failure
Log File : C:\Program Files (x86)\Microsoft SQL Server\90\Setup Bootstrap\LOG\Hotfix\SQLTools9_Hotfix_KB955706_sqlrun_tools.msp.log
Error Number : 29549
Error Description : MSP Error: 29549 Failed to install and configure assemblies c:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog. Error: -2146233087
Error message: Unknown error 0x80131501
Error description: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)

---------------------------------------------------------------------------Product : Notification Services
Product Version (Previous): 1399
Product Version (Final) :
Status : Failure
Log File : C:\Program Files (x86)\Microsoft SQL Server\90\Setup Bootstrap\LOG\Hotfix\SQLTools9_Hotfix_KB955706_sqlrun_tools.msp.log
Error Number : 29549
Error Description : MSP Error: 29549 Failed to install and configure assemblies c:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll in the COM+ catalog. Error: -2146233087
Error message: Unknown error 0x80131501
Error description: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)
-> Detailed MSP logs showed that the error comes when MSDTC is trying to register the notification services using regsvcs

Error: 2148734209
Error message: Unknown error 0x80131501
Error descrition: MSDTC was unable to read its configuration information. (Exception from HRESULT: 0x8004D027)
Error Code: -2146233087
Windows Error Text: Source File Name: sqlca\sqlassembly.cpp
Compiler Timestamp: Sat Oct 25 08:47:00 2008
Function Name: Do_sqlAssemblyRegSvcs
Source Line Number: 155

-> So the issue was with registering Notification Services DLL file which was common to both NS and Client components .As you can see RegSvcs was being called from inside the function Do_sqlAssemblyRegSvcs.

-> We then tried to manually register the microsoft.sqlserver.notificationservices.dll through command prompt :
%windir%\Microsoft.NET\Framework64\v2.0.50727\RegSvcs.exe /fc "C:\Program Files (x86)\Microsoft SQL Server\90\NotificationServices\9.0.242\Bin\microsoft.sqlserver.notificationservices.dll"

-> This did not work and we were getting some other error .

-> Since COM+ Catalog was also showing up in the error ,I ran the SQL Server RTM setup and found the the SCC ( System configuration check ) is failing on COM + Catalog requirements .
->So,I checked the Component Services (DCOMCNFG) and found that there are no issues there .Everything was working fine .We also noticed that there were no errors related to COM components in the Application logs as well except the error related to MSDTC .The 32 bit COM components were also running fine .This was strange but made me believe that this error might be misleading .

-> We checked the SQL Sevrer errorlogs and found the same entry related to MSDTC , but in a slightly different manner :
QueryInterface failed for "ITransactionDispenser": 0x8004d027XACT_E_UNABLE_TO_READ_DTC_CONFIG)

-> We also saw some errors related to MSDTC :
Unable to get the file name for the OLE Transactions Proxy DLL.
Error Specifics: hr =
0x8004d027,d:\rtm\com\complus\dtc\dtc\xolehlp\xolehlp.cpp:176, CmdLine: C:\Windows\syswow64\MsiExec.exe
-Embedding -E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484

Failed to read the name of the default transaction manager from the registry. Error Specifics: hr = 0x00000002,
CmdLine: C:\Windows\syswow64\MsiExec.exe
-Embedding 0E2EA3ADA5DC17201B521333814654B2 C, Pid: 3484

-> Now the picture was clear that MSDTC had issues for sure. Also the COM+ Catalog warning is related to MSDTC issue (I had resolved the same issue a few years back : But it was on Win server 2003)

-> The first error code is same 0x8004d027 but the second error code , which is coming before 0x8004d027 is 0x00000002 is nothing but telling us that there is some registry key missing ( there might be more ) . error 2 means "System cannot find the file specified".

-> Since we were not sure of which key was missing we decided to uninstall and reinstall MSDTC . This will automatically recreate all the missing registry keys.

-> But there was some more twist left. On windows Server 2008 we cannot simply uninstall MSDTC by using command "msdtc -uninstall" .We have to remove the MSDTC server role >> reboot the server >> re-add MSDTC in the Server roles.

-> We tried that ,but even after removing MSDTC from the server role , it was still showing as running in the services console .We tried a few times without success.

-> In between we also tried to repair .net framework 3.5 SP1 which did not help .

-> I also went through this article which talks about making sure that all the MSDTC related keys are present in the registry hive: was the 1st step towards solution , although it did not work.)

-> However , the keys mentioned in this article were present in both WOW mode and normal mode in the Registry .

-> Then on my machine I tried to find the registries with the value "OLETransactionManagers". The first Key hit was : HKEY_CLASSES_ROOT\OLETransactionManagers

-> Since SQL Server was running in WOW mode on client's server , we tried to find HKEY_CLASSES_ROOT\Wow6432Node\OLETransactionManagers key and got it in first attempt.

-> However , on my laptop this key had some values while there were no values on the Client side registry key (mentioned above).

-> We found that these 64 bit MSDTC registry keys were there but the 32 bit (WOW mode) registry keys related to MSDTC were missing .

-> We created these keys in WOW mode

String value name = DefaultTM and value data=MSDTC

String value name=DLL and value data=MSDTCPRX.DLL

->We restarted MSDTC and SQL Server . We checked the SQL Server errorlogs and found that MSDTC related error is no longer showing and also the SQL Server SCC check was not showing that error related to COM+ Catalog any more.

->This gave us some hope and with that hope we hoped that the setup will run successfully .This in turn did happen (",).

Second Part: Reproducing the error
I reproduced this on my machine by removing all the keys from HKEY_CLASSES_ROOT\OLETransactionManagers

When I ran the setup it gave me the same warnings on SCC :

I will still go ahead with the setup and select only the client components to install :

The setup will encounter this error:

Let’s check the Application logs:

At this stage First let me introduce a tool that can show you how to find the missing registry key or if there are any permission related issues .I have been using this tool since long now .This tool is Procmon from sysinternals".We will use this tool to find the missing registry keys .However this tool captures a lot of information of all the processes .So we need to first PAUSE it as soon as we launch it .

Now, we will find the process ID of MSIEXEC.exe from Task Manager since as per the Application logs its the msiexec command that is failing:

I will now filter it on those 3 PIDs

Once it’s done, uncheck the PAUSE icon (the magnifying glass) and click on the retry option on the setup window. Quickly after that ,again PAUSE the Procmon by clicking on the magnifying glass icon. Notice the output in the Procmon. You will see a lot of different keys there .We now need to delete not necessary entries:

Right click on SUCCESS and select:

Right click on RegOpenKey and select:

You are now left with only 6 keys to look at(You can follow the same steps for other missing Reg keys issues ) and all these keys are REGQueryValue:

Now ,notice the error copied above from application log:

Now, double click on each of the 6 keys (start from the bottom most) and select the option >> PROCESS. You will notice the same command line string as showing in the application logs:

So the first key that is missing is HKCR\OLETransactionManagers\DefaultTM .The reason it’s not showing all the missing keys is because it’s failing on first key itself. If you create the first key and then click on retry button on the setup the Procmon will show you the next missing key.

But what should be its value .We can check on other servers (preferably same server version).Create the missing keys and click on retry button .The setup will be successful (in our case it was SP3 and not the initial setup .But the resolution is same).

Third part : The Bugs
1) The issue is , when the SP3 setup fails it should ideally rollback everything to normal .So the NS and client tools should work fine.But On the client's server we saw that the Management Studio stop working and throws an error ( I dont have the screenshot now . Will try and reproduce it ) . The only solution is to uninstall the client tools and reinstall again .During the setup you will again get this error .Click on ignore and the setup will complete .

2) On my laptop , I saw that it also corrupt other components of SQL Server 2008 like Books on Line ..Again I do not have that proof now but will reproduce it .The solution is to uninstall client tools and reinstall tools of SQL Server 2008.

Happy Learning !!

Understanding how SQL Server behaves in a non-preemptive mode while still running on the OS which is preemptive

Today , I was asked by someone how SQL Server behaves in a non-preemptive way while still running on the OS which is preemptive .Even though I explained this theoritically ,I was requested if there is a way we can see it practically . It wasn't difficult but it really proved to what I explained .I alos felt that if we add practicals to theory it will have much deeper impact ...

NOTE : I am doing all the testing on SQL Server 2008 RTM EVAL as what I am trying to explain can be done in 2008 onwards.On 2005 you will not see what is written here.So I would request to use SQL 2008 and above .

Windows Scheduling (Preemptive) :
Starting from Windows NT 3.1 (XP ,2000,2003 etc) Windows scheduling was priority driven i.e preemptive scheduling .So every thread will have a priority associated with it .Based on this priority the threads will get the time slice (Quantum) to run on the CPU .So , even if thread of lower priority is running and all of a sudden another thread of a higher
priority comes up , the low priority thread will be preempted interrupted) and the higher priority thread will be scheduled to run on the CPU.However , the scheduler is smart enough .It will keep the preempted thread on the top of the waiting threads by adjusting its priority (Lets not get too deep in to this at this point).

Prior to this OS scheduling was non-preemptive i.e. cooperative scheduling .Remember the days when Windows 98 use to hang and we use to reboot the server quite often ,to get rid of it.Cooperative scheduling is good if all the threads leave the CPU after some time and give chance to other threads (including kernal mode threads which are more
Important and get a chance to run whenever required) after a fixed interval of time . But that normally does not happen .some nasty application threads don't yield and hence blocks other threads.

SQL Server Scheduling (non-preemptive) :
SQL Server has its own scheduling mechanism and it does not follow OS scheduling (looks strange as it runs on the preemptive OS) .Its called as UMS (User Mode Scheduling) in 2000 and SOS (SQL OS) in 2005 and above .BUT :
1) Why SQL Server does not hang just like windows 98 use to ?
Answer: SQL Server does not hang because its threads yield every voluntarily.In case a thread does not yield in 60 seconds (unlike the faulty application where the threads does not yield)SQL Server throws non-yielding scheduler hung error and throws a mini dump with the stack information of all the threads in it .

2) How SQL Server manages to schedule in the non-preemptive way ?
Answer: Windows OS will not schedule any thread which is running in infinite wait loop and simply ignores it. SQL Server (actually UMS) takes advantage of this and cleverly puts all the threads which it does not want to schedule to infinitely sleep by calling WaitforsingleObject function in an infinite loop .When SQL Server wants the thread to run it simply signal the thread and it comes out of the sleep modes .Its the Windows which then schedules the thread .Its important to know that UMS schedules only ONE SQl Server thread per CPU .However , there is an exception to this . There are moments where to complete a task the thread leaves the SQL Server scheduler and goes to preemptive mode scheduling . For example using xp_cmdshell to open notepad or running an extended stored procedure that deals with filesystem (like reading a file) or a linked server query
etc.In that situation , you will see more than one thread on a single CPU in runnable status .That is because one thread is scheduled via UMS\SOS and the other one directly via OS scheduling.

Let me show you a Demo since my Laptop has only one dual core processor (Its SQL Server 2008 RTM):
Lets first run a simple query and find runnable and sleeping threads Select Status ,* from sysprocesses where status not in ('background')

you will notice that all the SPIDS will show you the status of sleeping and there is only one SPID that will be showing you the status of runnable .Its waittype will be PREEMPTIVE_os_WAITFORSINGLEOBJECT .Notice that only runnable SPID has a KPID associated with it .This KPID is nothing but the worker thread associated with the SPID.You can run it a few times but the output will not change except the KPID which means that one thread is yielding to another after the context switching .The reason why we see runnable state and not running ,because by the time we get the query output the thread again goes to runnable state.You might also see other runnable or suspended SPIDs but its because they are running in preemptive mode .

Now lets open another Query window and execute the same command there .
Select Status ,* from sysprocesses where status not in ('background'). This time its SPID 53 (current SPID on my machine) which is showing us the Runnable state while SPID 52 (the previous SPID)is now sleeping .

Lets do one more experiment.Open a new query window (SPID 51 in my case) and run select @@servicename around 1000 times . Come back to SPID 53 window and notice if the runnable state is showing for SPID 53 or SPID 51.You will notice that we have the SPID 51 doing its task .But why is it showing as sleeping while CPU value is still increasing ? The reason is that when we run the query via SPID 53 , during (and only) that time 51 shows as sleeping because 53 needs to run . So thread related to 51 yields voluntarily. when this query finishes , SPID 51 again picks up , but we cant see that since we have only one processor :) ...

Anyway , let me show you the small test when the SQL Server thread goes preemptive .We have 2 Query windows.One with sysprocesses query (SPID 53) and one with calling xp_readerrorlog 100 times (SPID 51).I further modified my sysprocesses query by filtering sleeping SPIDs.

Select Status ,* from sysprocesses where status not in ('background','sleeping') .Lets run the query through SPID 51 and then by SPID 53.Notice that you have 2 runnable SPIDs now . Thats because SPID 51 is scheduled by OS and not SQL OS \UMS .

Have a nice day and Happy Learning !!!

Wednesday, April 27, 2011

Finding optimal number of CPUs for a given long running CPU intensive queries (except OLAP queries)

Hi Guys ,

Hope this article will help you in some or the other way one day :) .....

This small article is applicable for finding optimal number of CPUs for long running CPU intensive queries/workload that doesn’t frequently wait for other resources and is not applicable if your queries/workload is often waiting for resources (like I/Os, Locks, Latches etc.) without consuming CPU in a stretch .it can also provide information on uneven CPU load across NUMA nodes and uneven CPU load within same NUMA node (load_factor effect).
It is recommended to analyze Windows Performance Monitor Counters for monitoring CPU pressure. Processor utilization greater then 75% to 80% indicates CPU pressure. Using Windows Performance Monitor should be the 1st step, the procedure suggested in this article should be considered as an additional step.
Further ,it is very important to find ways to optimize the queries/workload by tuning the database schema before attempting to add additional CPUs.

When a customer asks you: I am running a resource consuming SQL job and it takes x amount of time, how can I reduce the time so the SQL job completes sooner, can I add more CPUs ? if yes, how many ?
When you see CPU pressure, there are 2 options: you can either upgrade to faster CPUs or add additional CPUs [assuming that the queries are well tuned and normalized]. Upgrading to faster CPU will always help. Adding additional CPUs may not always help the SQL job to run faster unless that SQL job can take advantage of additional CPUs [read Max Degree of parallelism form BOL]. If the customer already has the fastest CPUs available in the market then they have to wait for the next release of faster CPUs. One more choice woiuld be to add additional CPUs and see if it helps, the below procedure will help you identify if this is the case.
This method calculates total user waits for CPU during the SQL workload and suggests additional CPUs if necessary. If CPU usage is at 100%, but no one waited for CPU during the workload, then adding additional CPU will not help; this is the basics of this calculation.
Current recommendations that are available on this topic calculates ‘signal wait time’ to ‘wait time’ ratio to suggest CPU pressure – but this cannot help one easily identify number of additional CPUs necessary.

When concurrent users apply simultaneous CPU intensive workload, there could be CPU pressure. We can conclude presence of CPU pressure when at any given moment during this time period at least one or more user tasks waited for CPU resource.
In this case one can run the below query to find out how many CPU on an average will help to scale(out) the workload better. It might be more informative to collect the below information in short time intervals (many samples) than just once to understand during which time of the workload application the CPU pressure was the most. Single sample will lead to average additional CPUs necessary for the entire workload duration.
1. Reset Wait Stats
dbcc sqlperf('sys.dm_os_wait_stats', clear)
2. Apply workload (you can find sample workload query at the end of this article, you need to execute the sample workload query simultaneously in many sessions to simulate concurrent user tasks).
3. Run the below query to find Additional CPUs Necessary – it is important to run the query right after the workload completes to get reliable information.

select round(((convert(float, ws.wait_time_ms) / ws.waiting_tasks_count) / (convert(float, si.os_quantum) / si.cpu_ticks_in_ms) * cpu_count), 2) as Additional_CPUs_Necessary,
round((((convert(float, ws.wait_time_ms) / ws.waiting_tasks_count) / (convert(float, si.os_quantum) / si.cpu_ticks_in_ms) * cpu_count) / hyperthread_ratio), 2) as Additional_Sockets_Necessary
from sys.dm_os_wait_stats ws cross apply sys.dm_os_sys_info si where ws.wait_type = 'SOS_SCHEDULER_YIELD'

When you have 2 CPUs and you run the sample workload with just 1 or 2 concurrent sessions – you will see no recommendation for addition additional CPUs – unless there is unbalanced user task distribution across CPUs. On the other hand if you run the workload with 4 concurrent sessions – you will notice the query suggests you to add 2 additional CPUs. If you run with 6 concurrent sessions – you will notice the query suggests you to add 4 additional CPUs.
If each workload runs in parallel (MAXDOP not 1), then you will notice additional CPU suggestion, you need to be careful in this case. For example with 2 CPUs when you run the workload (in parallel, MAXDOP 0/2) with 2 concurrent sessions, you will notice the suggestion to add 2 additional CPUs – this just indicates the workload is more scalable with more CPUs – parallel query execution as you can imagine can consume as many CPUs as you have and can consume more!!
The results are not reliable when other applications are running in the system. Also the results might be incorrect on a hyper threading enabled system.

When there are more user tasks concurrently needing CPU than available CPU, the excess user tasks will wait for CPU (there are exceptions when the workload is not evenly distributed across CPUs). In this case each user task uses its quantum, then goes into a wait state (waiting for CPU with wait_type SOS_SCHEDULER_YIELD. sys.dm_exec_requests doesn’t show this wait type, probably by design to avoid showing user tasks in wait state when they are waiting for CPU. But sys.dm_os_wait_stats will include these waits) until all other runnable user tasks have used their quantum. If one measures how many tasks went into this wait state and for how long while the workload was applied – it is possible to calculate the CPUs necessary to scale the workload better.
runnabkle_task_count from the dm_os_schedulers is also a indication for CPU pressure, but it is just a probe – one cannot reasonably conclude the number of CPUs necessary for a given workload.

There is an exception(for OLTP like workload) where a user tasks doesn’t consume all of its quantum(goes into some other wait state before the quantum expires, waiting for I/Os, Locks, Latches etc.) in a stretch, but continues to run in a loop using CPU without using its full quantum(You know what quantum is ...right :D). The method mentioned here cannot calculate the necessary additional CPUs in this case.. Most common example is short transactions using part of its quantum and starts WRITELOG waits and continues in a loop – inserts using implicit transactions in a loop is a typical example.

Sample Workload:
Create the below table before running the query to generate CPU intensive workload.
Serial Workload:
select max(t1.c2 + t2.c2) from tab7 t1 cross join tab7 t2 option (maxdop 1)
Parallel Workload:
select max(t1.c2 + t2.c2) from tab7 t1 cross join tab7 t2 
create table tab7 (c1 int primary key clustered, c2 int, c3 char(2000))
begin tran
declare @i int
set @i = 1
while @i <= 5000
insert into tab7 values (@i, @i, 'a')
set @i = @i + 1
commit tran

Happy Learning !!!