I am sure those of you who have been through the replication latency issues know that these issues are not easy to crack .Gaining expertise in Replication especially issues related to latency cannot be learned or taught in a class.
Mostly, issues related to latency are due to :
1: Blocking on DIST/PUB/SUB
2: High Resource consumption : CPU,Storage,Memory
3: Network issues
4: Huge size of Distribution database (msrepl_transactions and msrepl_comands)
5: Transaction log of publisher is huge (too many VLFs) causing log reader latency .
-I am sharing this one with you where actually there was no latency due to any of these as I
mentioned above.
-There was no blocking
-CPU,memory and Diskes were doing good .
-There was no network issue
-We shrunk the publisher database which did not help us at all .
-Cleanup job was running fine .We ran update stats with full scan on msrepl_transactions and
msrepl_comands which did not help either .
When latency started piling up , we first had a look at the log reader and distributor agent history in distribution database but could not get much information due to another issue (out of scope of this post) because of which the log reader history was not showing up and distributor history was showing that almost no data is coming to subscriber ..Normally every 5 mins the throughput of logreader and distribution agent threads (reader and writer threads) are written in to mslogreader_history and msdistribution_history system tables .But we were getting false entries there .
We had no choice but to take the verbose log for distribution and log reader agents .From the logs it was clear that the log reader agent was delivering more than 1000 commands/sec while distribution agent was delivering less than 100 cmds/sec .
We were totally clue less because :
- Log Reader was running at its usual speed .So there was no issue with the Publisher .Nor there was any issue in pumping the data to Distributor.The log Reader Verbose log cleared it .So this was ruled out .
- Data from Distributor to Subscriber was slow .But we were not able to find out whether its slow in reading the data from distribution db or slow in pumping the data to the subscriber .db .What we could clearly see was that there was almost no activity on the subscriber .
-No Resource bottlenecks .Its just that something was stopping the data to flow from distributor to subscriber .This was like when we see an hour glass where there is a lot of sand at the top but the hole is so small that very less can go out of it .
I then saw the distribution verbose logs again and found that there were very frequent entries related to committing the transactions .The 5 similar sentences in 5 continuous lines were not very much clear but I could feel that something related to committing transactions was happening .But still very unclear .These entries were repeating in a group of 5-6 very frequently .The time consumption was around 2 seconds for each spell of 5-6 entries .
This gave us some hint and we immediately jumped to the Distribution agent profile to see if someone has not modified any setting which might be causing this issue.We did a mistake again because we just right clicked replication >> Distributor properties >> agent profiles >> Distribution agent .This was looking fine .
However, one of our colleagues opened up the profile of the distribution agent that was giving us the issue .This was the right way guys .We needed to do this . We found that -CommitBatchSize (default Value 100) and -CommitBatchThreshold(default Value 1000) values were changed to 10 each .
We then changed it back to the default and recycled the Distribution Agent .Thats it , the story ends here .
Due to this the rate at which the transactions were delivered at 1 cmd/transaction .
Even though it was a small setting that caused us slogging for hours , but the experience of working on the replication latency issue when there was actually no resource bottleneck (I call it as false latency) was amazing ..
Cheers and Happy Learning ..
Mostly, issues related to latency are due to :
1: Blocking on DIST/PUB/SUB
2: High Resource consumption : CPU,Storage,Memory
3: Network issues
4: Huge size of Distribution database (msrepl_transactions and msrepl_comands)
5: Transaction log of publisher is huge (too many VLFs) causing log reader latency .
-I am sharing this one with you where actually there was no latency due to any of these as I
mentioned above.
-There was no blocking
-CPU,memory and Diskes were doing good .
-There was no network issue
-We shrunk the publisher database which did not help us at all .
-Cleanup job was running fine .We ran update stats with full scan on msrepl_transactions and
msrepl_comands which did not help either .
When latency started piling up , we first had a look at the log reader and distributor agent history in distribution database but could not get much information due to another issue (out of scope of this post) because of which the log reader history was not showing up and distributor history was showing that almost no data is coming to subscriber ..Normally every 5 mins the throughput of logreader and distribution agent threads (reader and writer threads) are written in to mslogreader_history and msdistribution_history system tables .But we were getting false entries there .
We had no choice but to take the verbose log for distribution and log reader agents .From the logs it was clear that the log reader agent was delivering more than 1000 commands/sec while distribution agent was delivering less than 100 cmds/sec .
We were totally clue less because :
- Log Reader was running at its usual speed .So there was no issue with the Publisher .Nor there was any issue in pumping the data to Distributor.The log Reader Verbose log cleared it .So this was ruled out .
- Data from Distributor to Subscriber was slow .But we were not able to find out whether its slow in reading the data from distribution db or slow in pumping the data to the subscriber .db .What we could clearly see was that there was almost no activity on the subscriber .
-No Resource bottlenecks .Its just that something was stopping the data to flow from distributor to subscriber .This was like when we see an hour glass where there is a lot of sand at the top but the hole is so small that very less can go out of it .
I then saw the distribution verbose logs again and found that there were very frequent entries related to committing the transactions .The 5 similar sentences in 5 continuous lines were not very much clear but I could feel that something related to committing transactions was happening .But still very unclear .These entries were repeating in a group of 5-6 very frequently .The time consumption was around 2 seconds for each spell of 5-6 entries .
This gave us some hint and we immediately jumped to the Distribution agent profile to see if someone has not modified any setting which might be causing this issue.We did a mistake again because we just right clicked replication >> Distributor properties >> agent profiles >> Distribution agent .This was looking fine .
However, one of our colleagues opened up the profile of the distribution agent that was giving us the issue .This was the right way guys .We needed to do this . We found that -CommitBatchSize (default Value 100) and -CommitBatchThreshold(default Value 1000) values were changed to 10 each .
We then changed it back to the default and recycled the Distribution Agent .Thats it , the story ends here .
Due to this the rate at which the transactions were delivered at 1 cmd/transaction .
Even though it was a small setting that caused us slogging for hours , but the experience of working on the replication latency issue when there was actually no resource bottleneck (I call it as false latency) was amazing ..
Cheers and Happy Learning ..