New Member.

DP7.0 - Time Sync on Cell Manager Causes IPC Read Error

For the past two weeks my Data Protector has been reporting a lot of "IPC Read Error: Connection reset by peer" errors.

After doing some research on the servers reporting these backup failures, I have come to the conclusion that every time the Cell Manager does a time sync with our local NTP server (I assume), all Data Protector backup jobs lose their connection to the hosts.

 

Error reported on the Cell Manager:

Level: Information

Source: Kernel-General

General: The system time has changed to 2014-04-29T07:34:25.610000..Z From 2014-04-29T07:33:17.94217400Z.

 

Error report in dataprotector:

[Major] From: BSM@server** "Backup Job" Time: 4/29/2014 9:34:25 AM Lost connection to the VBDA named "location" on host "server". Ipc subsystem reports: "IPC Read Error System error: [10054] Connection reset by peer".

 

Does anyone know why the Cell Manager resyncs its time so often? Or better, why the time differs so much? According to w32tm, resyncing is done every 1024 seconds.

 

The Cell Manager is a virtual machine using an NTP server inside our domain. I know I can make the "retry aborted connection" time value longer, but only this server seems to have this issue, and that does not give me a solution for scheduled automated backup copy jobs. I could also manage this server by GPO and turn off time sync, but that is not really a good idea, as it would not solve the underlying problem.

 

DP: 7.0

WinServ 2008 R2 SP1


Accepted Solutions
Visitor.

Hi,

 

now that is quite a strange effect. Please note that you are jumping more than a minute into the future, but you stay in sync with your NTP server. BTW, the cmd window for the stripchart should be resized to something like 104 columns, so it doesn't wrap around - it only then becomes apparent why it's called a stripchart 😉

 

Now usually, when your clock does sizable jumps, you see it end up far away from the NTP source: the asterisk in the chart is then off-center, and you can follow it moving back to the center (sadly you cannot have this chart adjusted to subsecond precision, but that's not much of a problem with bad drifts or external things messing with the clock). But your clock just loses a little precision (its offset to the NTP server grows by some 25 ms, which is nothing strange in a virtual environment), all the while jumping more than a minute in the same instant.

 

What could cause that? I see two things:

  • Your NTP server itself is jumping at the same moment. This is extremely unlikely, as it should not give out time just 2s after such a jump, given it cannot know its sync state very well. But it's not impossible, for instance if the NTP server is itself virtualized, is pushed to the future in the same moment that your VM is, and gives out a timestamp before the NTP daemon actually recognizes it was just hit over the head.
  • Your VM is actually frozen for more than a minute, then revives and has its clock corrected by the hypervisor so it just happens to still be in sync with the NTP server, but it lost more than a minute. Such "stop the universe" effects are possible in a virtualization environment: just consider somebody pressing the pause button for that VM, or the pause being a side effect of some automated operation (like vMotion).

The second cause would also explain the IPC breaking, as the machine was off the net for more than a minute from the point of view of everybody not subject to the same freeze.
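The reasoning above can be written down as a small decision rule. This is only a sketch: the function name, the labels, and all thresholds are made up for illustration, and the 2 s sample period is an assumption about how the stripchart was run.

```python
def classify_clock_event(gap_seconds, offset_seconds,
                         expected_period=2.0, gap_tolerance=5.0,
                         offset_threshold=1.0):
    """Classify one stripchart sample.

    gap_seconds    -- wall-clock time elapsed since the previous sample
    offset_seconds -- reported offset to the NTP server for this sample
    All thresholds are illustrative assumptions, not tuned values.
    """
    jumped = gap_seconds > expected_period + gap_tolerance
    in_sync = abs(offset_seconds) < offset_threshold

    if not jumped and in_sync:
        return "normal"
    if not jumped:
        # Clock runs continuously but sits far from the NTP source
        # (the off-center asterisk case).
        return "drift"
    if in_sync:
        # Wall clock skipped ahead yet still agrees with NTP:
        # VM freeze plus hypervisor correction, or the NTP server jumped too.
        return "jump-in-sync"
    return "jump-out-of-sync"


# An 84 s gap with a 25 ms offset, as in the case discussed here.
print(classify_clock_event(84.0, 0.025))
```

Only the "jump-in-sync" outcome matches the pattern described in this thread; the other labels cover the more common drift cases.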

 

HTH,

Andre.

6 Replies
Visitor.

Hi,

 

Jumping by more than one minute into the future can occasionally confuse software that is built around a wallclock-based timer queue, which apparently is the case for the DP IPC. That being said, if you are NTP synced and that is working properly, e.g. when you really poll within 1200s, you should never accumulate a drift of that size unless something external is messing with your clock. (In a Windows domain NTP sync usually works; I've found that non-domain members poll far too rarely, something like once a week, which easily lets them drift away like crazy.) My first guess would be that the VM is configured to sync time from its host (in VMware Tools), and that the host is free-running and has drifted away by a minute.
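To put a number on "a drift of that size": a rough back-of-the-envelope calculation, assuming a 60 s offset builds up within one 1200 s poll interval. The function name is made up for illustration.

```python
def required_drift_ppm(offset_seconds, poll_interval_seconds):
    """Clock rate error, in parts per million, needed to accumulate
    offset_seconds between two consecutive NTP polls."""
    return offset_seconds / poll_interval_seconds * 1_000_000


# 60 s of drift within a single 1200 s poll interval
ppm = required_drift_ppm(60, 1200)
print(f"{ppm:.0f} ppm, i.e. {ppm / 10_000:.1f} % rate error")
```

A clock that is off by several percent is far outside what a free-running hardware clock normally shows, which is why an external step (rather than ordinary drift) is the more plausible explanation.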

 

Watch your clock using w32tm /stripchart /computer:my-ntp-server to see what's going on.
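If you capture the stripchart output to a file, a short script can flag wall-clock jumps between consecutive samples. The sketch below assumes the output was produced with the /dataonly switch, that each data line looks like "HH:MM:SS, +SS.fffffffs", and that samples arrive every 2 s; the sample lines and the function name are made up for illustration.

```python
import re
from datetime import datetime

# Assumed /dataonly line format: "10:08:40, +00.0004510s"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}),\s*([+-]\d+\.\d+)s")


def find_jumps(lines, period_seconds=2.0, tolerance_seconds=5.0):
    """Return (previous_time, current_time, gap_seconds, offset_seconds)
    for every pair of samples that lies further apart than expected."""
    jumps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip header lines and error messages
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        offset = float(match.group(2))
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap < 0:
                gap += 86400  # samples straddle midnight
            if gap > period_seconds + tolerance_seconds:
                jumps.append((previous.strftime("%H:%M:%S"),
                              match.group(1), gap, offset))
        previous = current
    return jumps


# Hypothetical capture; the offsets are invented for the example.
sample = [
    "10:08:38, +00.0003120s",
    "10:08:40, +00.0004510s",
    "10:10:04, +00.0251234s",
    "10:10:06, +00.0249870s",
]
for prev, cur, gap, offset in find_jumps(sample):
    print(f"jump from {prev} to {cur}: {gap:.0f} s, offset {offset:+.4f} s")
```

A large gap combined with a small offset means the clock skipped ahead but still agrees with the NTP server.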

 

HTH,

Andre.

New Member.

I managed to capture this error with the stripchart command when the problem occurred.

Notice the time difference: the clock jumps from 10:08:40 to 10:10:04. Then the error occurs and the server resyncs its time, which interrupts all backups.

New Member.

I have found the problem!

 

It had something to do with vRanger and Snapshots that were made in VMware.

 

I was looking at the "Tasks and Events" log in VMware for the server, and apparently vRanger was taking a snapshot at the time my server lost connection. Removing all snapshots that were still on the server resolved the problem.

 

Server Tasks & Events Log: Virtual machine disks consolidation succeeded - Remove Snapshot.

 

When this task is logged, the time-sync error is reported on the Windows server at exactly the same time.

 

Thanks André for the help!

Micro Focus Expert

This can happen depending on what information is included in a VMware snapshot, regardless of the backup software used. If you capture VM memory, the VM will freeze for some time, and this will cause network connections to terminate.

 

Regards,

Sebastian

New Member.

The snapshot also included the virtual memory, which is 20 GB, so I can imagine that this would take a few seconds more than for a normal VM with only 4-6 GB of virtual memory.

 

Thank you everyone for the support! Much appreciated!

The opinions expressed above are the personal opinions of the authors, not of Micro Focus. By using this site, you accept the Terms of Use and Rules of Participation. Certain versions of content ("Material") accessible here may contain branding from Hewlett-Packard Company (now HP Inc.) and Hewlett Packard Enterprise Company. As of September 1, 2017, the Material is now offered by Micro Focus, a separately owned and operated company. Any reference to the HP and Hewlett Packard Enterprise/HPE marks is historical in nature, and the HP and Hewlett Packard Enterprise/HPE marks are the property of their respective owners.