TechOpsGuys.com Diggin' technology every day

August 7, 2012

Adventures with vCenter, Windows and expired Oracle passwords

Filed under: General, by Nate @ 7:39 pm

Today’s a day that I wish I could have back. It was pretty much a waste.

I’m not a Windows person by trade, of course, but I did have an interesting experience today. I write this in the hope that it can save someone else the same pain.

Last night I kicked off some Windows updates on a vCenter server. I’ve done it a bunch of times before and never had an issue. There were only about 6-10 updates to install. It installed them, then rebooted, and was taking a really long time to complete the post-install steps. After about 30 minutes I gave up and went home; it has always come back when it’s done.

I forgot about it until this morning, when I went to do some work in vCenter and could not connect. Then I tried to remote desktop into the system and could not (the TCP port was not listening). So I resorted to logging in via the VMware console. I tried resetting remote desktop, to no avail. I went to the control panel to check on Windows Update, and the Windows Update control panel just hung. I went to ‘add/remove programs’ to roll back some updates, and it hung while looking for the updates.
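The "TCP port not listening" check above can be scripted from another machine. What follows is only a minimal sketch, not something from the original troubleshooting: the function name `port_is_listening` is made up for illustration, and it simply attempts a TCP connection and reports whether it succeeded.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only proves that something
    accepted the connection, not that the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, `port_is_listening("vcenter.example.com", 3389)` (the host name is a placeholder) would probe the default remote desktop port. A `False` result cannot distinguish a stopped service from a firewall dropping packets, so treat it as a hint rather than a diagnosis.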

I tried firing up IE9, and it didn’t start; it just spun an hourglass for a few seconds and stopped. I scoured the event logs and there was really nothing there, no errors. At this point I was convinced an OS update had gone wrong. I mean, why else would something like IE break? There was an IE update among the updates installed last night, after all.

After some searching I saw people commenting that a new version of Flash was causing IE to break, so I went to remove Flash (I forget why it was installed, but there was a reason at the time) and could not. In fact I could not uninstall anything; it just gave me a generic message along the lines of “wait for the system to complete the process before uninstalling this”.

I came across a Windows tool called the System Update Readiness Tool, which sounded promising as well. I was unable to launch IE, of course. I did have Firefox and could load the web page, but I was unable to download the software without Firefox hanging(!?). I managed to download it on another computer and copy it over the network to the affected server’s HD. But when I tried to launch it, sure enough, it hung too, almost immediately.

Rebooting didn’t help. Shutting down completely and starting up again didn’t help either; same behavior. After consulting with the IT manager, who spends a lot more time in Windows than I do, we booted into safe mode, and it came right up. Windows Update is not available in safe mode, and most services were not started, but I was able to get in and uninstall the hotfix for IE. I rebooted again.

At some point along the line I got the system to where I could remote desktop in, Windows Update looked OK, IE loaded, etc. I called the IT manager over to show him and decided to reboot to make sure it was OK, only to have it break on me again.

I sat at the post-install screen for the patches (Stage 3 of 3 0%) for about 30 minutes. At this point I figured I had better start preparing to install another vCenter server, so I started that process in parallel and talked a bit with HP/VMware support. I shut off the VM and booted it again; no difference, it was just sitting there. So I rebooted into safe mode again, removed the rest of the patches that were installed last night, and rebooted into normal mode. I must have waited 45 minutes or so for the system to boot. It did boot eventually and got past that updates screen, but the system was still not working right: vCenter was hanging and I could not remote desktop in.

About 30 minutes after the system booted I was able to remote desktop in again, not sure why. I kept poking around without making much progress. I decided to take a VM snapshot (I had not taken one originally, but in the grand scheme of things it wouldn’t have helped), re-install those patches, and let the system work through whatever it had to work through.

So I did that, and the system was still wonky.

I looked and looked: vCenter still hanging, nothing in the event log, and nothing in the vpx vCenter log other than stupid status messages like

2012-08-08T01:08:01.186+01:00 [04220 warning 'VpxProfiler' opID=SWI-a5fd1c93] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:12.535+01:00 [04220 warning 'VpxProfiler' opID=SWI-12d43ef2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:23.884+01:00 [04356 warning 'VpxProfiler' opID=SWI-f6f6f576] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:35.234+01:00 [04220 warning 'VpxProfiler' opID=SWI-a928e16] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:46.583+01:00 [04220 warning 'VpxProfiler' opID=SWI-729134b2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:57.932+01:00 [04328 warning 'VpxProfiler' opID=SWI-a395e0af] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:09.281+01:00 [04220 warning 'VpxProfiler' opID=SWI-928de6d2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:20.631+01:00 [04328 warning 'VpxProfiler' opID=SWI-7a5a8966] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:32.058+01:00 [04220 warning 'VpxProfiler' opID=SWI-524a7126] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:43.804+01:00 [04328 warning 'VpxProfiler' opID=SWI-140d23cf] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:55.551+01:00 [04356 warning 'VpxProfiler' opID=SWI-acadf68a] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:07.297+01:00 [04328 warning 'VpxProfiler' opID=SWI-e42316c] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:19.044+01:00 [04356 warning 'VpxProfiler' opID=SWI-3e976f5f] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:30.790+01:00 [04328 warning 'VpxProfiler' opID=SWI-2734f3ba] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms

No errors anywhere. I believe I looked at the Tomcat logs a few times, and there were no logs for today.

Finally I dug into the Tomcat logs from last night and came across this:

Aug 6, 2012 11:27:30 PM com.vmware.vim.common.vdb.VdbODBCConfig isConnectableUrl
SEVERE: Unable to get a connection to: jdbc:oracle:thin:@//DB_SERVER:1521/DB_SERVER as username=VPXADMIN due to: ORA-28001: the password has expired
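Since the decisive line sat in the previous night's Tomcat log rather than today's, a scan of every log file for Oracle error codes might have found it sooner. This is a rough sketch under assumptions: the `*.log` file pattern and the function names are illustrative, not anything vCenter provides, and the only thing taken from the log above is the `ORA-` plus five digits format.

```python
import re
from pathlib import Path

# Oracle error codes such as ORA-28001: the prefix followed by five digits.
ORA_PATTERN = re.compile(r"ORA-\d{5}")


def find_oracle_errors(lines):
    """Yield (line_number, error_code, line) for every ORA code in lines."""
    for number, line in enumerate(lines, start=1):
        for match in ORA_PATTERN.finditer(line):
            yield number, match.group(0), line.rstrip("\n")


def scan_log_dir(log_dir, glob="*.log"):
    """Scan matching files in log_dir, oldest first (by modification time).

    The directory and glob are assumptions; point them at wherever the
    Tomcat logs actually live on the vCenter server.
    """
    paths = sorted(Path(log_dir).glob(glob), key=lambda p: p.stat().st_mtime)
    for path in paths:
        # errors="replace" keeps odd bytes in a log from aborting the scan.
        with path.open(errors="replace") as handle:
            for number, code, line in find_oracle_errors(handle):
                yield path.name, number, code, line
```

Scanning all files, not just the newest one, is the point here: the service failed once at startup and then logged nothing further, so the evidence only existed in an older file.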

I had encountered a password expiry on my sys account a few weeks ago but didn’t think much about it at the time. Anyway, I reset the password and vCenter was able to start. I disabled password expiry per this page, which says the defaults were changed in 11g and passwords do expire now (I have used Oracle 10g and a little of 8/9i and never recall having password expiry issues).
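The linked page is not reproduced here, so the following is only a sketch of the usual approach, not necessarily what that page says. In Oracle 11g the DEFAULT profile ships with a PASSWORD_LIFE_TIME of 180 days, and the standard way to turn expiry off is an `ALTER PROFILE` statement. The Python helpers below are hypothetical; they just build the SQL text for a DBA to review and run in a tool such as SQL*Plus.

```python
import re

# Plain (unquoted) Oracle identifier: a letter, then letters, digits, _ $ #,
# up to 30 characters in 11g.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]{0,29}$")


def _check_identifier(name: str) -> str:
    """Reject anything that is not a plain identifier, then upper-case it."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a plain Oracle identifier: {name!r}")
    return name.upper()


def expiry_check_sql(username: str) -> str:
    """Query showing an account's status, expiry date, and profile."""
    user = _check_identifier(username)
    return (
        "SELECT username, account_status, expiry_date, profile "
        f"FROM dba_users WHERE username = '{user}'"
    )


def disable_expiry_sql(profile: str = "DEFAULT") -> str:
    """Statement that stops passwords under the given profile from expiring."""
    prof = _check_identifier(profile)
    return f"ALTER PROFILE {prof} LIMIT PASSWORD_LIFE_TIME UNLIMITED"
```

For the account in the log above, `expiry_check_sql("vpxadmin")` gives the status query and `disable_expiry_sql()` gives the profile change. Changing the profile does not revive a password that has already expired, which is why the password reset still had to happen first.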

I have had vCenter fail to start because of DB issues in the past. In fact, because vCenter does not properly release locks on the Oracle DB when it shuts down, the easiest workaround is to restart Oracle whenever I reboot the vCenter server (vCenter is the only thing on the Oracle DB, so it’s just a simpler solution). When vCenter fails in this way it causes no issues for the rest of the OS: just an error message in the event log saying vCenter failed to start, and a helpful explanation as to why:

Unable to get exclusive access to vCenter repository.   Please check if another vCenter instance is running against the same database schema.

What gets me, even now, is how the hell this expired password cascaded into Internet Explorer breaking, remote desktop breaking, Windows Update breaking, etc. My only guess is that vCenter was perhaps flooding the system with RPC messages, causing other things to break. Again, there was no evidence of any errors in the event log anywhere. I even called a friend who works at Microsoft and deploys hundreds of Windows servers for a living (he works as a Lab Manager), hoping he would have an idea. He said he had seen this behavior several times before but never tried to debug it; he just wiped the system and reinstalled. I was close to doing that today, but fortunately I eventually found a solution, and I guess you could say I learned something in the process?

I don’t know.

I have not seriously used Windows since the NT4 days (I have used it casually on the desktop and in some server roles like this vCenter system). There were many reasons why I stopped using it, and I suppose this was sort of a reminder. I’m not really up to moving to the Linux vCenter appliance yet, since it seems beta-ish, and that assumes I ever move to the appliance at all before I switch to KVM (at some point, no rush). I have a very vague memory of experimenting one time on NT4, or maybe it was 3.51, where I decided to stop one or more of the RPC services to see what would happen. Havoc, of course. I noticed that one of the services vCenter depends upon, the DCOM Server Process Launcher, seems similarly important in Windows 2008, though 2008 smartly does not allow you to stop it. I chuckled when I saw that the Recovery Action for this service’s failure is Restart the Computer. But in this case the service was running… I looked for errors for it in the event log as well and there were none.

4 Comments

  1. Oracle DB user passwords expired.
    vCenter Application “broke” the Server OS

    Well, I guess it makes sense to blame the core OS for the above.

    Comment by pannivas — August 8, 2012 @ 6:56 am

  2. thanks for the comment!

    Since I spent a good 90% of the troubleshooting time troubleshooting the OS, and all recommendations were to re-install the OS, I think it’s fair to put most of the blame there. I don’t see how one application that fails to start for a minor reason can wreak so much havoc, with no event logs from anywhere saying there is an issue. It’s not as if it’s a driver failing or faulty hardware causing crashes (though I’ve been trying for 2 months to diagnose why my home Windows 7 system BSODs on boot about 50% of the time, when XP works fine and Linux has worked fine for over a year).

    Comment by Nate — August 8, 2012 @ 7:44 am

    If an application fails to log its errors in the event viewer, then it’s the application’s fault and not the OS’s fault.
    If vCenter couldn’t give you a meaningful warning or log event, then again it is not the OS’s problem.
    You were misled into believing that the OS was the problem mainly because you “hate” the specific OS to begin with. I have seen this happen with many Linux admins a lot of times and it will never get old (the same could be said for Windows admins); thus you spent a lot of time troubleshooting the OS, which was not the main problem to begin with.
    And the issue you faced is definitely specific to your setup of vCenter and Oracle, because I have seen this DB connection issue with vCenter and MSSQL, and only the vCenter services failed to start. The core OS was working perfectly fine.
    I suggest you contact VMware and open a ticket with them.
    I have seen a lot of Linux applications cause a kernel panic. Does that make Linux a bad OS, or does it make the application bad?
    Again, a bad unsigned driver can easily BSOD any Windows OS. I guess technology people already know this?
    Faulty hardware, especially RAM or a GPU, could easily cause a BSOD on a desktop OS and even a server OS.

    Comment by pannivas — August 8, 2012 @ 11:04 am

  4. thanks again for the insight though I disagree.

    A failing application should not render the rest of the operating system unusable to the point of forcing a re-installation.

    Add to that the fact that other services were failing (e.g. remote desktop was not listening on the TCP port) with no errors in the event log as to why. No messages about why IE failed to launch, not even a crash dialog; it just did nothing. No errors about why Windows Update failed to launch, etc.

    As far as the OS was concerned, it was behaving normally (when it was not).

    I don’t hate Windows myself. I did, a long time ago (one of the reasons I moved off of it). But as the years went on I have become more neutral on it; I don’t really care either way. In fact I go out of my way every day to use Windows (in a VM) for VPN, because the VPN system doesn’t support Linux. I don’t use it in a serious capacity, though; that is, I’m by no means an expert anymore. Meanwhile most everyone else who works near me uses a Mac and almost universally refuses to use Windows. (I refuse to use Mac; I’d rather use Windows than Mac.)

    I don’t hate Microsoft either (which may shock you too). Again, I used to, back when they were perceived as more of a threat (90s), but they have diminished over the past decade, and I have actually gone and bought several pieces of their software (the first since Windows 95, if I recall right), including 2 copies of Windows 7, Visio Professional, and an XP license too (for some older games; the XP license seemed kind of shady, as it was a CD along with a Dell OEM license sticker, but it worked). I’ve spent more on Microsoft software than Linux since I started using it. Visio is a good product. Windows 7 seems alright (though I’m old school and prefer the more classical feel of XP/2k/NT). I wasn’t happy about having to buy Windows 7 Pro (when I had Home) in order to use my dual socket quad core box, when Windows 7 Home works fine on a single socket 8 core CPU.

    As to why my home system is BSODing: the Windows 7 BSOD says “Hardware problem”, though after fairly extensive testing I have been unable to trace the issue. If I can get Windows 7 to boot it is solid as a rock, until I need to reboot again (updates, or whatever), which is very confusing to me. I suspected it too could be memory, ran a bunch of tests, and have come up empty so far (I haven’t turned the system on in 2 months now, so it hasn’t been a priority; I rebuilt it mainly to play some games but have sort of lost interest since). It did run Linux for more than a year, though (and it does have ECC RAM, so that helps; it’s a dual socket system), and Linux is notorious for flaking out with bad memory, long before Windows was picky about it.

    Linux has many faults too, of course. On the desktop there have been big conflicts over things like Unity, GNOME 3, etc. I outright refuse to use those products, so I have not been affected by them (and thus have not written about them). I’m still using Ubuntu 10.04 on my desktop and laptop (Debian on my personal servers) with GNOME 2, and am content with that for as long as I can keep it running.

    See some of my other articles where I do some Linux ripping
    http://www.techopsguys.com/2012/06/30/synchronized-reboot-of-the-internet/
    http://www.techopsguys.com/2012/06/17/the-old-linux-abi-compatibility-argument/

    Both are fairly recent.

    Though I struggle to think of any time, on any operating system (Windows included), where a failing application would prompt a re-install of the OS because it became unusable. Drivers are an exception (and I wouldn’t consider them applications anyway). I do recall re-installing Windows about once every 6 months back in the 90s, when I used it more heavily, though I don’t recall the trigger being any single application; it was more of a gradual decline in stability.

    thanks again

    Comment by Nate — August 8, 2012 @ 11:46 am
