TechOpsGuys.com Diggin' technology every day

October 3, 2012

Oracle doesn’t care if Sun hardware goes to zero

Filed under: General — Nate @ 11:08 am

I saw an interesting interview with Larry over at Oracle yesterday. It was pretty good; it was nice to see him being honest, not trying to sugar coat anything.

He says they have two hardware businesses – one that they care about (engineered systems), and one that they don’t (mainly the commodity x86 stuff, though I have to think it encompasses everything that is not the engineered, integrated products). He also says they don’t care if/when the Sun hardware business goes to $0. Pretty brutal.

This is somewhat contrary to comments I saw recently claiming Oracle was heavily discounting their software while keeping Sun hardware discounts at zero, so they could show higher revenue on the hardware side.

Given that, is there any hope for what’s left of Pillar? I suspect not. I suppose that funky acquisition of Pillar that Oracle did a while back probably won’t result in anyone getting a dime, and may or may not allow Larry to recoup his investment in the company. Sad.

August 7, 2012

Adventures with vCenter, Windows and expired Oracle passwords

Filed under: General — Nate @ 7:39 pm

Today is a day I wish I could have back – it was pretty much a wash.

I’m not a Windows person by trade of course, but I did have an interesting experience today. I write this in the hope that it might save someone else the same pain.

Last night I kicked off some Windows updates on a vCenter server – I’ve done it a bunch of times before and never had an issue. There were only about 6-10 updates to install. It installed them, then rebooted, and was taking a really long time to complete the post-install work; after about 30 minutes I gave up and went home. It has always come back on its own when it’s done.

I forgot about it until this morning, when I went to do stuff with vCenter and could not connect. Then I tried to remote desktop into the system and could not (the TCP port was not listening), so I resorted to logging in via the VMware console. Resetting remote desktop didn’t help. I went to Control Panel to check on Windows Update, and the Windows Update control panel just hung. I went to Add/Remove Programs to roll back some updates and it hung while looking for the updates.

I tried firing up IE9 and it didn’t launch – it just spun an hourglass for a few seconds and stopped. I scoured the event logs and there was really nothing there – no errors. By this time I was convinced an OS update had gone wrong; why else would something like IE break? There was an IE update as part of the batch installed last night, after all.

After some searching I saw people commenting that a new version of Flash was causing IE to break, so I went to remove Flash (I forget why it was installed, but there was a reason at the time) and could not. In fact I could not uninstall anything – it just gave me a generic message along the lines of “wait for the system to complete the process before uninstalling this”.

I came across a Windows tool called the System Update Readiness Tool, which sounded promising. I was unable to launch IE of course; I did have Firefox and could load the web page, but was unable to download the software without Firefox hanging (!?). I managed to download it on another computer and copy it over the network to the affected server’s disk. But when I tried to launch it – sure enough, it hung too, almost immediately.

Rebooting didn’t help; shutting down completely and starting up again – no luck, same behavior. After consulting with the IT manager, who spends a lot more time in Windows than I do, we booted to safe mode – the system came right up. Windows Update is not available in safe mode and most services were not started, but I was able to get in and uninstall the hotfix for IE. I rebooted again.

At some point along the line I got the system to where I could remote desktop in, Windows Update looked OK, IE loaded, etc. I called the IT manager over to show him, and decided to reboot to make sure it was OK – only to have it break on me again.

I sat at the post-install screen for the patches (Stage 3 of 3, 0%) for about 30 minutes. At that point I figured I had better get prepared to install another vCenter server, so I started that process in parallel, talked a bit with HP/VMware support, then shut off the VM and rebooted again – no difference, it just sat there. So I rebooted into safe mode, removed the rest of the patches that were installed last night, and rebooted into normal mode; I must’ve waited 45 minutes or so for the system to boot. It did boot eventually, getting past that updates screen, but the system was still not working right – vCenter was hanging and I could not remote desktop in.

About 30 minutes after the system booted I was able to remote desktop in again – not sure why. I kept poking around, not making much progress. I decided to take a VM snapshot (I had not taken one originally, though in the grand scheme of things it wouldn’t have helped), re-install those patches, and let the system work through whatever it had to work through.

So I did that, and the system was still wonky.

I looked and looked – vCenter still hanging, nothing in the event log and nothing in the vpx vCenter log other than stupid status messages like

2012-08-08T01:08:01.186+01:00 [04220 warning 'VpxProfiler' opID=SWI-a5fd1c93] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:12.535+01:00 [04220 warning 'VpxProfiler' opID=SWI-12d43ef2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:23.884+01:00 [04356 warning 'VpxProfiler' opID=SWI-f6f6f576] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:35.234+01:00 [04220 warning 'VpxProfiler' opID=SWI-a928e16] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:46.583+01:00 [04220 warning 'VpxProfiler' opID=SWI-729134b2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:08:57.932+01:00 [04328 warning 'VpxProfiler' opID=SWI-a395e0af] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:09.281+01:00 [04220 warning 'VpxProfiler' opID=SWI-928de6d2] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:20.631+01:00 [04328 warning 'VpxProfiler' opID=SWI-7a5a8966] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:32.058+01:00 [04220 warning 'VpxProfiler' opID=SWI-524a7126] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:43.804+01:00 [04328 warning 'VpxProfiler' opID=SWI-140d23cf] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:09:55.551+01:00 [04356 warning 'VpxProfiler' opID=SWI-acadf68a] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:07.297+01:00 [04328 warning 'VpxProfiler' opID=SWI-e42316c] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:19.044+01:00 [04356 warning 'VpxProfiler' opID=SWI-3e976f5f] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms
2012-08-08T01:10:30.790+01:00 [04328 warning 'VpxProfiler' opID=SWI-2734f3ba] VpxUtil_InvokeWithOpId [TotalTime] took 12000 ms

No errors anywhere. I believe I looked at the tomcat logs a few times and there were no entries for today.

Finally I dug into the tomcat logs from last night and came across this –

Aug 6, 2012 11:27:30 PM com.vmware.vim.common.vdb.VdbODBCConfig isConnectableUrl
SEVERE: Unable to get a connection to: jdbc:oracle:thin:@//DB_SERVER:1521/DB_SERVER as username=VPXADMIN due to: ORA-28001: the password has expired

I had encountered a password expiry on my SYS account a few weeks ago, but didn’t really think much about it at the time. Anyway, I reset the password and vCenter was able to start. I disabled password expiry per this page (I have used Oracle 10g and a little of 8i/9i and never recall having password expiry issues), which says the defaults were changed in 11g and passwords now do expire (after 180 days by default).
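For reference, the fix itself boils down to two statements once you’re connected as SYSDBA. Here’s a minimal sketch using the cx_Oracle Python driver – the VPXADMIN username comes from the log above, but the passwords and connection string are placeholders, and I’m assuming the account is on the DEFAULT profile:

import cx_Oracle

# Connect as SYS with SYSDBA privileges; DSN and passwords are placeholders.
conn = cx_Oracle.connect("sys", "sys_password", "DB_SERVER:1521/DB_SERVER",
                         mode=cx_Oracle.SYSDBA)
cur = conn.cursor()

# 11g changed the DEFAULT profile to expire passwords after 180 days;
# UNLIMITED restores the old never-expire behavior.
cur.execute("ALTER PROFILE DEFAULT LIMIT PASSWORD_LIFE_TIME UNLIMITED")

# Un-expire the vCenter account by resetting its password.
cur.execute('ALTER USER VPXADMIN IDENTIFIED BY "new_password"')

conn.close()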

I have had vCenter fail to start because of DB issues in the past – in fact, because vCenter does not properly release its locks on the Oracle DB when it shuts down, the easiest workaround is to restart Oracle whenever I reboot the vCenter server (vCenter is the only thing on that Oracle instance, so it’s the simplest solution). When vCenter fails in this way it causes no issues for the rest of the OS – just an error message in the event log saying vCenter failed to start, and a helpful explanation as to why –

Unable to get exclusive access to vCenter repository.   Please check if another vCenter instance is running against the same database schema.
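That restart workaround is nothing fancy, by the way – just bounce the instance before the vCenter service comes up. A rough sketch, assuming it runs on the DB host with sqlplus on the PATH and OS authentication as SYSDBA (a convenience hack, not anything VMware endorses):

import subprocess

# Bounce Oracle to clear the stale locks vCenter left behind; the
# vCenter service can then be started cleanly afterwards.
subprocess.run(
    ["sqlplus", "-S", "/ as sysdba"],
    input="shutdown immediate\nstartup\nexit\n",
    text=True,
    check=True,
)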

What got me, even now, is how the hell an expired database password cascaded into Internet Explorer breaking, remote desktop breaking, Windows Update breaking, etc. My only guess is that vCenter was flooding the system with RPC messages, causing other things to break. Again – there was no evidence of any errors in the event log anywhere. I even called a friend who works at Microsoft and deploys hundreds of Windows servers for a living (he works as a lab manager), hoping he would have an idea. He said he had seen this behavior several times before but never tried to debug it – he just wiped the system and reinstalled. I was close to doing that today, but eventually found the solution, and I guess you could say I learned something in the process?

I don’t know.

I have not seriously used Windows since the NT4 days (I have used it casually on the desktop and in some server roles like this vCenter system). Why I stopped using it – well, there were many reasons; I suppose this was sort of a reminder. I’m not ready to move to the Linux vCenter appliance yet, it seems beta-ish – if I ever get to move to that appliance before I migrate to KVM (at some point, no rush). I have a very vague memory of experimenting one time on NT4, or maybe it was 3.51, where I decided to stop one or more of the RPC services to see what would happen. Havoc, of course. I noticed one of the services vCenter depends upon, the DCOM Server Process Launcher, seems similarly important in Windows 2008, though 2008 smartly does not allow you to stop it; I chuckled when I saw that the recovery action for a failure of this service is Restart the Computer. But in this case the service was running… I looked for errors from it in the event log as well and there were none.

August 1, 2012

Oracle loses 2nd major recent legal battle

Filed under: General — Nate @ 5:15 pm

Not long ago, Oracle lost the battle against Google’s Android, and now it seems they have lost the battle with HP on Itanium.

A California court has ruled that Oracle is contractually obligated to produce software for Hewlett-Packard’s Itanium-based servers and must continue to do so for as long as HP sells them.

That’s quite a ruling – for as long as HP sells them. That could be a while! Though I think a lot of the damage to Itanium is already done; all the uncertainty, I’m sure, prompted a bunch of customers to migrate to other platforms because they thought Oracle was gone. And I suspect it won’t stop here – I think customers will expect poor support on Itanium because Oracle is being forced to provide it kicking and screaming.

Couldn’t have happened to a nicer company (even though I am a long time fan of the Oracle DB itself..)

April 20, 2012

Oracle not afraid to leverage Intel architecture

Filed under: Storage — Nate @ 11:28 am

I have bitched and griped in the past about how some storage companies waste their customers’ time, money and resources by not leveraging the Intel/commodity CPU architecture that some of them tout so heavily.

Someone commented on here, in response to my HP SPC-2 results, pointing out that the new Oracle 7420 ZFS system has some new SPC-2 results that are very impressive, and today I stumbled upon an article from our friends at The Register about a similar 7420 system being tested in the SpecSFS benchmark with equally impressive results.

The main thing missing for me from the NFS results is a single file system (not just a global namespace, as NetApp tries to advertise, but truly a single file system) – oh, and of course the disclosure of costs with the test.

This 7420 system must be really new – when I went to investigate it recently, the product detail pages on Oracle’s own site were returning 404s, but now they work.

I’ll come right out and say it – I’ve always been a bit leery of the ZFS offerings as a true high-availability solution; I wrote a bit about this topic a while ago. Though that article focused mainly on people deploying ZFS on cheap crap hardware because they think they can make an equivalent enterprise offering by slapping some software on top of it.

I’m also a Nexenta customer for a very small installation (NAS only, back end is 3PAR). I know Nexenta and Oracle ZFS are worlds apart, but at least I am getting some sort of ZFS exposure. ZFS has a lot of really nice concepts; it’s just a matter of how well it works in practice.

For example, I was kind of shocked to learn that if a ZFS file system fills up completely, you can’t delete files off of it – in a copy-on-write file system, even a delete needs to allocate some space. I saw one post from a person saying they couldn’t even mount the file system because it was full. Recently I noticed on one of my Nexenta volumes a process that kicks in when a volume gets 50% full: it creates a quota’d file system on the volume, 100MB in size, so that when/if the volume fills up you can remove this reserved data and get access to your file system again. Quite a hack.
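If you want the same escape hatch on a plain ZFS box, it’s just a dataset with a reservation (so the space is truly held back from the pool) plus a quota (so nothing grows into it). A sketch of what I assume Nexenta is doing under the hood – pool and dataset names made up:

import subprocess

def zfs(*args):
    # Thin wrapper over the zfs CLI; raises if the command fails.
    subprocess.run(["zfs", *args], check=True)

# Carve out ~100MB while the pool is healthy. The reservation holds the
# space back from the pool; the quota keeps the dataset itself small.
zfs("create", "-o", "reservation=100M", "-o", "quota=100M", "tank/reserve")

# In an emergency (pool 100% full, deletes failing), give the space back:
# zfs("destroy", "tank/reserve")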

I’ve seen another thread or two about existing Sun ZFS customers who have gotten very frustrated with the lack of support Oracle has given them since Oracle took the helm.

ANYWAYS, back to the topic of exploiting x86-64 architecture. Look at this –

ZFS Storage array base specifications

Clearly Oracle is embracing the processing and memory power that is available to them, and I have to give them mad props for that – I wish other companies did the same; customers would be so much better off.

They are also keeping costs low (relative to the competition, anyway), which is equally impressive. Oracle, of course, is a company that probably likes to drive margins more than almost any other out there, so it is nice to see them doing this.

My main question is – what of Pillar? What kind of work is being done there? I haven’t noticed anything since Pillar went home to the Larry E. mothership. Is it just dying on the vine? Are these ZFS systems still not suitable for certain situations that Pillar is better at supporting?

Anyways, I can’t believe I’m writing about an Oracle solution twice in the same week, but these are both nice things to see come out of one of the bigger players out there.

April 10, 2012

Oracle first to release 10GbaseT as standard?

Filed under: Networking — Nate @ 2:21 pm

Sun has had some innovative x86-64 designs in the past, particularly on the AMD front. Of course Oracle dumped AMD a while back to focus on Intel, and despite that their market share continues to collapse (in good part, probably, because from what I recall they screwed over many of their partners by going direct with so many customers, among other things).

In any case they launched a new server lineup today, which otherwise is not really news – who uses Sun/Oracle x86-64 boxes anyway? But I thought it was noteworthy that it seems to include 4 x 10GbaseT ports on board as standard.

Rear of Sun Fire X4170 M3 Server

The Sun Fire X4170 M3 and the X4270 M3 systems both appear to have quad-port 10GbaseT on the motherboard. I haven’t heard of any other servers yet that have this as standard. Out of curiosity, if you know of others I’d be interested to hear who they are.

The data sheet is kind of confusing: it says the system has 4 onboard 10GbE ports, but then lists Four 100/1,000/10 Base-T Ethernet ports in the network section below. It used to be common to have 10/100/1000 BaseT, but after seeing the physical rear of the system it seems convincing that these really are 10GbaseT (presumably the data sheet means 100/1,000/10,000).

Nice goin’ Oracle.


November 14, 2011

Oracle throws in Xen virtualization towel?

Filed under: Virtualization — Nate @ 7:03 am

This just hit me a few seconds ago and it gave me something else to write about so here goes.

Oracle recently released Solaris 11, the first major rev to Solaris in many, many years. I remember using Solaris 10 back in 2005 – wow, it’s been a while!

They’re calling it the first cloud OS. I can’t say I really agree with that – vSphere, and even ESX before it, has been more cloudy than Solaris for many years, and remains so today.

While their Xen-based Oracle VM is still around, the focus with Solaris 11 clearly seems to be Solaris Zones, which, as far as I know, is a more advanced version of the same OS-level idea as User Mode Linux (which seems to be abandoned now?).

Zones and UML are nothing new – Zones first shipped more than six years ago. It’s certainly a different approach from a full hypervisor, so it has less overhead, but overall I believe it is an outdated approach to utility computing (using the term cloud computing makes me feel sick).

Oracle Solaris Zones virtualization scales up to hundreds of zones per physical node at a 15x lower overhead than VMware and without artificial limits on memory, network, CPU and storage resources.

It’s an interesting strategy, and a fairly unique one in today’s world, so it should give Oracle some differentiation. I have been following the Xen bandwagon off and on for many years and never felt it was a compelling platform short of a re-write. Red Hat, SuSE and several other open source folks have basically abandoned Xen at this point, and now it seems Oracle is shifting focus away from Xen as well.

I don’t see many new organizations gravitating towards Solaris Zones that aren’t already Solaris users (or at least have Solaris expertise in house) – if they haven’t switched by now…

New, integrated network virtualization allows customers to create high-performance, low-cost data center topologies within a single OS instance for ultimate flexibility, bandwidth control and observability.

The terms ultimate flexibility and single OS instance seem to be in conflict here.

The efficiency of modern hypervisors is to the point now where the overhead doesn’t matter in probably 98% of cases; the other 2% can be handled by running jobs on physical hardware. I still don’t believe I would run a hypervisor under workloads that are truly hardware bound – ones that really exploit the performance of the underlying hardware. Those are few and far between outside of specialist niches these days, though; I had one about a year and a half ago, but haven’t come across one since.


October 6, 2011

Fusion IO enhances MySQL performance further

Filed under: Storage — Nate @ 7:12 am

This seems pretty neat. Not long ago Fusion-io announced their first real product refresh in quite a while, which offers significantly enhanced performance.

Today I see another related article that goes into something more specific, from Data Center Knowledge –

Fusion-io also announced a new extension to its VSL (Virtual Storage Layer) software subsystem for conducting Atomic Writes in the popular MySQL open source database. Atomic Writes are an operation in which a processor can simultaneously write multiple independent storage sectors as a single storage transaction. This accelerates mySQL and gives new features powered by the flexibility of sophisticated flash architectures. With the new Atomic Writes extension, Fusion-io testing has observed 35 percent more transactions per second and a 2.5x improvement in performance predictability compared to conducting the same MySQL tests without the Atomic Writes feature.

I know that Facebook is a massive user of Fusion-io for their MySQL database farms; I suspect this feature was made for them! Though it can benefit everyone.

My only question would be: can this atomic write capability be used by MySQL when running through the ESX storage layer, or does there need to be more native access from the OS?

About the new product lines, from The Register –

The ioDrive 2 comes in SLC form with capacities of 400GB and 600GB. It can deliver 450,000 write IOPS working with 512-byte data blocks and 350,000 read IOPS. These are whopping great increases, 3.3 times faster for the write IOPS number, over the original ioDrive SLC model which did 135,000 write IOPS and 140,000 read IOPS. It delivered sequential data at 750-770MB/sec whereas the next-gen product does it at 1.5GB/sec, around two times faster.
[..]
All the products will ship in November. Prices start from $5,950

The cards aren’t available yet – I wonder how accurate those numbers will end up being. But in any case, even if they were over-inflated by a large amount, that’s still an amazing amount of I/O.

On a related note, I was just browsing the Fusion-io blog, which mentions this MySQL functionality as well, and saw that Fusion-io was/is showing off a beefy 8-way HP DL980 with 14 HP-branded IO accelerators at Oracle OpenWorld –

We’re running Oracle Enterprise Edition database version 11g Release 2 on a single eight processor HP ProLiant DL980 G7 system integrated with 14 Fusion ioMemory-based HP IO Accelerators, achieving performance of more than 600,000 IOPS with over 6GB/s bandwidth using a real world, read/write mixed workload.

[..]

the HP Data Accelerator Solution for Oracle is configured with up to 12TB of high performance flash[..]

After reading that I could not help but think how HP’s own Vertica, with its extremely optimized encoding and compression scheme, would run on such a beast. I mean, if you can get 10x compression out of the system (Vertica’s best-case real-world is 30:1, for reference), get a pair of these boxes (Vertica would mirror between the two) and you have upwards of 240TB of data to play with.

I say 240TB because of the way Vertica mirrors the data: it allows you to store the mirror in a different sort order, giving even faster access if you’re querying the data in different ways – so both copies are usable for queries, not just one. Who knows – with the compression you may be able to get much better than 10:1 depending on your data.
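The back-of-the-envelope math, with that 10:1 ratio being pure assumption:

# 12TB of flash per box comes from HP's quote above; the 10:1
# compression ratio is my assumption. Both mirror copies are
# queryable thanks to the per-copy sort orders, hence times two.
flash_per_box_tb = 12
compression_ratio = 10
logical_per_box_tb = flash_per_box_tb * compression_ratio  # 120
total_tb = logical_per_box_tb * 2                          # 240
print(total_tb)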

Vertica is so fast that you will probably end up CPU bound more than anything else – though 80 cores per server is quite a lot! The DL980 supports up to 16 PCI Express slots, so even with 14 cards that still leaves room for a couple of 10GigE ports and/or Fibre Channel or some other form of connectivity beyond what’s on the motherboard (which seems to have an optional dual-port 10GbE NIC).

With Vertica’s licensing (last I checked) starting in the tens of thousands of dollars per raw TB (before compression), it falls into the category where it makes sense to me to blow a ton of money on hardware to make it run the best it possibly can (same goes for Oracle, though Standard Edition to a lesser degree). Vertica is coming out with a Community Edition soon, which I believe is free. I don’t recall what the restrictions are; I think one of them was that it is limited to a single server, and I haven’t yet heard what the storage limits might be (I’d assume there would be some limit, maybe half a TB or something).

June 29, 2011

Oracle picks up Pillar

Filed under: Storage — Nate @ 1:56 pm

Most people have been expecting this for a long time, and have wondered why it didn’t happen sooner, given that Oracle ditched HDS as an OEM partner almost immediately after acquiring Sun.

I have read and heard over the past year that Oracle has, as a result, been for the most part destroyed in the storage market (with servers doing badly as well), since their Sun storage products just are not competitive. Many larger customers have been leaving for the likes of HP and IBM, who could offer the “one stop shop” for servers and storage (even before HP bought 3PAR, HP had – and still has – their OEM’d HDS equipment).

In some informal talks with HDS folks last year, they seemed quite happy that Oracle was no longer an OEM, saying the people over at Sun/Oracle weren’t competent enough to handle the HDS stuff (*cough* too complicated *cough*), so HDS just went direct with most of the customers that Oracle walked away from.

Finally someone at Oracle woke up and realized there still is – and will continue to be for some time – a big market for traditional SAN systems, far bigger than the market of customers willing to risk putting their data on cheap SATA controllers in servers running ZFS, with high failure rates and poor performance.

So it finally happened: Oracle is buying Pillar. At first look, however, it really does seem like an odd arrangement. From their SEC filing –

The Earn-Out therefore will only be paid to Mr. Ellison, his affiliates and, if applicable, to the other Pillar Data stockholders and option holders if the Net Revenues during Year 3 of the Earn-Out Period exceed the Net Losses, if any, during the entire Earn-Out Period.

There’s no specific mention of whether Larry is going to pay himself back for the $500M+ in loans he has given Pillar over the years, so I suppose not. In any case it won’t be until the end of 2014 that we might discover what value Oracle has placed on Pillar. One commenter on The Register puts Pillar’s revenue at $29M per year – I don’t know where that came from; doing some searching myself I found references ranging from roughly $70M in revenue to $3B (if that were the case they would have IPO’d).

I think it’s a good deal for Pillar too – they get much better validation of their products in front of customers.

I’ve gone through quite a bit of the information on the Pillar web site, and to date I have not seen anything that would make me want to buy their product; I have also yet to hear any positive words from the people I know in the street/industry (granted, my community is limited).

But it sure as hell beats anything Oracle has been offering their customers recently, and that alone may be enough to drive a decent amount of sales.

Pillar posted some updated SPC-1 numbers recently – a significant improvement over their original numbers, though nothing groundbreaking from a competitive standpoint.

In other news, two early social media giants have fallen – MySpace being acquired for $35M, and Friendster re-inventing itself as a gaming site with Facebook authentication. I’d bet the infrastructure behind MySpace is worth about $35M by itself – News Corp really wanted out!

December 9, 2010

Java fallout from Oracle acquisition intensifies

Filed under: News, Random Thought — Nate @ 1:51 pm

I was worried about this myself – almost a year ago to the day I raised my concerns about Oracle getting control of Java – and the fallout continues. Oracle already had BEA’s JRockit; it’s too bad they had to get Sun’s JVM too.

Apache seems to have withdrawn from most things related to Java today, according to our friends at The Register.

On Thursday, the ASF submitted its resignation from JCP’s Java Standard and Enterprise Edition (SE/EE) Executive Committee as a direct consequence of the Java Community Process (JCP) vote to approve Oracle’s roadmap for Java 7 and 8.

The ASF said it’s removing all official representatives from all JSRs and will refuse to renew its JCP membership and EC position.

Java was too important a technology to be put in the hands of Oracle.

Too bad..

November 6, 2010

The cool kids are using it

Filed under: Random Thought — Nate @ 8:24 pm

I just came across this video – an animated clip of a PHP web developer ranting to a psychologist about how stupid the entire Ruby movement is. It’s really funny.

I remember being in a similar situation a few years ago: the company had a Java application that drove almost all of its revenue (90%+), and a Perl application acquired from a ~2 person company that they were busy trying to re-write in Java.

Enter stage left: Ruby. At that point (sometime in 2006/2007) I honestly don’t think I had ever heard of Ruby before, but a bunch of the developers really seemed to like it – specifically the whole Ruby on Rails thing. We ran it on top of Apache with FastCGI. It really didn’t scale well at all (for fairly obvious reasons that are documented everywhere online). As time went on, the company lost more and more interest in the Java applications and wanted to do everything in Ruby. It was cool (for them). Fortunately scalability was never an issue for this company since they had almost no traffic: at their peak they had four web servers that, on average, peaked at about 30-35% CPU.

It was a headache for me because of all the modules they wanted installed on the system, and I was not about to use “gem install” for them (that is the “Ruby way”; I won’t install directly from CPAN either, BTW) – I wanted proper version-controlled RPMs. So I built them, for the five different operating platforms we supported at the time (CentOS 4 32/64-bit, CentOS 5 32/64-bit, and Fedora Core 4 32-bit – we were in transition to CentOS 5 32/64-bit). Looking back at my cfengine configuration file, there were a total of 108 packages I built while I was there to support them, and it wasn’t a quick task.

Then add to that the fact that they were running on top of Oracle (which is a fine database, IMO), mainly because that was what they already had running with their Java app. But using Oracle wasn’t the issue – the issue was that their Oracle database driver didn’t support bind variables. If you have spent time with Oracle you know this is a bad thing: without binds, every distinct query has to be hard-parsed. We used a hack that involved setting a per-session variable in the database to force bind variables to be enabled. This was OK most of the time, but it did cause major issues for a few months when a bad query got into the system, caused the execution plans to get out of whack, and triggered massive latch contention. The fastest way to recover the system was to restart Oracle. The developers, and my boss at the time, were convinced it was a bug in Oracle; I was convinced it was not, because I had seen latch contention in action several times in the past. After a lot of debugging of the app and the database, in consultation with our DBA consultants, they figured out what the problem was – bad queries being issued from the app. Oracle was doing exactly what it was told to do, even if that meant causing a big outage. Latch contention is one of the performance limits of Oracle that you cannot solve by adding more hardware. It can seem like you could at first, because the symptoms are throughput dropping to the floor and CPUs going to 100% instantly.
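I don’t remember the exact knob we set, but I assume it was CURSOR_SHARING, which is the usual fix for an app that sends literal-laden SQL: FORCE makes Oracle rewrite the literals into binds so the statements share one parsed cursor. A sketch of the session-level version from Python/cx_Oracle (credentials and the table are placeholders – ours was actually wired into the Ruby driver’s connection setup):

import cx_Oracle

conn = cx_Oracle.connect("app_user", "app_password", "dbhost:1521/orcl")
cur = conn.cursor()

# Force literal replacement so queries that differ only in their
# literal values share a single parsed cursor/execution plan.
cur.execute("ALTER SESSION SET cursor_sharing = FORCE")

# These two now hard-parse once between them instead of once each:
cur.execute("SELECT * FROM orders WHERE id = 1")
cur.execute("SELECT * FROM orders WHERE id = 2")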

At one point, to try to improve performance and get rid of memory leaks, I migrated the Ruby apps from FastCGI to mod_fcgid, which has a built-in ability to automatically restart its worker processes after they have served some number of requests. This worked out great and really helped operations. I don’t recall if it had any real impact on performance, but with the memory leaks no longer a concern, that was one less thing to worry about.

Then one day we got in some shiny new HP DL380 G5s – dual quad-core processors with either 8 or 16GB of memory. Very powerful, very nice servers for the time. So what was the first thing I tried? 64-bit, to take better advantage of the larger amount of memory. I compiled our Ruby modules for 64-bit, installed 64-bit CentOS (5.2, I think, at the time – the other production web servers were running CentOS 5.2 32-bit), installed 64-bit Ruby, etc. Launched the apps; from a functional perspective they worked fine. But from a practical perspective it was worthless. I enabled the web server in production and it immediately started gagging on its own blood – load shot through the roof, requests were slow as hell. So I disabled it and things returned to normal. Tried that a few more times and ended up giving up – went back to 32-bit. The 32-bit system could handle 10x the traffic of the 64-bit system. I never found out what the issue was before I left the company.

From an operational perspective, my own personal preference for web apps is to run Java. I’m used to running Tomcat myself, but really the container matters less to me. I like WAR files – they make deployment so simple. And in the WebLogic world I liked EAR files (I suspect they’re not WebLogic specific; it’s just the only place I’ve ever used them). One archive file that has everything you need built into it – any extra modules etc. are all there. I don’t have to compile anything: install a JVM, install a container, and drop a single file to run the application. OK, maybe some applications have a few config files (one I used to manage had literally several hundred XML config files – poor design, of course).

Maybe it’s not cool anymore to run Java, I don’t know. But seeing this video reminded me of those days when I did have to support Ruby on production and pre-production systems – it wasn’t fun, or cool.
