Diggin' technology every day


Facebook going in strong with Vertica

TechOps Guy: Nate

Came across this job post yesterday, thought it was interesting, somewhat on the heels of Facebook becoming a Vertica customer.

I still find it interesting at least given Facebook's history of in house solutions.

From the job posting

Do you like working with massive MPP databases? Are you interested in building one of the largest MPP data warehouses in the world? If yes, we want to talk to you. Facebook is seeking a Database Engineer to join the IT Engineering Infrastructure team to build the largest Vertica data warehouse in the world.


Facebook deploying HP Vertica

TechOps Guy: Nate

I found this interesting. Facebook - a company that designs their own servers (in custom racks no less), writes their own software, does fancy stuff in PHP to make it scale, is a big user of Hadoop, runs massive horizontally scaled, sharded MySQL systems, and has developed an exabyte scale query engine - is going to be deploying HP Vertica as part of their big data infrastructure.

Apparently announced at HP Discover

“Data is incredibly important: it provides the opportunity to create new product enhancements, business insights, and a significant competitive advantage by leveraging the assets companies already have. At Facebook, we move incredibly fast. It’s important for us to be able to handle massive amounts of data in a respectful way without compromising speed, which is why HP Vertica is such a perfect fit.”

Not much else to report on, just thought it was interesting given all the stuff Facebook tries to do on its own.


When a server is a converged solution

TechOps Guy: Nate

Thought this was kind of funny/odd/ironic/etc...

I got an email a few minutes ago about the HP AppSystem for Vertica, which, among other things, HP describes like this:

This solution delivers system performance and reduces implementation from months to hours.

I imagine they are referring to competing solutions and not comparing to running Vertica on bare metal. In fact it may be kind of misleading as Vertica is very open - you can run it on physical hardware (any hardware really), virtual hardware, and even some cloud services (it is supported in *shudder*Amazon even..). So you can get implementation of a basic Vertica system without buying anything new.

But if you are past the evaluation stage, and perhaps outgrew your initial deployment and want to grow into something more formal/dedicated, then you may need some new hardware.

HP pitches this as a Converged Solution. So I was sort of curious: what HP solutions are they converging here?

Basically it's just a couple base configurations of HP DL380G8s with internal storage (these 2U servers support up to 25 2.5" disks).  They don't even install Vertica for you --

HP Vertica Analytics Platform software installation is well documented and can be installed by customers.

They are kind enough to install the operating system though (no mention of any special tuning, other than they say it is "Standard" so I guess no tuning).

No networking is included (outside of the servers as far as I can tell), and the only storage is the internal DAS. A minimum of three servers is required, so some sort of 10GbE switches are required too (since the servers are 10GbE, though you can run Vertica fine on 1GbE for smaller data sets).

I would have expected the system to come with Vertica pre-installed, or automatically installed as part of setup, with a trial license built into the system.

Vertica is very easy to install and configure the basics, so in the grand scheme of things this AppSystem might save the average Vertica customer a few minutes.

Vertica is normally licensed by the amount of data stored in the cluster (pre-compression / encoding). The node count, CPU count, memory and spindles don't matter. There is a community edition that goes up to 3 nodes and 1TB (it has some other software limitations - and as far as I know there is no direct migration path from community to enterprise without data export/import).

Don't get me wrong, I think this is a great solution: a very solid server with a lot of memory and plenty of I/O to provide a very powerful Vertica experience. Vertica's design reduces I/O requirements by up to ~90% in some cases, so you'd probably be shocked at the amount of performance you'd get out of just one of these 3 node clusters, even without any tuning at the Vertica level.

Vertica does not require a fancy storage system, it's really built with DAS in mind. Though I know there are bunches of customers out there that run it on big fancy storage because they like the extra level of reliability/availability.

I just thought it was kind of strange some of the marketing behind it, saving months of time, being converged infrastructure and what not..

It makes me think (if I had not installed Vertica clusters before) that if I want Vertica and don't get this AppSystem then I am in a world of hurt when it comes to setting Vertica up. Which is not the right message to send.

Here is this wonderful AppSystem that is in fact -- just a server with RHEL installed.

For some reason I expected more.


Big pop in Tableau IPO

TechOps Guy: Nate

I was first introduced to Tableau (and Vertica) a couple of years ago at a local event in Seattle. Both products really blew me away (and still do to this day). Though it's not an area I spend a lot of time in - my brain struggles with anything analytics related (even when using Tableau, same goes for Splunk, or SQL). I just can't make the connections. When I come across crazy Splunk queries that people write I just stare at them for a while in wonder (as in I can't possibly imagine how someone could have come up with such a query, even after working with Splunk for the past six years).. then I copy+paste it and hope it works.

Sample Tableau reports pulled from Google Images

But that doesn't stop me from seeing an awesome combination that is truly ground breaking both in performance and ease of use.

I've seen people try to use Tableau with MySQL for example and they fairly quickly give up in frustration at how slow it is. I remember being told that Tableau used to get a bunch of complaints from users years ago saying how slow it seemed to be -- but it really wasn't Tableau's fault it was the slow back end data store.

Vertica unlocks Tableau's potential by providing a jet engine to run your queries against. Millions of rows? Hundreds of millions? No problem.. Billions? It'll take a bit longer but shouldn't be an issue either. Try that with most other back ends and, well, you'll be waiting there for days if not weeks.

Tableau is a new generation of data visualization technology that is really targeted at the Excel crowd. It can read in data from practically anything (Excel files included), and it provides a seamless way to analyze your data and produce fancy charts and graphs, tables and maps..

It's not really for the hard core power users who want to write custom queries. Though I still think it is useful for those folks. A great use case for Tableau is for the business users to play around with it, and come up with the reports that they find useful, then the data warehouse people can take those results and optimize the warehouse for those types of queries (if required). It's a lot simpler and faster than the alternative..

I remember two years ago I was working with a data warehouse guy at a company and we were testing Tableau with MySQL at the time actually (small tables). Just playing around, he poked at it, created some basic graphs and drilled down into them. In all we spent about 5 minutes on this task and we found some interesting information. He said if he had to do that in MySQL queries himself it would have taken him roughly two days, running query after query and then building new queries based on the results etc. From two days to roughly five minutes -- for a very experienced SQL/data warehouse person.

Tableau has a server component as well, to which you can publish your reports for others to see with a web browser or mobile device. The server can also of course link directly to your data to get updates as frequently as you want them.

You can have profiles and policies, one example Tableau gave me last year was one big customer enforces certain color codes across their organization so no matter what they are looking at they know Blue means X and Orange means Y. This is enforced at the server level, so it's not something people have to worry about remembering. They can also enforce policies around reporting so that the term "XYZ" is always the result of "this+that", so people get consistent results every time -- not a situation where someone interprets something one way, and another person another way. Again this is enforced at the server level, reducing the need for double checking and additional training.

They also have APIs - and users are able to embed Tableau reports directly into their applications and web sites (through the server component). I know one organization where almost all of their customer reporting is presented with Tableau - I'm sure it saved them a ton of time trying to replicate the behavior in their own code. I've seen folks try to write reporting UIs in past companies and usually what comes out is significantly sub par because it's a complicated thing to get right. Tableau makes it easy, and is probably very cost effective relative to full time developers taking months/years to try to do it themselves.

It's one of the few products out there that I am really excited about, and I've seen some amazing stuff done with the software in a very minimal amount of time.

Tableau has a 15 day evaluation period if you want to try it out -- it really should be more, but whatever.  Vertica has a community edition which you can use as a sort of long term evaluation - it's limited to 1TB of data and 3 cluster nodes. You can get a full fledged enterprise evaluation from Vertica as well if you want to test all of the features.

I wrote some scripts at my current company to refresh/import about 150GB of data from our MySQL systems to Vertica each night. It is interesting to see MySQL struggle to read the data out while Vertica is practically idle as it ingests it (I'd otherwise normally think that writing the data would be more intensive than reading it). In order to improve performance I compiled a few custom MySQL binaries that allowed me to run MySQL queries and pipe the results directly into Vertica (instead of writing 10s of GBs to disk only to read it back again). The custom binaries were needed because MySQL by default only supports tab delimited results, which was not sufficient for this data set (I actually compiled 3-4 different binaries with different delimiters depending on the tables - managed to get ~99.99999999% of the rows in without further effort). I also wrote a quick perl script to fix some of the invalid data, like invalid time stamps, which MySQL happily allows but Vertica does not.

Sample command:

$MYSQL --raw --batch --quick --skip-column-names -u $DB_USERNAME --password="${DB_PASSWORD}" --host=${DB_SERVER} $SOURCE_DBNAME -e "select * from $MY_TABLE" | $DATA_FIX | vsql -w $VERTICA_PASSWORD -c "COPY ${VERTICA_SCHEMA}.${MY_TABLE} FROM STDIN DELIMITER '|' RECORD TERMINATOR '##' NULL AS 'NULL' DIRECT"
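The $DATA_FIX step isn't shown in that command. As a rough sketch of what a filter like that could look like (in Python rather than perl, so this is not the original script), assuming the '|' delimiter and '##' record terminator from the COPY statement, and assuming the bad values are MySQL "zero" dates such as 0000-00-00 00:00:00, something along these lines would swap them for the NULL marker. The function and constant names are made up for illustration:

```python
import sys

FIELD_SEP = "|"       # matches DELIMITER '|' in the COPY statement
RECORD_SEP = "##"     # matches RECORD TERMINATOR '##'
NULL_MARKER = "NULL"  # matches NULL AS 'NULL'

# Assumption: the invalid values are MySQL "zero" dates/timestamps.
ZERO_VALUES = {"0000-00-00", "0000-00-00 00:00:00"}


def fix_record(record):
    """Replace zero dates in a single record with the NULL marker."""
    fields = record.split(FIELD_SEP)
    fixed = [NULL_MARKER if f in ZERO_VALUES else f for f in fields]
    return FIELD_SEP.join(fixed)


def fix_stream(data):
    """Apply fix_record to every record in a '##'-terminated string."""
    return RECORD_SEP.join(fix_record(r) for r in data.split(RECORD_SEP))


def main():
    # Reads everything into memory: fine for a sketch, not for 10s of GBs.
    sys.stdout.write(fix_stream(sys.stdin.read()))

# To use this as a pipe stage, call main() at the bottom of the script.
```

A real version would want to stream record by record instead of reading all of stdin at once, and would need to handle field values that happen to contain the delimiter.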


Oh and back to the topic of the post - Tableau IPO'd today (ticker is DATA) - as of last check it is up 55%.

So, congrats Tableau on a great IPO!



Fusion IO enhances MySQL performance further

TechOps Guy: Nate

This seems pretty neat. Not long ago Fusion IO announced their first new real product refresh in quite a while which offers significantly enhanced performance.

Today I see another related article that goes into something more specific, from Data Center Knowledge -

Fusion-io also announced a new extension to its VSL (Virtual Storage Layer) software subsystem for conducting Atomic Writes in the popular MySQL open source database. Atomic Writes are an operation in which a processor can simultaneously write multiple independent storage sectors as a single storage transaction. This accelerates mySQL and gives new features powered by the flexibility of sophisticated flash architectures. With the new Atomic Writes extension, Fusion-io testing has observed 35 percent more transactions per second and a 2.5x improvement in performance predictability compared to conducting the same MySQL tests without the Atomic Writes feature.

I know that Facebook is a massive user of Fusion IO for their MySQL database farms, so I suspect this feature was made for them! Though it can benefit everyone.

My only question would be whether this Atomic Write capability can be used by MySQL when running through the ESX storage layer, or whether there needs to be more native access from the OS.

About the new product lines, from The Register -

The ioDrive 2 comes in SLC form with capacities of 400GB and 600GB. It can deliver 450,000 write IOPS working with 512-byte data blocks and 350,000 read IOPS. These are whopping great increases, 3.3 times faster for the write IOPS number, over the original ioDrive SLC model which did 135,000 write IOPS and 140,000 read IOPS. It delivered sequential data at 750-770MB/sec whereas the next-gen product does it at 1.5GB/sec, around two times faster.
All the products will ship in November. Prices start from $5,950

The cards aren't available yet, so I wonder how accurate those numbers will end up being. But in any case, even if they were over inflated by a large amount, that's still an amazing amount of I/O.

On a related note I was just browsing the Fusion IO blog which mentions this MySQL functionality as well and saw that Fusion IO was/is showing off a beefy 8-way HP DL980 with 14 HP-branded IO accelerators at Oracle Openworld -

We're running Oracle Enterprise Edition database version 11g Release 2 on a single eight processor HP ProLiant DL980 G7 system integrated with 14 Fusion ioMemory-based HP IO Accelerators, achieving performance of more than 600,000 IOPS with over 6GB/s bandwidth using a real world, read/write mixed workload.


the HP Data Accelerator Solution for Oracle is configured with up to 12TB of high performance flash[..]

After reading that I could not help but think how HP's own Vertica, with its extremely optimized encoding and compression scheme, would run on such a beast. I mean if you can get 10x compression out of the system (Vertica's best-case real world is 30:1 for reference), get a pair of these boxes (Vertica would mirror between the two) and you have upwards of 240TB of data to play with.

I say 240TB because the way Vertica mirrors the data allows you to store it in a different sort order on the mirror, allowing for even faster access if you're querying the data in different ways. Who knows - with the compression you may be able to get much better than 10:1 depending on your data.
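To spell out the math behind that 240TB figure, here is a back-of-the-envelope sketch. The 12TB per box comes from the HP quote, the 10:1 ratio is just the assumed number from above, and the variable names are made up for illustration:

```python
FLASH_TB_PER_SERVER = 12   # "up to 12TB" of flash per DL980, from the HP quote
SERVERS = 2                # a mirrored pair of boxes
COMPRESSION_RATIO = 10     # assumed 10:1; real ratios depend on the data

raw_tb = FLASH_TB_PER_SERVER * SERVERS
stored_tb = raw_tb * COMPRESSION_RATIO   # counting every copy as usable
unique_tb = stored_tb // SERVERS         # if every row lives on both boxes

print(f"raw flash across the pair: {raw_tb} TB")
print(f"data stored, counting both copies: {stored_tb} TB")
print(f"unique data, with each row stored twice: {unique_tb} TB")
```

So 240TB counts both sorted copies as data you can query, while the amount of distinct data in a fully mirrored pair would be half that.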

Vertica is so fast that you will probably end up CPU bound more than anything else - 80 cores per server is quite a lot though! The DL980 supports up to 16 PCI Express slots so even with 14 cards that still leaves room for a couple 10GigE ports and/or Fibre channel or some other form of connectivity other than what's on the motherboard (which seems to have an optional dual port 10GbE NIC)

With Vertica's licensing (last I checked) starting in the 10s of thousands of dollars per raw TB (before compression), it falls into the category where I'd blow a ton of money on hardware to make it run the best it possibly can (same goes for Oracle - though Standard Edition to a lesser degree). Vertica is coming out with a Community Edition soon which I believe is free. I don't recall what the restrictions are; I think one of them was that it was limited to a single server, and I don't recall hearing yet what the storage limits might be (I'd assume there would be some limit, maybe half a TB or something).


So easy it could be a toy, but it’s not

TechOps Guy: Nate

I was at a little event thrown for the Vertica column-based database, as well as Tableau Software, a Seattle-based data visualization company. Vertica was recently acquired by HP for an undisclosed sum. I had not heard of Tableau until today.

I went in not really knowing what to expect, have heard good things about Vertica from my friend over there but it's really not an area I have much expertise in.

I left with my mouth on the floor. I mean holy crap that combination looks wicked. Combining the world's fastest column based data warehouse with a data visualization tool that is so easy some of my past managers could even run it. I really don't have words to describe it.

I never really considered Vertica for storing IT-related data, and they brought up a case study with one of their bigger customers - Comcast, who sends more than 65,000 events a second into a Vertica database (including logs, SNMP traps and other data). Hundreds of terabytes of data with sub second query response times. I don't know if they use Tableau Software's products or not. But there was a good use case for storing IT data in Vertica.

(from Comcast case study)

The test included a snapshot of their application running on a five-node cluster of inexpensive servers with 4 CPU AMD 2.6 GHz core processors with 64-bit 1 MB cache; 8 GB RAM; and ~750 GBs of usable space in a RAID- 5 configuration.
To stress-test Vertica, the team pushed the average insert rate to 65K samples per second; Vertica delivered millisecond-level performance for several different query types, including search, resolve and accessing two days’ worth of data. CPU usage was about 9%, with a fluctuation of +/- 3%, and disk utilization was 12% with spikes up to 25%.
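For a sense of scale, here is what that insert rate works out to if it were sustained around the clock (a quick sketch; treating 65K per second as a constant rate is my simplification, not something the case study states):

```python
SAMPLES_PER_SECOND = 65_000        # average insert rate from the case study
SECONDS_PER_DAY = 24 * 60 * 60

rows_per_day = SAMPLES_PER_SECOND * SECONDS_PER_DAY
rows_two_days = 2 * rows_per_day   # the "two days' worth of data" query window

print(f"{rows_per_day:,} rows per day")
print(f"{rows_two_days:,} rows in two days")
```

That is about 5.6 billion rows a day, so a query over two days of data is touching something over 11 billion rows.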

That configuration could of course easily fit on a single server. How about a 48-core Opteron with 256GB of memory and some 3PAR storage or something? Or maybe a DL385G7 with 24 cores, 192GB memory (24x8GB), and 16x500GB 10k RPM SAS disks with RAID 5 and dual SAS controllers with 1GB of flash-backed cache (1 controller per 8 disks). Maybe throw some Fusion IO in there too?

Now I suspect that there will be additional overhead with trying to feed IT data into a Vertica database since you probably have to format it in some way.

Another really cool feature of Vertica - all of its data is mirrored at least once to another server. Nothing special about that, right? Well they go one step further: they give you the ability to store the data pre-sorted in two different ways, so mirror #1 may be sorted by one field, and mirror #2 is sorted by another field, maximizing use of every copy of the data while maintaining data integrity.
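A toy illustration of why that helps (this is only the general idea in Python, not how Vertica actually implements it, and the rows and function names are made up): keep the same rows in two copies, each sorted by a different field, and send each query to whichever copy is already in the right order so it can binary search instead of scanning.

```python
from bisect import bisect_left, bisect_right

# Hypothetical event rows: (host, timestamp, value)
rows = [
    ("web02", 1005, 17),
    ("db01", 1001, 42),
    ("web01", 1003, 8),
    ("db01", 1004, 40),
    ("web01", 1002, 9),
]

# Copy #1 is kept sorted by host, copy #2 by timestamp.
by_host = sorted(rows, key=lambda r: r[0])
by_time = sorted(rows, key=lambda r: r[1])

# Sort keys pulled out once so bisect can search them.
host_keys = [r[0] for r in by_host]
time_keys = [r[1] for r in by_time]


def rows_for_host(host):
    """Answer a per-host query from the host-sorted copy."""
    lo = bisect_left(host_keys, host)
    hi = bisect_right(host_keys, host)
    return by_host[lo:hi]


def rows_in_window(start, end):
    """Answer a time-range query (inclusive) from the time-sorted copy."""
    lo = bisect_left(time_keys, start)
    hi = bisect_right(time_keys, end)
    return by_time[lo:hi]


print(rows_for_host("db01"))
print(rows_in_window(1002, 1004))
```

Both copies hold exactly the same rows, so either one can still rebuild the other if a server is lost.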

Something that Tableau did really well that was cool was that you don't need to know how you want to present your data, you just drag stuff around and it will try to make intelligent decisions on how to represent it. It's amazingly flexible.

Tableau does something else well: there is no language to learn. You don't need to know SQL, you don't need to know custom commands to do things, and the guy giving the presentation basically never touched his keyboard. He published some really kick ass reports to the web in a matter of seconds, fully interactive, where users could click on something and drill down really easily and quickly.

This is all with the caveat that I don't know how complicated it might be to get the data into the database in the first place.

Maybe there are other products out there that are as easy to use and stuff as Tableau I don't know as it's not a space I spend much time looking at. But this combination looks incredibly exciting.

Both products have fully functional free evaluation versions available to download on the respective sites.

Vertica licensing is based on the amount of data that is stored (I assume regardless of the number of copies stored but haven't investigated too much), no per-user, no per-node, no per-cpu licensing. If you want more performance, add more servers or whatever and you don't pay anything more. Vertica automatically re-balances the cluster as you add more servers.

Tableau is licensed as far as I know on a named-user basis or a per-server basis.

Both products are happily supported in VMware environments.

This blog entry really does not do the presentation justice. I don't have the words for how cool this stuff was to see in action. There aren't a lot of products or technologies that I get this excited about, but these have shot to near the top of my list.

Time to throw your Hadoop out the window and go with Vertica.


Vertica snatched by HP

TechOps Guy: Nate

Funny timing! One of my friends who used to work for 3PAR left not long after HP completed the acquisition and went to Vertica, which is a scale out, column-based, distributed high performance database. Certainly not an area I am well versed in, but I got a bit of info a couple weeks ago and the performance numbers are just outstanding - the kind of performance gains that you really probably have to see to believe. Fortunately for users their software is free to download, and it sounds like it is easy to get up and running (I have no personal experience with it, but would like to see it in action at some point soon). Performance gains of up to 10,000% are not uncommon vs traditional databases.

It really sounds like an awesome product that can do more real time analysis on large amounts of data (from a few gigs to over a Petabyte). Something that Hadoop users out there should take notice of. If you recall last year I wrote a bit about organizations I have talked to that were trying to do real time with Hadoop with (most likely) disastrous results. It's not built for that, never was, which is why Google abandoned it (well, not Hadoop, since they never used the thing, but MapReduce technology in general, at least as far as their search index is concerned; they may use it for other things). Vertica is unique in that it is the only product of its kind in the world that has a software connector that can connect Hadoop to Vertica. Quite a market opportunity. Of course a lot of the PHB-types are attracted to Hadoop because it is a buzzword and because it's free. They'll find out the hard way that it's not the holy grail they thought it was going to be and go to something like Vertica kicking and screaming.

So back to my friend, he's back at HP again, he just couldn't quite escape the gravitational pull that was HP.

Also somewhat funny as it wasn't very long ago that HP announced a partnership with Microsoft to do data warehousing applications. Sort of reminds me when NetApp tried to go after Data Domain, mere days before they announced their bid they put out a press release saying how good their dedupe was..

Oh and here's the news article from our friends at The Register.

The database runs in parallel across multiple machines, but has a shared-nothing architecture, so the query is routed to the data and runs locally. And the data for each column is stored in main memory, so a query can run anywhere from 50 to 1,000 times faster than a traditional data warehouse and its disk-based I/O – according to Vertica.

The Vertica Analytics Database went from project to commercial status very quickly – in under a year – and has been available for more than five years. In addition to real-time query functions, the Vertica product continuously loads data from production databases, so any queries done on the data sets is up to date. The data chunks are also replicated around the x64-based cluster for high availability and load balancing for queries. Data compression is heavily used to speed up data transfers and reduce the footprint of a relational database, something on the order of a 5X to 10X compression.
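Those compression numbers make more sense once you see what sorting does to a single column. Here is a toy run-length encoding sketch in Python (just the general idea behind column encodings; I'm not claiming this is how Vertica implements it, and the sample column is made up):

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the full column."""
    return [value for value, count in pairs for _ in range(count)]


# A made-up low-cardinality column with 8,000 values.
status = ["ok", "error", "ok", "ok", "warn", "ok", "error", "ok"] * 1000

unsorted_runs = rle_encode(status)
sorted_runs = rle_encode(sorted(status))

print(len(status), "values")
print(len(unsorted_runs), "runs in arrival order")
print(len(sorted_runs), "runs once the column is sorted")
```

Stored in arrival order the column barely compresses, but once it is sorted the whole thing collapses to one run per distinct value, which is a big part of why a sorted column store has so much less to read off disk.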

Vertica's front page now has a picture of a c Class blade enclosure. Just think of what you can analyze with an enclosure filled with 384 x 2.3GHz Opteron 6100 cores (which were released today as well, and HP announced support for them on my favorite BL685c G7), and 4TB of memory all squeezed into 10U of space.

If you're in the market for a data warehouse / BI platform of sorts, I urge you to at least see what Vertica has to offer. It really does seem revolutionary, and they make it easy enough to use that you don't need an army of PhDs to design and build it yourself (i.e. Google).

Speakin' of HP, I did look at what the new Palm stuff will be and I'm pretty excited, I just wish it was going to get here sooner. I went out and bought a new phone in the interim until I can get my hands on the Pre 3 and the Touchpad. My Pre 1 was not even on its last legs, it was in a wheelchair with an oxygen bottle. New phone isn't anything fancy, just a feature phone, but it does have one thing I'm not used to having: battery life. The damn thing can easily go 3 days and the battery doesn't even go down by 1 bar. And I have heard from folks that the Pre 3 will be available on Sprint, which makes me happy as a Sprint customer. Still didn't take a chance and extend my contract just in case that changes.
