[rtg] Results with RTG

Matt Provost mprovost at termcap.net
Mon Mar 9 22:40:33 EDT 2009


I've been reading the whole thread but I thought I'd come back here to
respond...

On Mon, Mar 09, 2009 at 02:58:52AM +0100, Harry Marcson wrote:
> We are looking to integrate RTG into our system, because it seems to be the
> only option for our setup. We are looking to poll about 100 switches,
> connecting about 2000 servers.
> 
> Our current RTG testing results for 1 switch polling showed us that RTG:
> 
> 1- Seems to have memory leaks. That's not news after reading through the mailing
> list, but I was wondering how people are surviving with this, especially those
> with bigger setups. Did you apply any specific patch, such as Yahoo RTG
> (yrtg), or your own fixes?

I've got a new version where I believe I've patched all the memory
leaks: valgrind can't find any, and my testing shows that it stays
extremely stable for weeks.

> 2- Has really boring and bad-looking graphs. Good graphs would be ones like
> those from Cacti, for example. If anyone has a better rtgplot.cgi or improved
> graphing code, please do share it!

I've actually pulled all of the graphing code out. These days I don't
think it makes any sense to build graphs in C using something as basic
as GD; it's just too slow to develop. Much better graph libraries are
available now. I've been meaning to whip something up using the free
Google graphs, which look great.

One of the strengths of RTG is that it uses a database as a backend, so
it's very easy to write tools that pull data out. The amount of code
needed to pull some numbers out of a database and shove them into a graph
is pretty tiny. So I'd rather see lots of custom solutions to this than
a single output format like RRDtool's.
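To illustrate how little code that takes, here is a minimal Python sketch using the built-in sqlite3 module in place of a real RTG database. The table name `ifInOctets_1` and the columns `id`, `dtime`, `counter` are assumptions loosely modelled on RTG's per-interface tables, with `dtime` simplified to Unix seconds and `counter` assumed to hold the octet delta since the previous poll. It turns consecutive rows into bits-per-second points that any graphing library could plot.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical RTG-like layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ifInOctets_1 (id INTEGER, dtime INTEGER, counter INTEGER)"
    )
    # Assumed: counter = octets seen since the previous poll, 300 s apart.
    rows = [(1, 0, 0), (1, 300, 37_500_000), (1, 600, 75_000_000)]
    conn.executemany("INSERT INTO ifInOctets_1 VALUES (?, ?, ?)", rows)
    return conn


def rate_series(conn: sqlite3.Connection, table: str, iface_id: int):
    """Return (timestamp, bits per second) pairs from consecutive delta rows.

    The table name is interpolated directly, so it must come from trusted
    configuration, never from user input.
    """
    cur = conn.execute(
        f"SELECT dtime, counter FROM {table} WHERE id = ? ORDER BY dtime",
        (iface_id,),
    )
    points = cur.fetchall()
    series = []
    # Each delta covers the interval since the previous row's timestamp.
    for (t_prev, _), (t, delta) in zip(points, points[1:]):
        interval = t - t_prev
        if interval > 0:
            series.append((t, delta * 8 / interval))
    return series
```

With the demo data, `rate_series(build_demo_db(), "ifInOctets_1", 1)` yields points at 1 Mbps and 2 Mbps, ready to hand to whatever chart library you prefer.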

> 3- Bugs in the graphs. A simple example is a server that got an 800Mbps DDoS
> attack and was null-routed. RTG did not drop the line to near 0Mbps when it
> crashed. The end result is that, once the server was restored a few hours later,
> the graph line went down to the actual usage of about 10Mbps, but showed
> 800Mbps usage during the null route.

As someone pointed out, you can record zero values. But maybe it is
worth recording a single zero value if the previous value was nonzero,
which would fix your issue. I'll take a look at that.
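A minimal Python sketch of that rule, assuming the poller computes a per-interface delta each cycle and asks a filter whether to insert it. `ZeroEdgeFilter` and the interface key format are hypothetical, not part of RTG.

```python
from typing import Dict, Hashable


class ZeroEdgeFilter:
    """Decide which polled deltas get written to the database (hypothetical).

    Non-zero deltas are always recorded. A zero delta is recorded only when
    the previous recorded value for the same interface was non-zero, so the
    graph falls to zero once when traffic stops, instead of drawing a
    straight line from the last busy sample to the next one.
    """

    def __init__(self) -> None:
        # Per-interface flag: was the last recorded value non-zero?
        self._prev_nonzero: Dict[Hashable, bool] = {}

    def should_record(self, iface: Hashable, delta: int) -> bool:
        if delta != 0:
            self._prev_nonzero[iface] = True
            return True
        if self._prev_nonzero.get(iface, False):
            # First zero after activity: record it once, then stay quiet.
            self._prev_nonzero[iface] = False
            return True
        return False
```

For a sequence of deltas `0, 5, 7, 0, 0, 0, 3` on one interface, only the first zero after the busy period is stored, so the table stays nearly as small as with zeros disabled.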

> 4- Makes matching the actual switch port with the RTG device id a bit of a
> hassle. Has anyone come up with a solution for that?
> 5- Lacks automation when it comes to discovering new devices. Adding switch IPs
> is easy, but running rtg targetmaker.pl every time a new server is added to a
> switch, as well as restarting the RTG process, is more difficult.

I plan on moving the entire config into the db. This should make it a
lot easier to write tools that can add and remove devices and keep track
of which ones are which. But I haven't even started on this...
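As a rough illustration of what a db-backed config might look like, here is a hypothetical schema (not anything in RTG) sketched with Python's built-in sqlite3 in place of RTG's MySQL backend. A single `targets` table keyed on switch and SNMP ifIndex records which server sits on which port, which would also help with matching switch ports to device ids. The table, column, and function names are all assumptions.

```python
import sqlite3

# Hypothetical targets table; one row per polled switch port.
SCHEMA = """
CREATE TABLE targets (
    id      INTEGER PRIMARY KEY,
    switch  TEXT NOT NULL,     -- switch IP or hostname
    ifindex INTEGER NOT NULL,  -- SNMP ifIndex of the port
    ifdescr TEXT,              -- port name as reported by the switch
    server  TEXT,              -- server plugged into this port
    UNIQUE (switch, ifindex)
);
"""


def add_target(conn, switch, ifindex, ifdescr, server):
    """Register a port and return its id (the id graphs would reference)."""
    cur = conn.execute(
        "INSERT INTO targets (switch, ifindex, ifdescr, server) "
        "VALUES (?, ?, ?, ?)",
        (switch, ifindex, ifdescr, server),
    )
    conn.commit()
    return cur.lastrowid


def remove_target(conn, switch, ifindex):
    """Stop tracking a port."""
    conn.execute(
        "DELETE FROM targets WHERE switch = ? AND ifindex = ?", (switch, ifindex)
    )
    conn.commit()


def find_by_server(conn, server):
    """Answer 'which id and port belong to this server?'."""
    return conn.execute(
        "SELECT id, switch, ifindex, ifdescr FROM targets WHERE server = ?",
        (server,),
    ).fetchall()
```

Under this assumed design, a discovery script would call `add_target` when it finds a new port, and a poller that re-read the table each cycle would not need restarting.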

I've probably been too much of a perfectionist about getting my changes
done, so it's been taking a while. But since everyone seems interested,
I'm going to focus on getting it out there sooner.

Matt

