[rtg] buffering mechanism

Matt Provost mprovost at termcap.net
Thu May 25 20:31:37 EDT 2006


On Thu, May 25, 2006 at 10:11:29AM -0700, bill fumerola wrote:
> On Wed, May 24, 2006 at 04:07:16PM -0700, Matt Provost wrote:
> > Yeah I saw this before, but like you said it doesn't patch cleanly
> > against the current source.
> 
> it did when i first sent it out (aug 2004) and i merged the vendor source
> and regenerated the patch monthly until i left yahoo in july 2005. i
> kept three perforce branches (vendor, yahoo, public) and some makefile
> magic to accomplish this.
> 
> anyways.
> 
> >                              The db stuff changed quite a bit with the
> > new drivers. In any case, the buffering that I did happens outside of
> > the db drivers so it will work with any of them. If I had more time I'd
> > love to get some of your changes ported to the new version, but I don't
> > at the moment.
> 
> the sqlbuf code i pointed to does the buffering in a db-independent
> way. i haven't looked at the current code. doing buffering completely
> w/o some db-dependent knowledge has a few problems that spring to mind.
> 
> 0) does the current code coalesce rows over a single poll period? this
> is the single largest performance bottleneck that the old code suffered.

No, it would have to know about tables so it can pull a bunch of entries
for a single table to do the insert. That information is there but we're
not doing queues per table or anything yet.
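
A rough sketch of what per-table coalescing could look like, in Python
rather than rtg's C just to keep it short. None of this is rtg code: the
function names are made up, and the (id, dtime, counter) columns and
"%s" placeholders are assumptions about the schema and driver.

```python
from collections import defaultdict


def coalesce(entries):
    """Group buffered (table, iid, dtime, counter) entries by table."""
    by_table = defaultdict(list)
    for table, iid, dtime, counter in entries:
        by_table[table].append((iid, dtime, counter))
    return by_table


def build_inserts(entries):
    """Return one parameterized multi-row INSERT per table.

    Table names are assumed to come from the poller's own target config,
    never from untrusted input, since they are interpolated into the SQL.
    """
    statements = []
    for table, rows in coalesce(entries).items():
        placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
        sql = f"INSERT INTO {table} (id, dtime, counter) VALUES {placeholders}"
        params = [value for row in rows for value in row]
        statements.append((sql, params))
    return statements
```

With this, everything polled for one table in a poll period goes out as
a single statement instead of one INSERT per target.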

> 
> 1) you have to at least know the difference between a soft error (network
> hiccup, server restart on a persistent connection, repair) and a hard
> error (index corruption).  otherwise without treating them differently
> it may cause buffering to occur needlessly on some tables.

I'm sure the error handling could be improved, but right now it buffers
everything, so whether a given error should trigger buffering doesn't
matter yet. It needs something to shove an entry back into the queue
when an insert fails, but that isn't done. The code I wrote was a proof
of concept and needs some work before it's production-ready.
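
For illustration, here is one way the requeue-on-error piece might work,
sketched in Python. The SoftError/HardError split and the dead_letter
list are hypothetical; a real driver would have to map its own error
codes onto the two cases.

```python
from collections import deque


class SoftError(Exception):
    """Transient failure, e.g. a dropped connection; worth retrying."""


class HardError(Exception):
    """Permanent failure, e.g. a corrupt index; retrying will not help."""


def flush(queue, insert, dead_letter):
    """Drain queue through insert() and return the number of rows written.

    On a SoftError the entry goes back on the front of the queue and the
    flush stops, so the caller can retry later. On a HardError the entry
    is set aside so it cannot block everything queued behind it.
    """
    written = 0
    while queue:
        entry = queue.popleft()
        try:
            insert(entry)
        except SoftError:
            queue.appendleft(entry)
            break
        except HardError:
            dead_letter.append(entry)
        else:
            written += 1
    return written
```

Putting the entry back on the front keeps inserts in order once the
database comes back.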

> 
> 2) you need to know the max amount of rows or characters a query can
> contain and this is database dependent.
> 
> the new db layer may have abstracted most of that logic out. still i
> wonder how much work it'd be to have written an sqlbuf_pgsql_cfg() and
> sqlbuf_flush_pgsql() versus Yet Another Database Abstraction Layer.
> 

Yeah, this is another complication, so something in the db driver will
have to return the max number of entries to aggregate at once.
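
Something like the following Python sketch, assuming the driver can
report a row limit (max_rows here is a hypothetical value the driver
would supply; a character or packet-size limit would need a similar
check on the built query):

```python
def chunk_rows(rows, max_rows):
    """Yield successive slices of rows, each at most max_rows long."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]
```

The inserter would then issue one multi-row INSERT per chunk, so the
buffering code itself stays db-independent and only the limit comes
from the driver.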

> in fact there are/were a number of things that limit{ed,s} rtg usage in
> high performance (many hosts and/or many targets per host) environments:
> 
> 0) does the buffering code send the data using a helper thread? does it
> do it in between poll periods? does it insert every time a snmp query
> happens? i could run thousands of targets on a sub-10 second interval,
> down the database for an hour, and not a single poll interval was missed
> and every insert was coalesced and buffered.  try that without a helper
> thread.

Yes, it now splits threads into pollers and inserters, so you can have
different numbers of each. The inserters don't run in the polling loop
at all; they wake up any time new data is inserted into the buffer and
run until it's empty.
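
The general shape of that poller/inserter split, as a Python sketch
rather than the actual rtg C code. InsertBuffer, inserter and run_demo
are names invented for this example.

```python
import threading
from collections import deque


class InsertBuffer:
    """Shared buffer: pollers put(), inserters block in drain()."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()
        self._closed = False

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one sleeping inserter

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def drain(self):
        """Sleep until data arrives, then return everything buffered.

        Returns an empty list once the buffer is closed and empty.
        """
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            batch = list(self._items)
            self._items.clear()
            return batch


def inserter(buffer, write):
    """Inserter thread body: write batches until the buffer is closed."""
    while True:
        batch = buffer.drain()
        if not batch:
            return
        write(batch)


def run_demo(n_pollers=3, n_inserters=2, per_poller=5):
    buf = InsertBuffer()
    written = []
    written_lock = threading.Lock()

    def write(batch):  # stand-in for the real database insert
        with written_lock:
            written.extend(batch)

    def poller(poller_id):  # stand-in for the SNMP polling loop
        for i in range(per_poller):
            buf.put((poller_id, i))

    inserters = [threading.Thread(target=inserter, args=(buf, write))
                 for _ in range(n_inserters)]
    pollers = [threading.Thread(target=poller, args=(p,))
               for p in range(n_pollers)]
    for t in inserters + pollers:
        t.start()
    for t in pollers:
        t.join()
    buf.close()
    for t in inserters:
        t.join()
    return written
```

The pollers never touch the database, so a slow or down database only
makes the buffer grow instead of stalling a poll.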

> 
> 1) what is the mutex locking situation? last i checked if you increased
> max threads too large you could have every thread hit the same device.
> set too low and you could have one device stall all your threads. my
> version had host-based locking so no N>1 targets in a host{} stanza would
> be polled until the first was complete. this was my second largest
> performance improvement.

This is part of having the poller be aware of hosts in some way, which
it currently isn't.
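
A minimal Python sketch of host-based locking, to show the idea only.
HostLocks is a made-up name and this is not how either version of the
poller is written.

```python
import threading


class HostLocks:
    """One lock per host, so only one target per host is polled at a time."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the dict itself

    def _lock_for(self, host):
        with self._guard:
            return self._locks.setdefault(host, threading.Lock())

    def try_acquire(self, host):
        """Return True if the caller now owns the host, False if it is busy."""
        return self._lock_for(host).acquire(blocking=False)

    def release(self, host):
        self._lock_for(host).release()
```

A poller thread that gets False from try_acquire() would skip to a
target on another host instead of waiting, so one slow device can't
tie up every thread and no device gets hit by all of them at once.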

> 
> 2) related to #0: per-thread db connections are just a wasteful use of
> resources. if they're still there i'm sorry for whoever i just offended.

See above; the poller/inserter split allows for different numbers of
pollers and db threads.

> 
> 3) per-instance timeouts, retry, snmp port, etc need to be per-host.
> some devices can be treated aggressively, some need tender care. most
> globally configured parameters in rtg should be inherited but able to
> override.

I believe a lot of this is done in the new targets format; if not, it
should be added.
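
The inherit-but-override behaviour is simple to express. A Python
sketch, with setting names and default values that are placeholders
and not the ones in the real targets format:

```python
# Hypothetical global settings; real rtg option names may differ.
GLOBAL_DEFAULTS = {"timeout": 2.0, "retries": 3, "snmp_port": 161}


def effective_settings(global_defaults, host_overrides):
    """Return the global settings with any per-host overrides applied."""
    unknown = set(host_overrides) - set(global_defaults)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**global_defaults, **host_overrides}
```

A fragile device could then carry just {"timeout": 10.0, "retries": 1}
in its host{} stanza and inherit everything else.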

> 
> 4) removal of targets that return hard errors or consistently timeout.
> the user needs to be able to define targets that stay no matter what and
> define how many times constitutes "consistently".

Not done.
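
If someone wants to pick this up, the bookkeeping could be as small as
the Python sketch below. It assumes "consistently" means N consecutive
failures, which is only one possible definition, and FailureTracker is
a made-up name.

```python
class FailureTracker:
    """Count consecutive poll failures per target."""

    def __init__(self, max_failures, pinned=()):
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures
        self.pinned = set(pinned)  # targets that stay no matter what
        self._streak = {}

    def record(self, target, ok):
        """Record one poll result; return True if the target should be dropped."""
        if ok:
            self._streak.pop(target, None)  # any success resets the streak
            return False
        self._streak[target] = self._streak.get(target, 0) + 1
        return (target not in self.pinned
                and self._streak[target] >= self.max_failures)
```

Both max_failures and the pinned set would come from the config, so
the user decides what "consistently" means and which targets are
never dropped.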

> 
> 5) i still hate libtool. you guys were willing to carry a few thousand
> #ifdefs for insane reasons (NEW_TARGET_FORMAT, FEATURES) but a handful
> of HAVE_MYSQL/HAVE_PGSQL weren't going to work? can you have both compiled
> in with the current code? is there any consideration that you could have
> multiple databases at some point and point some targets (hosts?) at one
> and some at another? could these be one mysql and one pgsql?

I'm not a huge fan of libtool, but it works. I didn't feel like learning
how to compile shared libraries on Solaris or HPUX or whatever people
are running, and it seems to take care of that. The #ifdef situation
isn't maintainable; the idea is to split the mysql driver into versions
for <4.1 and >=4.1, plus postgres, plus maybe oracle...

I'm not sure that connecting to multiple databases is really worth the
effort; it seems like it would be easy enough to run two poller instances.

> 
> 6) there were some minor performance points where snmp_sessions were
> being created and torn down per-target at runtime when they could be
> cached and used per-host. snmp oids were being compiled at runtime instead
> of at config-time. etc. i'm not sure how measurable the performance
> difference was but it sure made the code cleaner.
> 

This is a good idea, but I haven't looked at it.
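
For whoever does look at it, the two ideas (per-host session reuse and
parsing OIDs once up front) might be sketched like this in Python.
open_session is a hypothetical callable standing in for whatever the
SNMP library provides, and this version is not thread-safe as written.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def parse_oid(text):
    """Parse a dotted numeric OID into a tuple of ints, once per string."""
    return tuple(int(part) for part in text.strip(".").split("."))


class SessionCache:
    """Open one session per host and reuse it for every target on it."""

    def __init__(self, open_session):
        self._open_session = open_session  # hypothetical: host -> session
        self._sessions = {}

    def get(self, host):
        if host not in self._sessions:
            self._sessions[host] = self._open_session(host)
        return self._sessions[host]
```

Calling parse_oid() on every target while reading the config would
also catch malformed OIDs at startup instead of in the polling loop.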

I think that the main issue is that there was an effort to freeze the
code and get 0.8 out the door. So everyone was holding off on new
features other than what was on the roadmap, I think with the
expectation that the release would happen sooner rather than later. But
the new version hasn't been finished, so all the other stuff is backed
up. I think it's mostly because of a lack of developer time, coupled
with the fact that I don't think most of us are even using rtg in a
production setting anymore so we're less pressured ourselves to get it
done.

Maybe we should just start adding features that have code already
written and forget about having a solid release version. That's not
really my decision to make but it's an idea.

Matt

