[rtg] buffering mechanism
bill fumerola
billf at mu.org
Thu May 25 13:11:29 EDT 2006
On Wed, May 24, 2006 at 04:07:16PM -0700, Matt Provost wrote:
> Yeah I saw this before, but like you said it doesn't patch cleanly
> against the current source.
it did when i first sent it out (aug 2004) and i merged the vendor source
and regenerated the patch monthly until i left yahoo in july 2005. i
kept three perforce branches (vendor, yahoo, public) and some makefile
magic to accomplish this.
anyways.
> The db stuff changed quite a bit with the
> new drivers. In any case, the buffering that I did happens outside of
> the db drivers so it will work with any of them. If I had more time I'd
> love to get some of your changes ported to the new version, but I don't
> at the moment.
the sqlbuf code i pointed to does the buffering in a db-independent
way. i haven't looked at the current code. doing buffering completely
w/o some db-dependent knowledge has a few problems that spring to mind.
0) does the current code coalesce rows over a single poll period? this
is the single largest performance bottleneck that the old code suffered.
1) you have to at least know the difference between a soft error (network
hiccup, server restart on a persistent connection) and a hard error
(index corruption). if the two aren't treated differently, buffering
may occur needlessly on some tables.
2) you need to know the max number of rows or characters a query can
contain and this is database dependent.
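to make points 0 and 2 concrete, here is a minimal sketch of coalescing one poll period's samples into a single multi-row INSERT that never exceeds a caller-supplied length limit. it is not the sqlbuf code. the struct, the function name coalesce_insert() and the column names (id, dtime, counter) are assumptions made for illustration, as is writing the timestamp as a plain integer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* hypothetical sample: one counter reading for one interface id */
struct sample {
    unsigned int iid;
    unsigned long long counter;
    long dtime;                 /* unix timestamp (assumed representation) */
};

/*
 * build one multi-row INSERT in buf, using at most max_len bytes
 * including the trailing NUL. returns how many rows were consumed;
 * the caller sends buf, then calls again with the remaining rows.
 * max_len stands in for the database-dependent query size limit.
 * table is assumed to come from trusted config, not user input.
 */
size_t coalesce_insert(char *buf, size_t max_len, const char *table,
                       const struct sample *rows, size_t nrows)
{
    size_t used, i;
    int n;

    assert(buf != NULL && table != NULL);

    n = snprintf(buf, max_len,
                 "INSERT INTO %s (id, dtime, counter) VALUES ", table);
    if (n < 0 || (size_t)n >= max_len)
        return 0;               /* limit too small even for the prefix */
    used = (size_t)n;

    for (i = 0; i < nrows; i++) {
        char tuple[96];

        n = snprintf(tuple, sizeof tuple, "%s(%u,%ld,%llu)",
                     i ? "," : "", rows[i].iid, rows[i].dtime,
                     rows[i].counter);
        if (n < 0 || (size_t)n >= sizeof tuple)
            break;
        if (used + (size_t)n >= max_len)
            break;              /* next tuple would exceed the limit */
        memcpy(buf + used, tuple, (size_t)n + 1);
        used += (size_t)n;
    }
    if (i == 0)
        buf[0] = '\0';          /* nothing fit: do not emit an empty VALUES */
    return i;
}
```

a caller would loop, advancing rows by the return value after each successful send, so a thousand samples become a handful of queries instead of a thousand.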
the new db layer may have abstracted most of that logic out. still i
wonder how much work it'd be to have written an sqlbuf_pgsql_cfg() and
sqlbuf_flush_pgsql() versus Yet Another Database Abstraction Layer.
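as a rough illustration of how little db-dependent knowledge the buffer actually needs, a per-driver descriptor could carry just the query size limit and a soft/hard error classifier. everything named here (db_buf_cfg, mysql_classify, should_keep_buffering) is hypothetical and is not rtg's or sqlbuf's api. the numeric codes are MySQL client-library connection errors, and the 1 MB limit is only an assumed placeholder to be tuned per server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum db_err_class { DB_ERR_SOFT, DB_ERR_HARD };

/* hypothetical per-driver descriptor: only what the buffer needs to know */
struct db_buf_cfg {
    const char *name;
    size_t max_query_len;       /* bytes; assumed value, server dependent */
    enum db_err_class (*classify)(unsigned int err);
};

/* connection-level failures are soft; anything else is treated as hard */
static enum db_err_class mysql_classify(unsigned int err)
{
    switch (err) {
    case 2002:                  /* CR_CONNECTION_ERROR */
    case 2003:                  /* CR_CONN_HOST_ERROR */
    case 2006:                  /* CR_SERVER_GONE_ERROR */
    case 2013:                  /* CR_SERVER_LOST */
        return DB_ERR_SOFT;
    default:
        return DB_ERR_HARD;
    }
}

static const struct db_buf_cfg mysql_cfg = {
    "mysql", 1024 * 1024, mysql_classify
};

/* keep rows buffered for a retry only when the failure was soft */
bool should_keep_buffering(const struct db_buf_cfg *cfg, unsigned int err)
{
    assert(cfg != NULL && cfg->classify != NULL);
    return cfg->classify(err) == DB_ERR_SOFT;
}
```

a pgsql descriptor would be a second instance of the same struct with its own classifier, which is roughly the amount of per-database code being argued for above.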
in fact there are/were a number of things that limit{ed,s} rtg usage in
high performance (many hosts and/or many targets per host) environments:
0) does the buffering code send the data using a helper thread? does it
do it in between poll periods? does it insert every time a snmp query
happens? i could run thousands of targets on a sub-10 second interval,
down the database for an hour, and not a single poll interval was missed
and every insert was coalesced and buffered. try that without a helper
thread.
1) what is the mutex locking situation? last i checked if you set max
threads too high you could have every thread hit the same device. set
it too low and you could have one device stall all your threads. my
version had host-based locking so no N>1 targets in a host{} stanza would
be polled until the first was complete. this was my second largest
performance improvement.
2) related to #0: per-thread db connections are just a wasteful use of
resources. if they're still there i'm sorry for whoever i just offended.
3) per-instance timeouts, retry, snmp port, etc need to be per-host.
some devices can be treated aggressively, some need tender care. most
globally configured parameters in rtg should be inherited but
overridable.
4) removal of targets that return hard errors or consistently time out.
the user needs to be able to define targets that stay no matter what and
define how many times constitutes "consistently".
5) i still hate libtool. you guys were willing to carry a few thousand
#ifdefs for insane reasons (NEW_TARGET_FORMAT, FEATURES) but a handful
of HAVE_MYSQL/HAVE_PGSQL weren't going to work? can you have both compiled
in with the current code? is there any consideration that you could have
multiple databases at some point and point some targets (hosts?) at one
and some at another? could these be one mysql and one pgsql?
6) there were some minor performance points where snmp_sessions were
being created and torn down per-target at runtime when they could be
cached and used per-host. snmp oids were being compiled at runtime instead
of at config-time. etc. i'm not sure how measurable the performance
difference was but it sure made the code cleaner.
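the host-based locking in point 1 can be sketched as a per-host try-lock: a worker only takes a target whose host is idle, so no host ever has more than one query in flight and a slow device ties up at most one thread. this is an illustration, not the original patch. it uses a C11 atomic_flag as the try-lock instead of a pthread mutex, and the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* hypothetical host record: busy is set while one thread polls the host */
struct host {
    const char *name;
    atomic_flag busy;
};

/* hypothetical target: done is only read or written while host is claimed */
struct target {
    struct host *host;
    const char *oid;
    bool done;                  /* already polled this interval */
};

/*
 * find the next unpolled target whose host is idle and claim that host.
 * returns NULL when every remaining target belongs to a busy host.
 */
struct target *claim_next_target(struct target *targets, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        struct target *t = &targets[i];

        if (atomic_flag_test_and_set(&t->host->busy))
            continue;           /* another thread is polling this host */
        if (!t->done)
            return t;           /* host stays claimed until release */
        atomic_flag_clear(&t->host->busy);
    }
    return NULL;
}

/* mark the target polled and let other threads use the host again */
void release_target(struct target *t)
{
    assert(t != NULL);
    t->done = true;
    atomic_flag_clear(&t->host->busy);
}
```

each worker thread would loop on claim_next_target(), do its snmp query, then call release_target(). a real poller would also reset done at the start of every interval and would likely keep a per-host queue rather than scanning the whole target array.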
i don't mean to piss on any parades, but the community could have a lot
of the above written and maintained by me at the expense of my former
employer. now it's going to take someone rewriting it or bringing my
patch up to date. hell, even the last time my code was committed it was
behind giant #ifdef FUMEROLA and turned off essentially.
i've considered bringing my work up-to-date myself but: i don't have any
warmer feeling that it would be committed this time than i did last time.
i don't have a giant environment to test it on like i did before but i
could get around that. more to the point, if i'm going to rewrite/update
the sql buffering/coalescing, config reader, config internal storage,
half the snmp code, and all the #ifdef and other C cleanups... then i
start to realize that's 50% of the code, i start thinking about code
forks and then i get angry at the development model.
anyways, i bring up these performance/usability/feature things once a
year now and i'll crawl back into my hole for 2006 until i'm convinced
my work will end up in the tree or someone pays me my hourly rate to
give me something to walk away with if it doesn't get committed.
apologies for sour grapes and posting before my morning coffee,
-- bill