Discussion:
Database for greylisting plugin.
Graham Miller
2006-12-08 03:23:34 UTC
Hi all,

I am designing a plugin for mailfront to do greylisting. After writing (in
C) an in-line filter (based on IP address only) to sit between rblsmtpd and
smtpfront, I am now encouraged to do it properly after seeing great results
(spam in the test mailbox down from 160 per day to 10).

I am designing the plugin to use IP address, Recipient, and Sender as a
triplet (maybe configurable down to ip-recipient), and am having a bit of
trouble choosing a database type to use. I hope that those here can help
with some feedback to make it as good as possible.

On one hand, the greylist.org site refers to the relaydelay perl script,
which uses mysql. Many others seem to use mysql too.

mysql Pros:
- central (or replicated) DB for server farms.
- extra valuable info available with sql query.
- known robust DB.
- reasonably quick data access.

mysql Cons:
- need to handle "no access to db" issue.
- slower than file system access (perhaps).
- extra layer of complexity.
- not all sites use mysql.

On the other hand, simpler systems I have seen use Berkeley-style filesystem
DBs. These are more prevalent on unix/linux hosts, but lack simple ad hoc
query ability and detailed record structures.

Another option is DJB's cdb, but that seems to be tuned for read-only
applications rather than write-intensive ones like a greylisting program.

Another thought I have had is some kind of filesystem scheme like DJB uses
for the mail queue, or an indexed directory structure with filenames made
up of the triplet.

The current greylistd C program I am using on a beta test server (one
domain that gets lots of spam) uses the filesystem, with files named after
the sending IP address. This seems to be working ok so far, but I have
concerns about the number of files that end up being cached by the OS and
robbing the server of RAM.
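
For the curious, the gist of the filename idea is something like this
rough, untested sketch (the directory name and escaping are placeholders,
not the actual greylistd code):

    /* Record a triplet as an empty file whose name encodes it.
     * A real version must escape '/' and other unsafe characters. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int touch_triplet(const char *ip, const char *sender,
                             const char *recipient)
    {
        char path[1024];
        int fd;

        snprintf(path, sizeof path, "greydir/%s,%s,%s",
                 ip, sender, recipient);
        fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;      /* caller decides what to do on failure */
        close(fd);
        return 0;
    }

The file's mtime then records when the triplet was first seen.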

So anyone want to wade in here and put in their experience or ideas?

Any comments about the trade-off of post-event analysis versus simpler
storage needs?

thanks
Graham Miller
George Georgalis
2006-12-08 03:58:27 UTC
Post by Graham Miller
- central (or replicated) DB for server farms.
is that really needed for greylisting?
Post by Graham Miller
The current greylistd C program I am using on a beta test server (one
domain that gets lots of spam) uses the filesystem, with files named after
the sending IP address. This seems to be working ok so far, but I have
concerns about the number of files that end up being cached by the OS and
robbing the server of RAM.
an advantage there is simple (emergency) shell manipulation

sqlite has a facility to use in-memory tables; you will probably
want to add a capacity to dump/load data to disk. the source is
mature and in the public domain, so you can include the libs in
your package.

when an mx passes, you could make the record 4 bytes (ip) and an
integer (time). In any event, I have the sense your app won't
cause memory problems, but it may be simple to optionally use
on-disk tables.
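
a rough sketch of the in-memory setup, with made-up table and
column names (link with -lsqlite3):

    /* In-memory SQLite greylist table. */
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = 0;

        if (sqlite3_open(":memory:", &db) != SQLITE_OK)
            return 1;
        if (sqlite3_exec(db,
                "CREATE TABLE greylist ("
                " ip INTEGER,"            /* 4-byte IPv4 address */
                " first_seen INTEGER)",   /* unix time */
                0, 0, &err) != SQLITE_OK) {
            fprintf(stderr, "sqlite: %s\n", err);
            sqlite3_free(err);
        }
        /* insert/select triplets here; dump to disk periodically */
        sqlite3_close(db);
        return 0;
    }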

// George
--
George Georgalis, systems architect, administrator <IXOYE><
Graham Miller
2006-12-08 04:49:21 UTC
Post by George Georgalis
Post by Graham Miller
- central (or replicated) DB for server farms.
is that really needed for greylisting?
I read on the greylist-users mailing list that it was a consideration, but
am not convinced it is necessary yet. In my case it will not be needed, as
we are only a small hosting company. I was kinda thinking about other
potential users when I wrote that.
Post by George Georgalis
Post by Graham Miller
The current greylistd C program I am using on a beta test server
(one domain that gets lots of spam) uses the filesystem, with files
named after the sending IP address. This seems to be working ok so
far, but I have concerns about the number of files that end up being
cached by the OS and robbing the server of RAM.
an advantage there is simple (emergency) shell manipulations
Now that is very true. If the whitelist is stored separately from the
greylist, then the greylist can be blown away totally in an emergency if
needed. And if I used another database that lived in the filesystem or RAM,
there would be a similar (if not heavier) RAM penalty for caching the
files' directory entries.

I will be making the system fail open by default, so if there is a DB
error (whichever DB it is), it will allow mail to pass. I will probably
make it a CLI option so it can be selected on a per-instance basis.
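
Roughly this, though the names here are placeholders rather than
mailfront's actual API:

    /* Fail-open: on any database error, accept the message. */
    enum grey_verdict { GREY_ACCEPT, GREY_DEFER };

    static int fail_open = 1;       /* would become a CLI option */

    enum grey_verdict check_triplet(int db_ok, int seen_before)
    {
        if (!db_ok)
            return fail_open ? GREY_ACCEPT : GREY_DEFER;
        return seen_before ? GREY_ACCEPT : GREY_DEFER;
    }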
Post by George Georgalis
sqlite has a facility to use in-memory tables, you will probably
want to add a capacity to dump/load data to disk. the source is
mature and in public domain, so you can include the libs in your
package.
Gosh, that is really great. I just checked out the web site and it looks
very powerful, with a CLI to do management and ad hoc queries. Thanks for
the tip.
Post by George Georgalis
when an mx passes, you could make the record 4 bytes (ip) and an
integer (time). In any event, I have the sense your app won't
cause memory problems, but it may be simple to optionally use
on-disk tables.
I guess if it ever became popular with others, then optional storage types
would be the go. I will try to make the storage interface as simple as
possible so DB engines can be plugged in relatively easily.
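
Roughly a function-pointer table like this untested sketch (all names
invented):

    /* Each backend (files, sqlite, mysql, ...) fills in one of these. */
    struct grey_store {
        int  (*open)(const char *params);
        int  (*lookup)(const char *ip, const char *sender,
                       const char *recipient);   /* seen before? */
        int  (*record)(const char *ip, const char *sender,
                       const char *recipient);   /* remember triplet */
        void (*close)(void);
    };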

Thanks for your input, George.

regards
Graham
Bruce Guenter
2006-12-08 04:33:16 UTC
Post by Graham Miller
On the other hand, simpler systems I have seen use Berkeley-style filesystem
DBs. These are more prevalent on unix/linux hosts, but lack simple ad hoc
query ability and detailed record structures.
One con to using the Berkeley/SleepyCat db libraries is that they change
their file format every other release, and you can't even read the old
databases with new libraries. On the other hand, they work, and they
work well.
Post by Graham Miller
Another option is DJB's cdb, but that seems to be tuned for read-only
applications rather than write-intensive ones like a greylisting program.
No, as you say, CDBs are only appropriate for data that changes rarely.
Post by Graham Miller
The current greylistd C program I am using on a beta test server (one
domain that gets lots of spam) uses the filesystem, with files named after
the sending IP address. This seems to be working ok so far, but I have
concerns about the number of files that end up being cached by the OS and
robbing the server of RAM.
If you can arrange to have either completely empty files or small
numbers of files with larger contents, then I wouldn't worry about the
caching hit. Whether you use the filesystem as the database or add a
database layer, the OS is going to have to cache the data, and the file
names themselves are no more expensive than putting them in a database.
The OS will want to cache an inode to provide timestamp information,
which starts to get a bit bigger, but still not a huge issue.
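
For example (a sketch only; the path layout is hypothetical), a lookup
costs one stat and never has to open the file:

    /* Seconds since the triplet was first seen, or -1 if unseen. */
    #include <sys/stat.h>
    #include <time.h>

    long triplet_age(const char *path)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return -1;
        return (long)(time(0) - st.st_mtime);
    }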

Using the filesystem is vastly simpler for debugging and emergency
maintenance, as standard file tools work.
--
Bruce Guenter <***@untroubled.org> http://untroubled.org/
Graham Miller
2006-12-08 06:17:08 UTC
Post by Bruce Guenter
One con to using the Berkeley/SleepyCat db libraries is that they
change their file format every other release, and you can't even read
the old databases with new libraries. On the other hand, they work,
and they work well.
Well we won't go there then. <grin>
Post by Bruce Guenter
If you can arrange to have either completely empty files or small
numbers of files with larger contents, then I wouldn't worry about the
caching hit. Whether you use the filesystem as the database or add a
database layer, the OS is going to have to cache the data, and the file
names themselves are no more expensive than putting them in a database.
The OS will want to cache an inode to provide timestamp information,
which starts to get a bit bigger, but still not a huge issue.
Using the filesystem is vastly simpler for debugging and emergency
maintenance, as standard file tools work.
This makes a lot of sense.

If it were designed to do basic greylisting (sender-recipient-ip) using the
filesystem as a database (empty files, using mtime), then it could not
differentiate between a triplet combination that was greylisted 8 hours ago
(and never returned) and one that has already returned (which should still
be whitelisted). This is, of course, unless the whitelist was separate from
the greylist and the whitelist was checked first. I am not sure I want to do
that. I think I would prefer all entries in one database "table" so that
I/O is kept to a minimum.

Of course, I might just be totally ignorant of how to use the filesystem as
a database with empty files.

Does anyone think that the queue algorithm in qmail would be of any use? Or
how about TAI-encoding the expiry time into the filename? Shudder... that
would mean a grep to find a matching triplet, perhaps.

It seems from my reading that a smarter system is possible with a
record-oriented database. Then separate fields can record first access,
last access, hit/block/fail counts, expiry times, and whitelist flags. This
extra info would allow things like extended expiry times for IPs that meet
certain criteria (a sort of automatic whitelisting), selection of a
grouping function to handle server farms that don't send from the same IP,
and delaying the 451 error to cope with brain-dead sender verification
schemes, to name a few. All with one I/O.
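
Something like this hypothetical record layout is what I have in mind
(field names invented):

    #include <stdint.h>
    #include <time.h>

    struct grey_record {
        uint32_t ip;          /* IPv4 address */
        time_t   first_seen;  /* first access */
        time_t   last_seen;   /* last access */
        uint32_t hits;        /* successful retries */
        uint32_t blocks;      /* 451 responses issued */
        time_t   expires;     /* expiry time */
        uint8_t  whitelisted; /* auto-whitelist flag */
    };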

Does anyone know of better schemes for using the filesystem? Is there more
info than the mtime of a file that could be used and manipulated?

Thanks
Graham
Bruce Guenter
2006-12-13 05:04:14 UTC
Post by Graham Miller
If it were designed to do basic greylisting (sender-recipient-ip) using the
filesystem as a database (empty files, using mtime), then it could not
differentiate between a triplet combination that was greylisted 8 hours ago
(and never returned) and one that has already returned (which should still
be whitelisted). This is, of course, unless the whitelist was separate from
the greylist and the whitelist was checked first. I am not sure I want to do
that. I think I would prefer all entries in one database "table" so that
I/O is kept to a minimum.
Use another directory and rename (requires 2 stat checks) or link them
across (requires 2 unlinks to clean up) to whitelist them.
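A sketch of the rename variant (directory names are arbitrary):

    /* Promote a triplet from greylist to whitelist; a lookup then
     * stats whitedir first, greydir second: two stats per check. */
    #include <stdio.h>

    int promote(const char *name)
    {
        char from[1024], to[1024];

        snprintf(from, sizeof from, "greydir/%s", name);
        snprintf(to, sizeof to, "whitedir/%s", name);
        return rename(from, to);  /* atomic on the same filesystem */
    }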
Post by Graham Miller
Of course, I might just be totally ignorant of how to use the filesystem as
a database with empty files.
They don't have to be empty, but if they aren't then minimum allocation
rules start to apply. For example, each 1-byte file in most filesystems
on Linux will use 4kB. Of course, disk is cheap, so this probably isn't
worth worrying about in the long run. However, even 2 stat calls should
be faster than the open+read+close needed to read file contents.
Post by Graham Miller
Does anyone think that the queue algorithm in qmail would be of any use?
I don't see how. It depends on what part of the "queue algorithm"
you're referring to.
Post by Graham Miller
Or how about TAI-encoding the expiry time into the filename? Shudder...
that would mean a grep to find a matching triplet, perhaps.
No, having to list a directory to find one entry would not be efficient
or scalable.
Post by Graham Miller
It seems from my reading that a smarter system is possible with a
record-oriented database. Then separate fields can record first access,
last access, hit/block/fail counts, expiry times, and whitelist flags.
Yes, if you want to record all this, you will need to either store some
data in the files or move to a more advanced DBM.
Post by Graham Miller
This extra info would allow things like extended expiry times for IPs that
meet certain criteria (a sort of automatic whitelisting), selection of a
grouping function to handle server farms that don't send from the same IP,
and delaying the 451 error to cope with brain-dead sender verification
schemes, to name a few. All with one I/O.
With one I/O you can do the same in a plain text file, and get the bonus
of being able to manipulate the database with command-line tools.
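
For example, one line per triplet (a made-up format: ip, sender,
recipient, first seen, last seen, hits):

    192.0.2.1 sender@example.com rcpt@example.net 1165550000 1165560000 3

which grep, awk, and cut can slice apart directly.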

If text files don't cut it, going to a *SQL system would likely be the
next best bet, as the command-line interfaces work well for arbitrary
manipulations of the data. SQL systems typically give many more
advanced features that simply aren't available when using plain files.

There are still many cases for which even the most efficient *SQL
manager will not be able to match the speed and ease of using the
filesystem directly as a database. For all SQL systems (except SQLite),
you have to connect, authenticate, select a database, and issue a query;
the server then parses the query, reads the data, and sends it back.
Compare that to open, read, close (plus a simple parse), and you see why
I have a preference for files.
Post by Graham Miller
Is there more info than the mtime of a file that could be used
and manipulated?
Much more. You could use some of the mode bits as a bitmask, create
extra links to adjust the link count, or pad the file with data to adjust
its size. You can also set the access time, but any subsequent open or
read will reset it.
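
A sketch, not a recommendation (the flag value is made up):

    /* Stash a flag in the permission bits; read the link count back
     * out of the inode. */
    #include <sys/stat.h>

    #define FLAG_WHITELISTED 0100   /* reuse the owner-execute bit */

    int set_whitelisted(const char *path)
    {
        return chmod(path, 0644 | FLAG_WHITELISTED);
    }

    int link_count(const char *path)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return -1;
        return (int)st.st_nlink;
    }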
--
Bruce Guenter <***@untroubled.org> http://untroubled.org/