[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Data Compression



On Tue, Feb 01, 2000 at 05:11:32PM -0700, D.F.S. wrote:
>
>[...snip...]
>My vision was more along the lines of the fact we could easily
>fit this ENTIRE huge database of info that was shipped on a CD into
>maybe 10-15 megabytes.
>With that handy, The number and variety of programs, that could
>provide a wealth of info we have not even envisioned yet, is huge.

   I think I need to get moving on getting the data to the winner
of the recent "server donation drive", (thanks everyone who offered..)
so that people can start looking at the data themselves!  These kind of
sizes are easily achievable using simple data reductions, (not that I'm
totally against compression, I just have some unanswered concerns about
the viability of fast search algorithms when using it). Let me inject
some facts about the data;

 1) The NASD CD-ROM as shipped wastes *HUGE* amounts of space because
of the format they used; fixed length data fields, and fixed length
records.  Many of the data fields are either *completely* blank,
or have very large numbers of leading or trailing blanks. As I
had posted before, more than 82% of the characters are blanks!

 2) By simply deleting leading and trailing blanks, and changing to
some type of variable length fields (using a field delimiter
character like ":" or "|") you can get this down to the sizes
you stated. Doing this makes a lot of sense to me, since extraneous
blanks get thrown out at some point, why not do it "up front" and
eliminate the burden placed on application writers.  If you delete
all of the blanks you get;

	{kd6ekq:1173}% ls

	aff.txt     com.txt     hpf.txt     nav.txt     readme.txt  wxl.txt
	apt.txt     fix.txt     ils.txt     oth.txt     ssd.txt
	arb.txt     formats     lid.txt     pfr.txt     sua.txt
	awy.txt     fss.txt     mtr.txt     pja.txt     twr.txt

	{kd6ekq:1174}% cat *.txt | tr -d ' ' | wc -c

	33602633

I know this not exactly correct, since in practice you only
delete leading, trailing blanks. But, an appprox upper bound.

 3) A quick inspection of any of the formats, shows that many fields
could be "compressed" by storing enumerations of the expected types. For
example, the airport type in the NASD file is one of: "airport", "heliport",
"gliderport", etc and this could easily be enumerated as 1, 2, 3, etc.
for significant savings.

 4) Here's a rough size of a minimal database, based on the current
fplan databases that are stored using a ":" variable field width
delimiter. If you eliminate all of the null chars used for padding,
you get;

	{kd6ekq:1166}% cat airports.nav vors.nav \
			| tr -d '\000'\
			| wc -c
	3227191

 5) Once you go to variable width fields, and eliminate all those
redundant blanks chars, the compression ratios you will get will not be
as high as what you get with the "RAW" NASD files. Using the fplan
databases as a simple benchmark, you get;

	{kd6ekq:1165}% cat airports.nav vors.nav \
			| tr -d '\000' \
			| bzip2 -9 - \
			| wc -c
	 884003

This works out to be about 3.65 to 1

>>    A general comment about database compression. Note that in the
>> case of the fplan databases, the records are padded to fixed lengths so
>> that a binary search can be used for the lookup by identifier.
>> [...snip...]
>Do you mean a Binary Tree type Search?

   Yes...

>The usual approach is to build an index and point it back at the
>file containing the data.
> 
> This has several advantages:
> You are not wasting all that space keeping fixed length records.

   Agreed, this is the way to go if you want to have the ability to
search on different fields, and there are certainly many very good
reasons to want this.

   I've already given consideration to switching to this type of
approach with the fplan databases. I'm game if people want to start
working on a spec for the files and fields. I wasn't suggesting fast binary
was the best choice, my point was (as you also stated), you can do much
better than simple sequential search. Further, it's not clear to me that
these sorts of algorithms work well with compression?

> [...snip...]
>In the case in question, that data can be compressed.

   This is where I was bothered. Once you find the entry in the index
for the record you want, you fseek() to the disk offset to actually
read the data record. I just checked the zlib docs and they DO mention a
gzseek(..., SEEK_SET) function which I wasn't aware of. Does it utilize
a sequential read for positioning (ie: BAD performance), or is there
some indexing of the blocks mechanism or the like that provides good
performance????

> 4 char Identifier and a 16 bit "Pointer" ~50K bytes to index
> the entire airport database.

   The identifiers for fixes and intersections are 5 chars long...

FWIW, I'll throw out another opinion. The KISS principle has never
failed me, and in this case it tells me that it would be foolish to try
and convert *ALL* the fields from the NASD database.  This would be a
BIG job, and might take too darn long, or worse, it might never get done.

I believe that it would be much better (and more timely) to start small
and work up in steps.  There are a lot of fields in the NASD files you
will never care about.  Start with the fields that, say fplan uses,
and add to them those that you know you could use, then add those you
think you *might* use, and at some point, forget the rest. My educated
guess is that when you are done, you'll still have less than 20 Mb of
real usable data without going to much trouble.

Regards, John

-- 
 ___|___  | John C. Peterson, KD6EKQ | Try Linux for x86, the operating system
  -(*)-   | mailto:jaypee@netcom.com | that gives *you* the right of fair use!
  o/ \o   | San Diego, CA   U.S.A    | See http://www.linux.org/ for info
-
Archives of linux-aviation: http://mail.nl.linux.org/lists/linux-aviation/
To unsubscribe: send the command "unsubscribe linux-aviation" in the body
of a mail message to <Majordomo@mail.nl.linux.org>.