
Re: Data Compression




First off, it appears you took my statements in the spirit they
were offered. I'm glad to see that.

Snip

Good comments about removing wasted white space and going to variable
length records. I agree.

> >>    A general comment about database compression. Note that in the
> >> case of the fplan databases, the records are padded to fixed lengths so
> >> that a binary search can be used for the lookup by identifier.
> >> [...snip...]
> >Do you mean a Binary Tree type Search?
> 
>    Yes...
> 
> >The usual approach is to build an index and point it back at the
> >file containing the data.
> > 
> > This has several advantages:
> > You are not wasting all that space keeping fixed length records.
> 
>    Agreed, this is the way to go if you want to have the ability to
> search on different fields, and there are certainly many very good
> reasons to want this.
> 
>    I've already given consideration to switching to this type of
> approach with the fplan databases. I'm game if people want to start
> working on a spec for the files and fields. I wasn't suggesting fast binary
> was the best choice, my point was (as you also stated), you can do much
> better than simple sequential search. 
There are better search methods than B-Tree in some cases, but they
offer so little relative improvement in the general case that I
probably wouldn't bother.
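
To make the index idea concrete, here is a minimal sketch in Python (chosen for brevity). It is not the fplan format: the `|` delimiter, the made-up records, and the names `build_index` and `lookup` are all assumptions for illustration. The index holds sorted (identifier, byte offset) pairs, so a binary search on the index replaces the fixed-length padding in the data file.

```python
import bisect
import io

def build_index(data_file):
    """Scan a delimited file once; return sorted (identifier, byte offset) pairs."""
    index = []
    offset = data_file.tell()
    for line in iter(data_file.readline, b""):
        ident = line.split(b"|", 1)[0]       # assumed: identifier is the first field
        index.append((ident, offset))
        offset = data_file.tell()
    index.sort()
    return index

def lookup(data_file, index, ident):
    """Binary-search the index, then seek straight to the variable-length record."""
    pos = bisect.bisect_left(index, (ident, 0))
    if pos == len(index) or index[pos][0] != ident:
        return None
    data_file.seek(index[pos][1])
    return data_file.readline().rstrip(b"\n").split(b"|")

# Made-up records of different lengths: identifier|name|elevation
sample = io.BytesIO(b"CCC|Charlie Base|1200\nAAA|Alpha Field|100\nBBB|Bravo Strip|25\n")
index = build_index(sample)
record = lookup(sample, index, b"BBB")
missing = lookup(sample, index, b"ZZZ")
```

The data file does not need to be sorted or padded; only the index is sorted, and a second index on another field could point into the same file.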

> Further, it's not clear to me that
> these sorts of algorithms work well with compression?

There are issues there, I admit.

My original intent was to be able to generate delimited files
extracted from the original fixed-length mainframe data and
write them out as more logical individual files that are halfway
"normalized".

Then compress them with a library so that a person had the
option of reading them from the compressed state directly
using that library or running a decompression program on them
first.
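
A rough sketch of those two options, using Python's standard gzip module as a stand-in for whatever library ends up being chosen. The file names and the records are made up for illustration.

```python
import gzip
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
packed = os.path.join(workdir, "airports.txt.gz")   # hypothetical file name
plain = os.path.join(workdir, "airports.txt")       # hypothetical file name

rows = ["AAA|Alpha Field|100", "BBB|Bravo Strip|25"]  # made-up records

# Write the delimited file in compressed form.
with gzip.open(packed, "wt", encoding="ascii") as out:
    for row in rows:
        out.write(row + "\n")

# Option 1: read records directly from the compressed file.
with gzip.open(packed, "rt", encoding="ascii") as src:
    direct = [line.rstrip("\n").split("|") for line in src]

# Option 2: run a decompression step first, then read the plain file.
with gzip.open(packed, "rb") as src, open(plain, "wb") as dst:
    shutil.copyfileobj(src, dst)
with open(plain, "r", encoding="ascii") as src:
    unpacked = [line.rstrip("\n").split("|") for line in src]
```

Either way the program sees the same records; the only difference is whether the decompression happens inside the reader or as a separate step.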

With the way this kind of software is written and supported,
i.e. freely and for personal reasons, I don't think anybody
should dictate how programs are written.
This is not a job, and a person should have the freedom to
code or design however they want, with no justification required.

Advice, speculation, brainstorming and suggestions are what this 
list should be about, if an author wants the input.

> 
> > [...snip...]
> >In the case in question, that data can be compressed.
> 
>    This is where I was bothered. Once you find the entry in the index
> for the record you want, you fseek() to the disk offset to actually
> read the data record. I just checked the zlib docs and they DO mention a
> gzseek(..., SEEK_SET) function which I wasn't aware of. Does it utilize
> a sequential read for positioning (ie: BAD performance), or is there
> some indexing of the blocks mechanism or the like that provides good
> performance????
That is something we need to look into.
Worst case, the way I see it, there would be a decision to make based
on how fast a sequential read is and what the relative pros and cons
are of living with that speed, whatever it is. That means looking at
how often you would actually be doing a specific type of read,
figuring how much work it would be to make it work differently, and
figuring the space required to simply store the data in an
uncompressed state.
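
For what it is worth, the zlib manual describes gzseek() on a file opened for reading as emulated and potentially extremely slow, which points at the sequential-read case. One way it could be made to work differently is to compress the records in independent blocks and keep an index of block offsets. The sketch below (in Python, using its zlib module) is only an assumption about how that might look: the block size, the 4-byte length prefix, and the names `pack_blocks` and `read_record` are all made up for illustration.

```python
import io
import struct
import zlib

RECORDS_PER_BLOCK = 2   # assumed; a real file would use much larger blocks

def pack_blocks(records, out):
    """Compress records in independent blocks; return each block's start offset."""
    offsets = []
    for i in range(0, len(records), RECORDS_PER_BLOCK):
        blob = zlib.compress(b"\n".join(records[i:i + RECORDS_PER_BLOCK]))
        offsets.append(out.tell())
        out.write(struct.pack("<I", len(blob)))   # 4-byte length prefix
        out.write(blob)
    return offsets

def read_record(f, offsets, n):
    """Fetch record n by decompressing only the block that holds it."""
    f.seek(offsets[n // RECORDS_PER_BLOCK])
    (size,) = struct.unpack("<I", f.read(4))
    block = zlib.decompress(f.read(size)).split(b"\n")
    return block[n % RECORDS_PER_BLOCK]

# Made-up records
records = [b"AAA|Alpha Field", b"BBB|Bravo Strip", b"CCC|Charlie Base"]
store = io.BytesIO()
offsets = pack_blocks(records, store)
third = read_record(store, offsets, 2)
```

The trade-off is that small blocks compress worse than one long stream, so the block size would have to be picked against how often random reads actually happen.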

SNIP
> 
> FWIW, I'll throw out another opinion. The KISS principle has never
> failed me, and in this case it tells me that it would be foolish to try
> and convert *ALL* the fields from the NASD database.  This would be a
> BIG job, and might take too darn long, or worse, it might never get done.
> 
> I believe that it would be much better (and more timely) to start small
> and work up in steps.  There are a lot of fields in the NASD files you
> will never care about.  Start with the fields that, say fplan uses,
> and add to them those that you know you could use, then add those you
> think you *might* use, and at some point, forget the rest. My educated
> guess is that when you are done, you'll still have less than 20 Mb of
> real usable data without going to much trouble.

I see your point about getting things finished.
I also see how high the payback is on converting or enumerating many of
the fields in the database: stuff that takes 12 chars and only contains
3 different values in the whole data set.

That is where I looked to compression: in a few hours you could pack the
whole thing down and get much of the benefit size-wise without all the
work required to convert specific fields.
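
A small sketch of the two routes on a made-up column: a 12-character padded field that only ever holds three values. The field values and record count are invented, and the data is far more regular than a real file would be, so treat the sizes it produces as an illustration only.

```python
import zlib

# Made-up column: a 12-character padded field with only three distinct values.
values = ["PUBLIC", "PRIVATE", "MILITARY"]
column = [values[i % 3].ljust(12) for i in range(3000)]

raw = "".join(column).encode("ascii")

# Route 1: enumerate the field (one byte per record plus a small lookup table).
codes = {v.ljust(12): n for n, v in enumerate(values)}
enumerated = bytes(codes[field] for field in column)
decoded = [values[b].ljust(12) for b in enumerated]

# Route 2: generic compression, with no knowledge of the field at all.
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

sizes = {"raw": len(raw), "enumerated": len(enumerated), "compressed": len(compressed)}
```

Enumerating needs someone to study each field and maintain the lookup table; the compression route needs neither, which is the few-hours shortcut.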

In the end, specific fields should probably be converted as someone
needs them, but I'd hate to start excluding anything from the dataset
because "Nobody will ever need it".

The OPTION of compression should remove most of the reasons anyone
would have for dropping anything to save space.

Sort of the "Save the space now, keep all the options open, and make
it better as we have time or reason" option.

And as I said before, all we need to do is include a decompression 
program with the data, and people can do whatever they want with it.

Marc
-
Archives of linux-aviation: http://mail.nl.linux.org/lists/linux-aviation/
To unsubscribe: send the command "unsubscribe linux-aviation" in the body
of a mail message to <Majordomo@mail.nl.linux.org>.