Printed Document Availability


Erik
July 23rd, 2003, 08:48 AM
I mentioned this documentation registering effort on the ccTalk mailing list and got the following response (amongst others):

Hi,

>> I finally have a scanning system setup here for archiving documents.
>
> On a tangentially related note, we've just started an effort at the VC
> Forum to track scanned documents.

Ahh, but why limit yourself to just scanned documentation? In terms of systems
preservation, typically I imagine that the documentation that *doesn't* get
scanned is the more useful as it's for the rarer machines and harder to come
by. Problem is there's less incentive to scan documentation for machines with a
low production volume, and for more complex or specialist machines the
documentation can be huge (I catalogued all the Torch stuff I have and there's
well over 13,000 pages - no way I'm scanning that! :)

Collectors might not be willing to scan in everything they own - but they might
be willing to make it known that they have paper copies of xyz and so therefore
could look up information on somebody's behalf if needs be. Could be invaluable
for bringing less common machines back to life again.

This won't work for those of us who constantly trade machines back and forth
(and there's nothing wrong with that!) but I imagine lots of us have
collections that only ever get added to, or have machines that won't (likely)
ever be traded or sold on.

Experience has been that the classiccmp list - whilst invaluable - isn't always
the best source of information, plus posts with questions get missed etc.

> Think of it as an index to available online documents of interest to
> vintage computer collectors.

just drop the 'online' bit :-)

What data do you actually store? I ended up with the following for my stuff:

Related manufacturer
Page format / type
Issue / date
Author
Notes
Location
Quantity
Source
Part number
Size (pages, approx)

Most of those are optional. 'Location' is just something I use to tell me where
things are when they're stored in a binder or whatever - for a system used by
several people it could be dropped (or kept private from other users). 'Source'
tells me where I got xyz from and when - I've found that to be useful to know
in the past. Again, could be private data. 'Size' is handy to know for when
somebody asks whether you could scan something - gives a good idea of effort
involved!

For a system shared between users I'd probably add a 'Related machine' column
too, and it'd of course need an 'online location' field and some sort of user
contact details too. Some of those fields would be common to multiple entries
for the same document, others on a per-item basis. ('date entry added' might be
nice too)
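
[Editor's sketch] The field list above could be written down as a pair of record types. This is only an illustration in Python: the class names (`Document`, `Copy`), the `title` field, and the choice of which fields are shared per document versus held per item are assumptions, not something the email specifies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Copy:
    """Per-item data: one physical copy held by one collector (assumed split)."""
    owner_contact: str                 # 'user contact details'
    quantity: int = 1
    location: Optional[str] = None     # private: which binder or shelf
    source: Optional[str] = None       # private: where and when it was obtained
    notes: Optional[str] = None
    date_added: Optional[str] = None   # 'date entry added'

    def public_view(self) -> dict:
        """Return only the fields that would be shown to other users."""
        return {
            "owner_contact": self.owner_contact,
            "quantity": self.quantity,
            "notes": self.notes,
            "date_added": self.date_added,
        }


@dataclass
class Document:
    """Data common to every copy of the same document (assumed split)."""
    title: str                              # hypothetical: not in the original list
    manufacturer: Optional[str] = None      # 'Related manufacturer'
    related_machine: Optional[str] = None
    page_format: Optional[str] = None       # 'Page format / type'
    issue_date: Optional[str] = None        # 'Issue / date'
    author: Optional[str] = None
    part_number: Optional[str] = None
    approx_pages: Optional[int] = None      # 'Size (pages, approx)'
    online_location: Optional[str] = None
    copies: List[Copy] = field(default_factory=list)
```

Almost everything defaults to `None` because most of the fields are described as optional, and `public_view()` leaves out 'Location' and 'Source' to reflect the suggestion that those could stay private.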

I only thought of this about a month ago and have been too busy to make a start
on it other than run a few ideas by Tony (from the classiccmp list). Initial
thought was to use something like Hypersonic as the database; the software
footprint is only a few hundred KB, plus it's Java so portability is less of a
problem as is interfacing to some sort of web-based system.

One step at a time and all that, but of course it doesn't end with
documentation, but could also be extended to systems, software, ROM images and
the like (a lot of ROMs must be close to failing in classic machines these days
and not many people make an effort to archive those!)

Put these thoughts on your site if you think it makes sense; I'm happy to
bounce ideas around with people.

Getting people to actually submit data is of course the hard part :-) I
imagine those with rarities are the ones who'll be interested in this, and
they're precisely the people who need to be attracted to an effort like this.

> It's just in its infancy, but I think it's a great idea

Same here. I just think limiting things to online data doesn't help the
preservation movement as much as it could - but it does help those with
more-common machines who want to get a bit more out of them.

cheers

Jules


I think that this is an excellent extension of the idea, so this new forum has been created for folks to post available documents that they haven't yet scanned, or may never scan. Please only post documents that you'd be willing to either copy or lend out for copying to assist another in need.

Erik

SwedaGuy
February 26th, 2007, 09:51 AM
I, too, have grappled with the problem of cataloging and preserving technical documentation.

I currently have a collection estimated at 11,000 documents and 125,000 pages. Scan it? Sure....

Cataloging it sounds like a more realistic approach, and I think I've found a program to do it. A company in Great Britain puts out a package called LexFile, which stores data in MARC (MAchine Readable Cataloging) format, the format used by 95% of libraries in this country as well as the Library of Congress.

There are a lot of programs that will handle the MARC data, but for me it can't be a Windows program, and LexFile is DOS-based, so it will run under my OS/2 network with no problem. The fact that the DOS version is also free was just a bonus. I would gladly have paid for it after reviewing it.

It should also be noted that the MARC format accommodates widely varying data, not just books. They have categories for physical items (such as the EPROMs someone else mentioned) and intellectual property (such as source code, regardless of the media format).

If you haven't worked in a library (I did, in school) it may be a bit confusing, but I would be happy to answer any questions I can.

The most important thing to stress is consistency and standards. It might be in the best interest of a few like-minded professionals to found an organization dedicated to the task of preserving this important history. My personal goal is to have my catalog on the internet, so that other people can google a particular model and see that I have the book they want. Imagine if a few of us who have larger collections could get together and set down standards for cataloging...

Standard abbreviations for manufacturers, product lines, OEMs, etc.

Standard Media Type Classifications: (Paper hardbound, paper softbound, Microfiche, etc.)

Standard Distribution Types: (Sales Brochures, Service Documents, Programming Manuals, Users Guides, etc.)
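
[Editor's sketch] One way to enforce this kind of standard in software is a fixed vocabulary plus a lookup table for abbreviations. The Python below is a minimal illustration: the enum members simply mirror the examples listed above, and the abbreviation table and the function name `normalize_manufacturer` are hypothetical. A real table would have to be agreed between the catalogers.

```python
from enum import Enum


class MediaType(Enum):
    PAPER_HARDBOUND = "paper hardbound"
    PAPER_SOFTBOUND = "paper softbound"
    MICROFICHE = "microfiche"


class DistributionType(Enum):
    SALES_BROCHURE = "sales brochure"
    SERVICE_DOCUMENT = "service document"
    PROGRAMMING_MANUAL = "programming manual"
    USERS_GUIDE = "users guide"


# Hypothetical abbreviation table, for illustration only.
MANUFACTURER_ABBREVIATIONS = {
    "international business machines": "IBM",
    "ibm": "IBM",
    "digital equipment corporation": "DEC",
    "dec": "DEC",
}


def normalize_manufacturer(name: str) -> str:
    """Map a free-text manufacturer name to its agreed abbreviation.

    Case, periods, and extra whitespace are ignored. Unknown names raise
    ValueError so that inconsistent entries are caught when they are typed
    in, rather than discovered later in the catalog.
    """
    key = " ".join(name.lower().replace(".", "").split())
    try:
        return MANUFACTURER_ABBREVIATIONS[key]
    except KeyError:
        raise ValueError(f"no standard abbreviation agreed for {name!r}") from None
```

Rejecting unknown values, instead of storing whatever was typed, is what keeps two catalogs comparable later on.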

Sharkonwheels
May 25th, 2007, 09:00 PM
I'd rather set up a doc mgmt system and scan x amount per week, say one manual a week/day/whatever. Something based on MS SQL Server/MSDE would work great, or MySQL, free Sybase SQL servers, etc.

Wouldn't be too hard to get something done, even using Access to create the DB in MSDE.
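
[Editor's sketch] As a rough illustration of how small such a database could start out, here is a version using Python's built-in `sqlite3` module. SQLite is a substitution for MSDE or MySQL, made only so the example is self-contained. The table, column, and function names are all hypothetical.

```python
import sqlite3


def create_catalog(conn: sqlite3.Connection) -> None:
    """Create a single-table catalog of manuals if it does not exist yet."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS manual (
            id           INTEGER PRIMARY KEY,
            title        TEXT NOT NULL,
            manufacturer TEXT,
            pages        INTEGER,
            scanned      INTEGER NOT NULL DEFAULT 0   -- 0 = paper only, 1 = scanned
        );
        """
    )


def add_manual(conn, title, manufacturer=None, pages=None) -> int:
    """Insert one manual and return its row id."""
    cur = conn.execute(
        "INSERT INTO manual (title, manufacturer, pages) VALUES (?, ?, ?)",
        (title, manufacturer, pages),
    )
    conn.commit()
    return cur.lastrowid


def next_to_scan(conn, limit=1):
    """Return unscanned manuals, shortest first (unknown page counts last)."""
    return conn.execute(
        "SELECT id, title, pages FROM manual WHERE scanned = 0 "
        "ORDER BY pages IS NULL, pages LIMIT ?",
        (limit,),
    ).fetchall()


def mark_scanned(conn, manual_id) -> None:
    """Record that a manual has been scanned."""
    conn.execute("UPDATE manual SET scanned = 1 WHERE id = ?", (manual_id,))
    conn.commit()
```

A "one manual a week" routine would then amount to calling `next_to_scan(conn)`, scanning whatever it returns, and calling `mark_scanned()` afterwards.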

The main problem is that the originals are aging, and getting worse.

Unfortunately, when they're gone, they're gone.


Tony

mbbrutman
May 26th, 2007, 05:51 AM
I'm not just interested in scanning, but doing OCR as well. Having the text of the things I scan searchable is important.

Has anybody looked into the current state of the art for OCR packages? I'm sure that Adobe has something good, but I generally can't justify spending their kind of money for a hobby project like this.

carlsson
May 26th, 2007, 04:45 PM
I don't associate Adobe with OCR software. More likely Paperport, or whoever OmniPage comes from. I've tried some OEM versions that come bundled with scanners. They generally are good if the source is readable and mostly text, but as always there is a bit of post-processing, in particular if the documentation contains tables, illustrations and other pictures. Once the document is finished, you may want to save it as PDF since it is the least proprietary among proprietary formats that maintains layout and images. Something HTML-ish might work too, but would be more fiddly to download.

Sharkonwheels
May 26th, 2007, 09:44 PM
> I'm not just interested in scanning, but doing OCR as well. Having the text of the things I scan searchable is important.
>
> Has anybody looked into the current state of the art for OCR packages? I'm sure that Adobe has something good, but I generally can't justify spending their kind of money for a hobby project like this.

Acrobat files are searchable. I scanned in a PC-MOS Troubleshooting Guide, and when I clicked search, it asked whether I wanted it to build the database. It did (this scans all pages, OCRs them, and adds a db of words), and that was it.
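
[Editor's sketch] The "db of words" idea can be illustrated with a simple inverted index that maps each word to the pages it appears on. This is not a description of how Acrobat works internally. It assumes the OCR text of each page is already available as a string, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_word_index(pages: List[str]) -> Dict[str, Set[int]]:
    """Map each lower-cased word to the set of 1-based page numbers containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the sorted page numbers that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Intersect the page sets of all query words; a missing word matches nothing.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

The index is built once per document, which matches the one-time "build the database" step described above. Later searches only look words up.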

OCR alone wouldn't work for me, as a lot of the docs I have also contain images, etc. Do OCR programs add them in, or just import text only?

When I say images, I mean important stuff, like layouts, inter-connections, system diagrams, etc.


Tony

mbbrutman
May 27th, 2007, 06:14 AM
Which version of Acrobat includes the OCR feature?

Sharkonwheels
May 27th, 2007, 06:21 PM
I'm using the Acrobat Standard 7 that came with my Fujitsu 5110EOX2 scanner from work. When I scanned in, it was just images. When I tried searching, it said it needed to OCR it (or something like that). As each page was processed, you could see its progress messages - 'skewing page', scanning for letters, scanning for words, running OCR service, etc.


Tony

SwedaGuy
June 13th, 2007, 07:49 AM
I actually purchased Acrobat Distiller and paid around $900.00 for a 10,000 page license. I've never gotten around to starting the project. Well, for starters, I need to find a decent scanner.

I agree that the quality and availability of source documents is declining, so I suppose time is of the essence. But I still think there should be some kind of standards in place for doing it.

lynchaj
June 13th, 2007, 09:58 AM
My recommendation is to coordinate with the folks from bitsavers.org since they have already plowed through all these issues and have established standards on how to scan documentation, etc.

Al Kossow is on this forum some place and maybe he can chime in. The people at bitsavers.org have an excellent system in place and I would make any solution for the problem consistent with what they have already done.

Thanks!

Andrew Lynch