What's on Jason's Hard Drive
Step Four: Other Assets
Under Perforce I keep a lot more than just scans. Each employer or client I've had gets a subdirectory related to that employment and the work I've done there. By convention I give each employer or client an immediate biz subdirectory holding the electronic paystub or PO and invoice records along with scans of any agreements. This is one of those cases where maybe those things should go in misc-legal. When in doubt, soft link. Try that with paper!
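A minimal sketch of the "when in doubt, soft link" idea in Python, assuming a POSIX-style filesystem where symbolic links are available. The helper name `link_into` and any file names passed to it are hypothetical and only illustrate the approach:

```python
import os
from pathlib import Path


def link_into(target: Path, link_dir: Path) -> Path:
    """Create a relative symlink to `target` inside `link_dir` and return the link path."""
    link_dir.mkdir(parents=True, exist_ok=True)
    link = link_dir / target.name
    # A relative link target keeps the link valid if the whole tree is moved or copied.
    relative_target = os.path.relpath(target, start=link_dir)
    if not link.is_symlink():
        os.symlink(relative_target, link)
    return link
```

With this, an agreement scan stored in an employer's biz directory could also show up under misc-legal without being copied: call `link_into` with the scan's path and the misc-legal directory.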
These days electronic books are becoming popular, so I created a /perforce/ebooks area. In there I keep, for example, ebooks/pragprog/pragmatic-automation.pdf, a book from my friend Mike Clark.
I've long had a /perforce/hacks directory to hold little personal hack programs I don't want to lose. It holds the code from my freshman year poker-playing game (written in Pascal!), my senior year lread program (to read a Linux filesystem from DOS), and of course several recent handy XQuery utilities. It's also a good repository for hacks friends share with me that aren't public. By putting the files into Perforce, they easily move with me during machine upgrades, while other little-used files get left behind.
Ever since my first digital camera I've had a /perforce/photos area to hold any digipic I would be upset to lose. They take up space (raw space times three because of my replicated system) but that way I know I'll always have them. Too many people lose all their photos when a hard drive fails. Photos that don't make the Perforce cut get stored on an external drive hanging off the Mac server. It's a single point of failure, but oh well. I didn't like them much anyway.
Under the /perforce/photos directory I keep subdirectories oriented by date. For example, photos/20050506-hawaii holds images from a Hawaiian vacation in May. Sometimes when various people take pictures of an event, I put their copies in subdirectories named after them. I use ACDSee to view my photos. It makes it easy to view directories or groups of directories at a time. I wish iPhoto on my Mac would understand hard drive organizations more.
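The date-prefixed naming shown in photos/20050506-hawaii can be sketched in a few lines of Python. The function names `dated_name` and `parse_dated_name` are hypothetical, and the sketch assumes every name starts with an eight-digit YYYYMMDD stamp followed by a hyphen:

```python
from datetime import date, datetime


def dated_name(day: date, label: str) -> str:
    """Build a name such as '20050506-hawaii' from a date and a short label."""
    return f"{day:%Y%m%d}-{label}"


def parse_dated_name(name: str) -> tuple[date, str]:
    """Split a dated name back into its date and label (raises ValueError on a bad stamp)."""
    stamp, _, label = name.partition("-")
    return datetime.strptime(stamp, "%Y%m%d").date(), label
```

One benefit of the zero-padded year-month-day order is that a plain alphabetical sort of the names is also a chronological sort, so any file browser lists the directories in date order.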
Last but not least, I keep a /perforce/writing subdirectory. It holds things like, well, this!
Here's a skeleton view of everything I've described. A quick check shows I presently have 28,000 files under Perforce with 1,500 of them under scans, so naturally what you see here is just a wee sample:
perforce/
  scans/
    financial/
      taxes/
        20030519-form5498.tif
      loans/
        20041111-acura-tl.tif
      fidelity/
        20060531-statement.pdf
      vanguard/
        20060531-statement.pdf
    receipts/
      personal/
        2005/
        2006/
      donations/
        2005/
        2006/
      reimbursed/
        2005/
        2006/
      selfempl/
        2005/
        2006/
          20060224-cell.tif
    autos/
      2004-tl/
        20060511-15k.tif
    house/
    misc-legal/
    fun/
  ebooks/
    pragprog/
      pragmatic-automation.pdf
  hacks/
    lread/
    nedpoker/
  photos/
    20050506-hawaii/
      IMG_0447.jpg
  writing/
    javanet/
    javaworld/
    oracle/
  sgi/
    biz/
  marklogic/
    biz/
Last Tip
Last tip: I've had a great experience keeping a "work journal" file (stored under Perforce of course). My journal file consists of two parts, separated by an easy-to-search-for and appears-nowhere-else marker such as ----. Above the line I place past accomplishments in chronological order organized by day. Below the line I list my to do items, priority ranked so the more urgent ones are on top. To be future-proof and OS-resistant I keep the file as simple text.
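The two-part journal layout can be sketched as a small Python function that moves an item from below the line to above it. Only the ---- marker comes from the description above. The day-heading format (an ISO date on its own line), the two-space indent for entries, and the name `complete_item` are assumptions made for illustration:

```python
from datetime import date

MARKER = "----"  # separator line between accomplishments (above) and to-do items (below)


def complete_item(journal: str, item: str, note: str, day: date) -> str:
    """Remove `item` from the to-do list and record `note` under `day` above the marker."""
    lines = journal.splitlines()
    split = lines.index(MARKER)  # ValueError if the marker line is missing
    done, todo = lines[:split], lines[split + 1:]
    todo.remove(item)  # ValueError if the item is not on the to-do list
    heading = day.isoformat()
    # Assumes entries are always added for the most recent day, so the
    # heading is either absent or already the last one in the done section.
    if heading not in done:
        done.append(heading)
    done.append(f"  {note}")
    return "\n".join(done + [MARKER] + todo) + "\n"
```

Because the file stays plain text, the same edit is just as easy to make by hand in any editor. The function only shows how little structure the format needs.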
I started the journal the day I met Tim O'Reilly and received the offer to write the book Java Servlet Programming. It helped me track my progress (I averaged one chapter every three weeks), record discoveries, and remember people I needed to talk with and their contact information (I can't put everyone in the Palm Pilot). After a hard day, the journal shows me exactly what I accomplished, and years later the record is still there. By keeping the record in the virtual world it seems more real to me. Go figure.
Sometimes I feel like my best job description is someone who moves things from below a line to above it, eternally changing "to do" items into "how I did it" entries. I do search back on the entries quite often to remember things. The steps to do a JDOM release, how to move funds to Fidelity, and the location where I found the cool tcsh shell for Windows are all things I look up. As of today it's over 440,000 words in a 2.6 MB text file. It seems that I average 250 "words of work" per workday.
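Searching back through past entries needs nothing more than a text search restricted to the part above the marker. A minimal Python sketch, assuming the ---- marker sits on a line of its own; the function name `search_done` is hypothetical:

```python
def search_done(journal: str, keyword: str, marker: str = "----") -> list[str]:
    """Return the lines above the marker that contain `keyword`, ignoring case."""
    lines = journal.splitlines()
    # Everything before the marker line is the record of past work;
    # if no marker is present, the whole file is searched.
    above = lines[:lines.index(marker)] if marker in lines else lines
    needle = keyword.lower()
    return [line.strip() for line in above if needle in line.lower()]
```

In practice a text editor's search or grep does the same job. The point is that a plain text file keeps every such tool available.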
I hope my habits and conventions can inspire you to dig out of the paper pile in your office. If you have ideas to share, I'd like to hear them. Now, if you'll excuse me, I have to go and note in my work journal that I finished this article!
Jason Hunter is Principal Technologist with Mark Logic and the author of Java Servlet Programming.
Showing messages 1 through 22 of 22.
-
document scanner
2009-03-19 12:54:05 Tom L
-
document scanner
2009-03-19 18:03:44 jchunter
Yes, recently I picked up a Fujitsu ScanSnap and it's terrific. I have the S510M model (the M stands for the Mac edition). It's super fast, has a small footprint, does duplex by having two sensors read both sides at the same time, includes good OCR software, and is as easy as pushing a button. It's gotten great reviews on Amazon and it deserves them.
It writes PDF files instead of TIF files, which is a bit easier for non-techie people to read. If you run the OCR, it embeds word bounding boxes in the PDF, so you can search for words and even highlight and copy/paste them while the pages still look like the scanned version. That's crazy cool.
Only downside is there's no flatbed scanner so if you want to scan an open book, for example, you can't. You could xerox it at work and scan it at home though.
-
Windows based CVS?
2008-02-13 13:02:09 ColoradoSkier
I like this idea, and would like to use a CVS that works on Windows. Any suggestions?
-
scanner suggestions?
2007-02-22 17:56:29 tangentialist
I'm trying to scan years of accumulated files and writing, and could use recommendations on a good (but not prohibitively expensive) ADF scanner. I keep going back to the ScanSnap, but it's kinda pricey.
What does everyone else have? Likes/dislikes about it?
-
Scanning software?
2007-02-22 11:58:47 dforemsky
Does anybody have a recommendation for 3rd party or open source scanning software?
I have a scanner, but the software that came with it is extremely frustrating to use; every time I scan a document, I have to walk down the directory tree to the save directory - the software does not keep the location of the last save. This feature alone would make my scanning much more efficient.
Thanks!
-
Metadata
2007-02-22 11:00:08 chrismelikian
Why not have a .txt file for every .jpg, .pdf, etc. with the same file name to hold the metadata? Then when doing a search you can search the text and find the folder which contains the file. Might be a tad clumsy but should work.
-
Scanr
2007-02-22 08:04:39 jmann_au
When scanning, think about scanning to a JPEG file then emailing the result to an online service like www.scanr.com, which will then email back to you a multipage PDF document with the text included so you can search. I have found the text recognition capabilities of Scanr brilliant and it can only improve.
It's easy to create a simple workflow that does that for you, then if you are really keen use a Bayesian (POPFile) email filter to do the filing work for you ;)
-
Good idea, but scanner prices
2007-02-22 07:42:43 aakoch
I'd love to do something like this, but the cost of a dual sided scanner has always turned me away from it. I just can't afford a $500 scanner for this sole purpose.
-
Great article
2007-01-08 00:40:57 jpmcc
Jason, many thanks for this inspirational article. It convinced me to do something similar (http://www.mealldubh.org/index.php/2006/12/31/filing-tonight) during our annual end of year domestic file purge. I've also started a project on SourceForge (http://phpmyarchive.sourceforge.net/) to create an entry level document management system for home / small office use.
Thank you!
-
Organizing my stuff
2006-12-01 17:53:22 KarlVogel
I had similar problems when trying to organize my files. If you're interested, here's how I handled it:
http://www.dnaco.net/~vogelke/Software/Time_Management/Work_Environment/
-
Using a Digital Camera as a Scanner
2006-11-08 12:33:01 alex()
If you put a digital photo camera on a tripod, looking down onto your table, you can "scan" documents much quicker than with a scanner. You can practically "scan" as quickly as you can flip the pages if you have some sort of remote control for the camera. Mark the rectangle that the camera can see on your desk and it's easier to position the document to be scanned.
This setup is also fine to quickly get hand-drawn illustrations into the computer, and is much faster than creating the same illustration with a computer program.
By the way, TIFF is an open format, without lossy compression, so in theory you should be able to find/write software that can read the files in 10-20 years from now.
-
why TIFF
2006-11-08 08:28:32 rothko
Is there any particular reason you are saving them as TIFF files?
-
Paper paper paper
2006-11-07 12:38:44 alex_on_the_boat
I’ve been using a simpler system for several years. Not necessarily better.
I’m so glad to see this article and replies since I now know I am not alone in doing this.
Living on a boat and hence small space, space is at a premium. Digitising paperwork makes it so much easier.
This has expanded to take in magazine articles too.
Now it includes films and music plus photos.
The caveats are the same.
The investment in a decent scanner hurts and becomes a business class model when fitted with an Automatic Document Feeder. You need this.
The time it takes me to scan the documents seems to far outweigh the value of the information. Then when I need something, being able to find it is priceless.
Out of 100% of the information only 20% or less has any future relevance.
However when it came to a matrimonial divorce the value of the information far outweighed the apparent effort at the time and left me aghast with what I was able to provide and prove. Don’t marry someone who runs this system!!
I gave up with OCR as the additional time and correction penalty was more than I could handle.
The real issue is being able to search for stuff.
A filename won’t contain enough information and remembering how and where you stored stuff for me became a problem to which I haven’t found an answer. This means ‘metadata’ is also required. Soft links sound interesting.
The Adobe system sounds interesting.
The utopia as I currently see it is to be able to search information and report on it.
That is, how many minutes I spent talking to Jane on my mobile phone over the last 4 months, how much I spent on fuel for the car against the last car, how much on food. To do this without having to enter the information into many silos and then design reporting on it would be superb. I want to live life, not spend life scanning stuff or sat before that and ‘Quickbooks’.
I figure some sort of SQL engine would be it, but not across TIFs or scans but across the information contained within those scans which is OCR’d metadata.
-
Why use the revision control system?
2006-11-06 17:52:36 misko
For the documents you are scanning in, how often do you really take advantage of the features Perforce gives you? I do something similar (check out the Fujitsu ScanSnap for an inexpensive 2 sided sheet feed scanner), then OCR using Acrobat. Acrobat has a catalog feature to make all PDFs searchable, or something like Google Desktop allows them to be searchable. I scan in all my bills (that I can't receive electronically) and then shred rather than save them in a box.
But back to my real reason for writing. I do all this with just a directory hierarchy. I'm not sure what is gained by using a revision control system for things like bills, etc. that don't tend to have revisions. You get one per month and that's that.
Thanks, nice to see others think the same way.
-
Why use the revision control system?
2006-11-07 08:41:24 p2pvoice
Jason, this article may indeed change my life. I constantly struggle with how to organize information - what you techies call taxonomy. The beauty of your hierarchical file structure and naming convention is its simplicity. I found this aspect of the article more useful to me than the fact that you use Perforce!
That leads me to two questions:
1. What benefits do you get from the "revision control" aspect of Perforce?
2. Is there a similar "personal document manager" that would do the job?
Thanks, Jason!
(I'll be thinking of you all through the holiday season when I commit myself to organize virtually.)
-
Why use the revision control system?
2006-11-07 14:52:37 Jason Hunter
Hi,
There are several reasons I've evolved my system to use a revision control system.
1. First and foremost, not all files in my repository are write once. Many are, but not all (think resume.doc). Plus I store my active coding projects in the repository.
2. Easy replication. Makes it easy to have copies of each file in every client. Something like a RAID or backup system doesn't help here. (The job for a RAID is to host the Perforce server.)
3. I want something that works across operating systems and across decades. Revision control is well understood, reliable, prevalent, and here for the long term.
-jh-
-
DM populi
2006-11-04 12:04:52 gavin@terrill.com
I used to work for a document management software vendor, and they espouse the value of not putting things in folders but instead indexing using metadata. In theory this loose coupling gives you greater flexibility and search power, but in practice people (and today's tools) prefer hierarchies. Plus the cost of capturing the metadata is rarely recouped. The directory structure presented here looks straightforward and simple - much better!
Anyway, re OCR, it will be interesting to see what happens with the tesseract project now that a couple of Googlers are on board. Coupling that with Lucene would get you some ways towards a solution.
-
similar system
2006-11-04 03:17:29 jhealy
Joey Hess has written an article on a similar system he has using Subversion - although he doesn't mention scanning of meatspace documents.
http://kitenet.net/~joey/svnhome.html
-
Text recognition
2006-11-03 11:39:08 atotic
I do something similar, but with automatic text recognition. Check out Scan2PDF. They offer an OCR plugin for around $200. If you scan at 300 dpi, the OCR works really well. It creates a transparent text document on top of the scanned image. The added bonus is that everything is stored as PDF, and searchable by Google Desktop, etc.
http://www.scantopdf.co.uk/
Aleks
-
Form-Filling
2006-11-03 06:12:21 danyul
When I saw your post I was hoping you'd provide some tips on how to do something I've been wondering about for some time. Mainly, what are the best methods to scan a form in and then be able to mark the user-fill-outable areas so that I can then fill out the form electronically? I'm wondering about both free and proprietary solutions, on Linux or Windows. Any ideas?
-
Form-Filling
2007-01-16 18:26:34 techywiz
I wanted to edit my PDFs too, and instead of buying expensive Adobe Distiller I got something for $19.99 called PDFill, with a bunch of free tools included. Check out www.pdfill.com. I have used it with great success to do a simple task.
-
Form-Filling
2006-11-06 18:09:37 misko
For form filling, I use Acrobat 7 Pro, scan in the document, then go through and mark up the user fill areas. It's a slow process though, but then I'm no expert at their form tool. Perhaps someone with more experience with it can comment if it gets easier.







The most recent example is my spending two days searching for some legal documents, with only partial success. And literally getting sick in the process from dust etc. My problem is I never like to discard anything. I have been a lawyer, an historian, a secretary with several organizations, and I have a concern for actually preserving past records. If I had just had a good secretary instead of a bad marriage ... well, let's not get into that.
In an ideal world I would like to be able to scan just any paper I get. Obviously most paper is not worth keeping in the original form - the copy in digital form is just as good. If I had started with this 20 or 30 years ago, I might actually have had more of a life over those years. And I might actually find those papers I do need in their original form like my mother's will ... luckily she is still healthy at 94.