I have extended the camera mounting brackets by an extra 40mm, so I can now photograph a full A4 sheet of paper. The RPi now mounts an NFS share on my home QNAP, so no data files are stored on the Doc-Pi itself.
This is now turning into more of an EDMS (Electronic Document Management System) and software project. I still have to convert the breadboard into a permanent circuit board, but the rest is mostly software.
I have lots of ideas on this and have pondered using the community version of EDMS packages like Knowledge Tree (https://www.knowledgetree.com/) or Alfresco (https://www.alfresco.com/community). After a quick bit of digging around, I see that someone has created the Python based Mayan project (http://www.mayan-edms.com/). Part of me (the masochistic side) wants to write a Python based home-brewed system, but if I only stick to using Doc-Pi for home paperwork, invoices etc. then I am really only looking at perhaps 100,000 to 200,000 documents. The Drupal CMS (https://www.drupal.org/) can easily handle this number of documents, and I know that package very well and it's fun to work with, so maybe I should try it out first.
At the moment though, my current thinking is that I am only going to use this to store photographed copies of invoices, documents etc. These will be OCR'ed (Optical Character Recognition) into a text format, and then the text will be partially processed with the NLP (Natural Language Processing) Python package NLTK (http://www.nltk.org/).
A few tips for people interested in this kind of thing. To OCR an image, you will need to install Tesseract (https://code.google.com/p/tesseract-ocr/). On a RPi you just need to run: sudo apt-get install tesseract-ocr
You can then OCR a jpeg file like this: tesseract 2015-11-14_182550.jpg 2015-11-14_182550
On a RPi 2, this can take 2-3 minutes per 500KB jpeg file (1900px by 2500px). The smaller the file, the faster it is; the more pixels per 'character', the better the OCR quality (i.e. fewer errors). It's good enough for what I need for the moment.
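If you end up with a folder full of captured jpegs, the OCR step can be scripted rather than typed per file. The snippet below is only a rough sketch of how that might look; the /home/pi/ocr folder is an assumption on my part (it matches the path used in the NLP script further down), and tesseract writes a .txt file alongside each image.
Code:
# batch_ocr.py - rough sketch: run tesseract over every jpeg in a folder
# (the folder path is an assumption, change it to wherever your captures live)
import os
import subprocess

src_dir = '/home/pi/ocr'

for name in os.listdir(src_dir):
    if name.lower().endswith('.jpg'):
        img = os.path.join(src_dir, name)
        base = os.path.splitext(img)[0]  # tesseract adds the .txt extension itself
        print "OCR:", name
        subprocess.call(['tesseract', img, base])  # same command as shown above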
To install the NLTK (Natural Language Toolkit) software package: sudo pip install nltk (if you don't have pip, you need to install it first: sudo apt-get install python-pip). It's best then to download some supporting data files (a scripted alternative is shown below the list):
python # run python from the RPi CLI
import nltk
nltk.download() # pops up a menu, which I used to download these files:
averaged_perceptron_tagger
punkt
wordnet
words
toolbox
brown
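If you'd rather not click through that menu each time (say, when rebuilding the SD card), the same data packages can be fetched non-interactively. This is just a small convenience sketch; it assumes the package names in the list above are the ones you want.
Code:
# fetch the NLTK data packages listed above without the interactive menu
import nltk

for pkg in ['averaged_perceptron_tagger', 'punkt', 'wordnet',
            'words', 'toolbox', 'brown']:
    nltk.download(pkg)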
Here is some simple exploratory code to get you started:
Code:
#
# This will import a text file and do some NLP (Natural Language Processing) on the words.
# import all the python modules we need
import nltk
from nltk.corpus import wordnet

# path to the OCR'ed text file produced by tesseract
path = '/home/pi/ocr/2015-11-14_182550.txt'
print "load file"
raw = open(path, 'rU').read()
print "decode"
# strip anything that is not plain ASCII - OCR output can contain odd bytes
words = raw.decode('unicode_escape').encode('ascii', 'ignore')
print "tokenize"
text = nltk.word_tokenize(words)
# analyse and tag the tokenized list of words with their part of speech
tag_text = nltk.pos_tag(text)
minlength = 3  # only process words longer than this
nouns = set()
pnouns = set()
verbs = set()
adjectives = set()
number = set()
symbol = set()
foreign = set()
url = set()
# build sets of the various types of words and check that they are legitimate words
print "start processing:"
for word, pos in tag_text:
    if len(word) > minlength:
        # find web URLs
        if len(nltk.regexp_tokenize(word, r'(http://|https://|www.)[^"\' ]+')):
            url.add(word)
        if pos in ['SYM']:
            symbol.add(word)
        if pos in ['FW']:
            foreign.add(word)
        if pos in ['CD']:
            number.add(word)
        if not wordnet.synsets(word):
            # not in WordNet - probably an OCR error, so just show it
            print word
        else:
            if pos in ['NN']:
                nouns.add(word)
            if pos in ['NNP']:
                pnouns.add(word)
            if pos in ['VBZ', 'VBG']:
                verbs.add(word)
            if pos in ['JJ']:
                adjectives.add(word)
print "Nouns: ", nouns
print "Proper Nouns: ", pnouns
print "Verbs: ", verbs
print "Adjectives: ", adjectives
print "Numbers: ", number
print "Symbols: ", symbol
print "Foreign: ", foreign
print "URLs: ", url
# list out some frequency stats
print
print "Frequency Stats:"
fd = nltk.FreqDist(text)
print fd.most_common(50)  # the 50 most frequent tokens and their counts
After about 60 seconds of processing, the above script will display a lot of interesting info on the text file you fed into it.
At the moment, for my document 'keywords' I am going to go with the first 20 nouns and proper nouns, plus any URLs that might be listed in the doc. It's very arbitrary and might change once I have a solid block of sample docs OCR'ed. Ideally, I would like it to extract pertinent info based on the kind of document, i.e. a bank statement might only record date, bank name, account number, balance etc.; an electricity statement might have date, company, amount, average daily units used; an insurance doc might record date, company, amount, policy number and policy update text. All of this info will be stored in an SQL DB, with a web interface.
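Just to sketch out the storage side, something like the snippet below could push those keywords into a SQLite table. The table layout, column names and the db path are placeholders I've made up for illustration, not a final design; in practice the nouns, pnouns and url sets built by the script above would be passed in.
Code:
# store_keywords.py - sketch of saving extracted keywords into SQLite
# (table/column names and the db path are made-up placeholders)
import sqlite3

def store_keywords(db_file, doc_name, nouns, pnouns, urls):
    conn = sqlite3.connect(db_file)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS keywords "
                "(doc TEXT, keyword TEXT, kind TEXT)")
    # take the first 20 nouns/proper nouns - sets are unordered, so this is arbitrary
    for word in list(nouns | pnouns)[:20]:
        cur.execute("INSERT INTO keywords VALUES (?, ?, ?)", (doc_name, word, 'noun'))
    for link in urls:
        cur.execute("INSERT INTO keywords VALUES (?, ?, ?)", (doc_name, link, 'url'))
    conn.commit()
    conn.close()

# example usage with made-up values; in practice pass in the sets built above
store_keywords('/home/pi/ocr/docpi.db', '2015-11-14_182550',
               set(['invoice', 'electricity']), set(['Acme']),
               set(['http://www.example.com']))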
Hopefully this will end up doing a bit more than an Evernote app.