A couple of months ago, we decided to start tracking where our money gets spent. We wanted to see if we could do a little better each month. And if we collected this data for a few months, we might see a spending pattern emerge that would help us budget our expenses far better going forward. There is just one hiccup: data entry. I still went ahead and did it for two months. I have excused myself this month and will resume for November and December. There has to be a better way, right?
I have not tried looking online to see what other people are doing. I am just listing out some options.
An ideal situation is when the biller can send me an itemized bill in a format that is easy to parse. Quite a few vendors, especially with the rise of online shopping and mCommerce, email us an itemized bill, but usually as a PDF. I haven’t tried to parse those PDFs yet. If it works, this is perhaps the best place to start. It might work because these PDFs are not scanned images and hence should contain enough data to recover the text correctly. In fact, there is a Python module to do this. It will leave us with a lot of post-processing to handle, but you can look at that as data science, and tools like pandas may simplify extracting structure from unstructured data. So, one possible direction: get itemized bills as PDFs that can then be parsed for structure and data. There could be multiple parsers, almost one for each vendor.
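To make the "one parser per vendor" idea concrete, here is a rough sketch of what it might look like once the text has already been pulled out of the PDF. Everything specific in it is an assumption for illustration: the vendor name `acme`, the line layout (`description  qty x unit price  total`), and the helper names `parse_acme` and `parse_bill`. Real bills will need their own patterns.

```python
import re
from decimal import Decimal

# Assumed (hypothetical) line layout: "<description>  <qty> x <unit price>  <total>"
LINE_RE = re.compile(
    r"^(?P<item>.+?)\s+(?P<qty>\d+)\s*x\s*(?P<unit>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def parse_acme(text):
    """Parser for one made-up vendor; returns one dict per line item."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # headers, totals and blank lines simply do not match
            rows.append({
                "item": match["item"],
                "qty": int(match["qty"]),
                "unit_price": Decimal(match["unit"]),
                "total": Decimal(match["total"]),
            })
    return rows


# Registry: one parser per vendor, keyed by a vendor name we choose ourselves.
PARSERS = {"acme": parse_acme}


def parse_bill(vendor, text):
    """Dispatch already-extracted bill text to the parser for that vendor."""
    if vendor not in PARSERS:
        raise ValueError(f"no parser registered for vendor {vendor!r}")
    return PARSERS[vendor](text)


# Stand-in for text extracted from a PDF bill.
SAMPLE = """ACME Store Invoice
Milk 1L        2 x 1.50    3.00
Brown bread    1 x 2.25    2.25
Total                      5.25
"""

if __name__ == "__main__":
    for row in parse_bill("acme", SAMPLE):
        print(row)
```

The resulting list of dicts can be handed straight to `pandas.DataFrame(rows)` if you want to do the monthly roll-ups there.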
But not all vendors send out PDF receipts/bills. And even when they do, I need to come up with extensive scripts (IFTTT-type) to extract attachments from emails that are possibly receipts/bills, copy them somewhere accessible (Dropbox/iCloud/Google Drive) and run the relevant parsers to extract data into a database. At least that part is feasible. The bigger problem is that many vendors don’t send out PDFs at all. And with all the privacy issues around, people are not going to give their email addresses to every vendor out there. What guarantees that the emails won’t be sold to marketing/ad campaigns? None!
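The attachment-extraction step, at least, can be sketched with nothing but Python's standard `email` package. This assumes the raw message bytes have already been fetched somehow (IMAP, an export, whatever), and it only looks at top-level attachments, so PDFs nested inside forwarded messages would be missed. The function names and the sample addresses are made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def pdf_attachments(raw_email):
    """Return (filename, bytes) for each top-level PDF attachment in a raw message."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            # get_content() returns the decoded bytes for non-text parts
            found.append((part.get_filename(), part.get_content()))
    return found


def build_sample_email():
    """Build a tiny fake receipt email so the sketch can be run offline."""
    msg = EmailMessage()
    msg["From"] = "billing@vendor.example"
    msg["To"] = "me@home.example"
    msg["Subject"] = "Your receipt"
    msg.set_content("Receipt attached.")
    msg.add_attachment(
        b"%PDF-1.4 fake",  # placeholder bytes, not a real PDF
        maintype="application",
        subtype="pdf",
        filename="receipt.pdf",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    for name, payload in pdf_attachments(build_sample_email()):
        print(name, len(payload), "bytes")
```

From here, writing each payload into a synced folder and kicking off the matching vendor parser is the remaining glue.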
For vendors that cannot send out PDFs, there are two alternatives. One is scanning the receipts into an image and running variants of this receipt-parser. The other is more expensive but a nicer solution. First, the receipt-parser direction: it needs a lot of customization. And some grocery receipts tend to be so long that getting them into a single legible image is going to be a task. Parsing multiple images of one receipt and maintaining context across them is going to blow up the complexity of the script. But since this is entirely in our control, it is a workable solution.
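One way to keep context across several photos of the same long receipt is to shoot them with a little overlap and stitch the recognized text back together. The sketch below assumes the OCR step has already produced a list of text lines per photo, in top-to-bottom order, and that overlapping lines come out identical. Both are optimistic assumptions: OCR noise would call for fuzzy matching, and a receipt that repeats the same item right at a photo boundary could be merged too aggressively. The function name `stitch` is made up.

```python
def stitch(parts):
    """Merge per-photo line lists, dropping lines repeated in the overlap."""
    merged = []
    for lines in parts:
        overlap = 0
        # Find the longest tail of what we have that equals the head of the new part.
        for k in range(min(len(merged), len(lines)), 0, -1):
            if merged[-k:] == lines[:k]:
                overlap = k
                break
        merged.extend(lines[overlap:])
    return merged


if __name__ == "__main__":
    photo_1 = ["Milk 1L  3.00", "Bread  2.25", "Eggs  4.10"]
    photo_2 = ["Bread  2.25", "Eggs  4.10", "Rice 5kg  9.80"]
    photo_3 = ["Rice 5kg  9.80", "TOTAL  19.15"]
    for line in stitch([photo_1, photo_2, photo_3]):
        print(line)
```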
The second alternative involves adding a small gadget to the receipt printer connection on the vendor’s side: a gadget that can sniff the data being sent to the receipt printer and stream it over Bluetooth, WiFi or NFC to a companion app. You may be able to pair them anonymously and temporarily, so that the pairing does not happen automatically whenever you walk into the store, which would raise privacy concerns. The companion app receives the stream and generates a PDF that we can use just like the PDFs generated by vendors. Or maybe we could add enough mojo to the app to parse the stream and generate itemized/structured text or CSV that can easily feed into standard home budgeting or to-do list apps. I personally like this approach since it involves technology, but I don’t know if we’ll be able to convince any vendor to add another gadget to their billing lineup, or to replace their receipt printer with something that simplifies receipt processing for a small percentage of their customers.
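The "parse the stream into CSV" part of the companion app could look roughly like this. It is only a sketch under generous assumptions: the gadget is imagined to deliver plain ASCII text in arbitrary chunks with printer control codes already stripped, and each item line is assumed to be a description, at least two spaces, then a price. Real printer protocols are messier, and the names `iter_lines` and `stream_to_csv` are invented here.

```python
import csv
import io
import re

# Assumed item layout: "<description><two or more spaces><price>"
ITEM_RE = re.compile(r"^(?P<item>.+?)\s{2,}(?P<price>\d+\.\d{2})$")


def iter_lines(chunks):
    """Reassemble complete text lines from byte chunks that may split mid-line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        *complete, buffer = buffer.split(b"\n")  # last piece may be incomplete
        for raw in complete:
            yield raw.decode("ascii", errors="replace").strip()
    if buffer.strip():
        yield buffer.decode("ascii", errors="replace").strip()


def stream_to_csv(chunks):
    """Turn the receipt stream into CSV text with an item,price header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["item", "price"])
    for line in iter_lines(chunks):
        match = ITEM_RE.match(line)
        if match and match["item"].upper() != "TOTAL":  # skip the summary line
            writer.writerow([match["item"], match["price"]])
    return out.getvalue()


if __name__ == "__main__":
    # Chunks deliberately cut in awkward places, as a wireless stream might.
    chunks = [
        b"CORNER STORE\nApples 1kg     3.",
        b"20\nOlive oil      7.99\nTOT",
        b"AL          11.19\n",
    ]
    print(stream_to_csv(chunks), end="")
```

Buffering until a newline arrives is the important bit: the app cannot assume a chunk boundary ever lines up with the end of a receipt line.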
Bottom line: unless there is a good reason to have itemized bills shared over a paired secure connection, the only working solution to my problem is to parse PDFs in an extremely roundabout way. I wonder if there is a business hidden amidst all this.