What do I want to do?
In this demo, the objective is to extract the total amount (after discount, and after taxes) from a receipt image.
Some characteristics of these receipts:
- There isn't a fixed format
- It can be in different languages (English, S. Chinese, T. Chinese, Malay)
- It can have more than one language
- It can be an image from a camera snapshot, or an electronic file sent via email
How do I plan to do it?
I am using the Anthropic message API and Claude-3-5-sonnet-20240620 model, with a simple system prompt below:
Extract the total amount from the image. It should be a number, e.g. 100.50, usually next to the word 'total', 'total amount', 'grand total'. It can be in any languages. The currency symbol is RM
Test Results
Test #1: An A4 size receipt, itemized in table format.
Result: ✅ Success
Test #2: A 58mm thermal receipt paper, captured by phone camera. The text density is very high, font size is relatively small.
Result: ✅ Success
Test #3 - A landscape A4 paper, captured by a phone camera, some part of the image is malformed.
Result: ✅ Success
It seems the total amount is successfully identified in all 3 tests. Good job Claude 3.5!
Cost? And some improvements
The input tokens spent per attempt is about 1500-2000, which is probably equivalent to $0.005. Can we reduce it?
I tried to trim the white space around the image, but they are not useful. And I also resize the image to around 500px before sending it into the API.
The input token is kept below 1000 and the success rate is still 100%.
That's pretty awesome. :)