What do I want to do?

In this demo, the objective is to extract the total amount (after discount, and after taxes) from a receipt image.

Some characteristics of these receipts:

  • There isn't a fixed format
  • It can be in different languages (English, S. Chinese, T. Chinese, Malay)
  • It can have more than one language
  • It can be an image from a camera snapshot, or an electronic file sent via email

How do I plan to do it?

I am using the Anthropic message API and Claude-3-5-sonnet-20240620 model, with a simple system prompt below:

Extract the total amount from the image. It should be a number, e.g. 100.50, usually next to the word 'total', 'total amount', 'grand total'. It can be in any languages. The currency symbol is RM


Test Results

Test #1: An A4 size receipt, itemized in table format.

Result: ✅ Success

Test #2: A 58mm thermal receipt paper, captured by phone camera. The text density is very high, font size is relatively small.

Result: ✅ Success

Test #3 - A landscape A4 paper, captured by a phone camera, some part of the image is malformed.

Result: ✅ Success

It seems the total amount is successfully identified in all 3 tests. Good job Claude 3.5!


Cost? And some improvements

The input tokens spent per attempt is about 1500-2000, which is probably equivalent to $0.005. Can we reduce it?

I tried to trim the white space around the image, but they are not useful. And I also resize the image to around 500px before sending it into the API. 

The input token is kept below 1000 and the success rate is still 100%. 

That's pretty awesome. :)