Inexpensive Receipt Repository - OCR

An exercise in abusing free trials


Background

I have a LOT of old documents. In my brief time as a contractor, as part of reporting my taxes, the CRA (Canada's version of the IRS) required that I keep accurate records for five years of any purchase I wanted to write off as a business expense. Knowing this, I was quite neurotic about record keeping; a habit that stuck with me long after I moved on to full-time employment. Fuel receipts, dentist appointments, groceries; you name an expense and I probably have a receipt or some kind of paper record saved for it somewhere in my stash. This led to a rather large stack of various records, both personal and business, piling up in a corner of the room I use as a home office. Keeping so many records means that while I'm generally pretty sure that I HAVE a record of something, finding that slip of paper is usually pretty difficult. Digitizing all these records seemed like a fun project, but I was always worried that the CRA wouldn't like it if I couldn't produce the original document. I have only recently seen that the CRA has shifted its stance on digitization, allowing the use of digital records so long as:

  • It is an accurate reproduction with the intention of it taking the place of the paper document
  • It gives the same information as the paper document
  • The significant details of the image are not obscured because of limitations in resolution, tonality, or hue

So, I thought now might be as good a time as any to embark on converting all my old paper records into their digital equivalents. The goal here is to build a personal OCR / Document search engine that I can use securely across devices.

Disclaimers

  1. I have not done much, if any, research into existing solutions for this because I'm approaching it as a personal project more than a business necessity. If it turns out better than expected, I may put some more work into refining the components.
  2. I'm building this on a budget that can only be described as next to nothing. This is mostly to make it more of a challenge; most people could very easily spin up a service like this with a big budget.

Requirements

Security

These documents vary in sensitivity from fuel receipts to medical records. Everything should be encrypted both in transit and at rest: no unencrypted data should ever be stored online, and the only way to get data from device to device is to authorize it via a previously used device. If any documents get leaked, they might expose my crippling sugar addiction.

Tech Reqs

Documents will be coming through this service infrequently, so it's important to me, both for price and overall system efficiency, that the service not be running while it's not in use. I also want to be able to test it locally in case of any weird document edge cases where the text being returned is not what I expected. Finally, while I do want an accurate OCR implementation, accuracy is the lowest-priority item right now; even decent-quality OCR is good enough for plain searching through documents if you add some kind of fuzzy matching.
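To show what I mean by "decent OCR plus fuzzy matching", here's a minimal sketch using only Python's stdlib difflib. The function and the sample documents are mine, purely for illustration:

```python
from difflib import SequenceMatcher

def fuzzy_search(query, documents, threshold=0.6):
    """Return (doc_id, best_ratio) pairs for documents containing
    a token similar enough to the query, best matches first."""
    hits = []
    for doc_id, text in documents.items():
        best = max(
            (SequenceMatcher(None, query.lower(), token.lower()).ratio()
             for token in text.split()),
            default=0.0,
        )
        if best >= threshold:
            hits.append((doc_id, best))
    return sorted(hits, key=lambda h: -h[1])

# An OCR typo like "grocerles" still matches a search for "groceries"
docs = {"r1": "SuperMart grocerles total 42.10", "r2": "Shell fuel 60.00"}
results = fuzzy_search("groceries", docs)
```

Token-level matching like this is crude, but it tolerates exactly the kind of single-character misreads that cheap OCR produces.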

OCR Engine

I'm a Python developer by day, and part of my job actually happens to be interacting with different OCR engines. We've used some self-run solutions like Tesseract and some hosted solutions like Textract for products where accuracy matters more. For this project I'd love to use Textract, but there's no free tier for it as far as I can see, so I'll be rolling my own OCR API. As part of my job I did some analysis of different self-hosted OCR solutions, and as far as I could tell Tesseract was the only one with a startup time low enough for a transient service; the other solutions generally relied on a longer startup process that loaded their models into memory. To manipulate the images before sending them into Tesseract I'll be using OpenCV, and for the actual OCR I'll be using PyTesseract, the Python binding library for Tesseract.

Implementation

Containers

Neither the image-processing library OpenCV nor the OCR library PyTesseract comes in the default Lambda environment. Complicating things even more, PyTesseract is actually a binding around an executable, which means that executable will need to be available to the function's runtime as well. So my options were to either upload a zip archive with the code and executables I wanted to run, or make a container that gets pushed to ECR and run the Lambda from there. In the end, I went with the container because it seemed a little bit cleaner and would likely give me a bit more useful experience.

Something that's important to note if you want to tackle building a container that will run on Lambda: you either need to build off of a Lambda-specific container base image, or, if you want to build off a custom base container, you'll need to install a special "Lambda Runtime Interface Client (RIC)" that lets Lambda call the code in the container. There was precious little information regarding the RIC, so I didn't dig too far into it.

I won't detail the full process of creating the container; there was a lot of swearing and trying to figure out what the Alpine Linux versions of specific Pillow / Tesseract build requirements were. In the end, though, the Dockerfile looked like this:

FROM python:3.9-alpine
# Install dependencies
# This should probably be moved to a builder image
RUN apk add --no-cache build-base \
    jpeg-dev \
    zlib-dev \
    tesseract-ocr \
    cmake \
    g++ \
    make \
    unzip \
    curl-dev \
    autoconf \
    automake \
    libtool \
    libexecinfo-dev 

RUN pip install awslambdaric --no-cache-dir
RUN pip install pytesseract --no-cache-dir

# Copy in the app function code
COPY app.py /

# Heres where AWS will call the fn
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]

All in, our container weighs a total of 438MB on the build instance, which for me was a t2.micro that I got under the AWS free tier. Once pushed with docker, ECR reported it as only 159.5MB; as far as I can tell, that's because ECR shows the compressed size of the image's layers rather than their size on disk.

Lambda

Lambda Function

With the container ready to go, I began testing the different functions contained in the app.py file. Including the RIC in your container allows you to test everything locally using the Lambda Runtime Interface Emulator (RIE). Here's how I was running the emulator during most of my testing:

docker run -v ~/.aws-lambda-rie:/aws-lambda \
    --env AWS_LAMBDA_FUNCTION_MEMORY_SIZE=512 \
    -p 9000:8080 \
    --entrypoint /aws-lambda/aws-lambda-rie \
    <container>:latest  \
    /usr/local/bin/python -m awslambdaric app.handler

This was super helpful for testing different memory allocation sizes, because memory would be my main factor in determining the OCR cost per page. Weirdly enough, the local URL you have to make the request to is constant; the date in the path is the Lambda API version, which seems to be from around the time the service was first being developed:

http://localhost:9000/2015-03-31/functions/function/invocations

In the end, anything under a 512MB allocation would lead to an error when initializing whatever models Tesseract was using under the hood. Because we're using a relatively low amount of memory, AWS also gives us a proportionally lower share of the CPU's time. Still, an average of 5 seconds of processing time per document isn't the worst. You might be thinking, "But what if I submit a really big image, won't that make it really slow?" Well, yes, but there is a maximum request size for Lambda functions of around 6MB, so there is a hard upper limit on the size of image you can extract from. Generally I've found good results for images that are at least 2000x2000 pixels, which comes in well under the limit in any decent image format.

API Gateway

Lambda functions only get you 90% of the way to a working API; a Lambda is just a piece of code that runs on a 'trigger', and part of creating a Lambda function is defining what that trigger actually is. In my case, I want my Lambda function to be triggered by an HTTP(S) request, so I need to create a new endpoint from which we can actually request the extraction of a document's information. The only way I'm aware of to do this is through AWS's API Gateway service. To set this up I followed their documentation on setting up an API Gateway with Lambda.

Code

The code for the lambda function was quite simple, only made longer by a lot of error handling I had to put in when debugging things. You can find the full app code in this gist, but the main handler looks like this:

import base64
import io
import json

import pytesseract
from PIL import Image


def handler(event, context):
    # API Gateway delivers the request JSON as a UTF-8 string in event['body']
    try:
        body = json.loads(event['body'])
        image_bytes = body['image'].encode('utf-8')
    except Exception as e:
        return {
            "statusCode": 400,
            "body": json.dumps({
                "error": "Could not get image from request",
                "exception": str(e),
                "event": event
            })
        }
    try:
        img_b64dec = base64.b64decode(image_bytes)
        img_byteIO = io.BytesIO(img_b64dec)
        image = Image.open(img_byteIO)
    except Exception as e:
        return {
            "statusCode": 400,
            "body": json.dumps({
                "error": "Error decoding and opening image",
                "exception": str(e),
                "event": event
            })
        }
    try:
        result = pytesseract.image_to_data(image)
    except Exception as e:
        return {
            "statusCode": 500,
            "body": json.dumps({
                "error": "Error in tesseract",
                "exception": str(e),
                "event": event
            })
        }
    # For an API Gateway proxy integration the handler returns a dict,
    # not a JSON string
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json'},
        'body': json.dumps(tsv2json(result))
    }

Very Important Note: when you're using API Gateway to trigger a Lambda function, any JSON you send to the endpoint gets turned into a UTF-8 string and put in the body field of the event. You might have noticed in the code above that I had to call json.loads(); this was the source of a long debugging session, because I was used to a Flask<->Gunicorn combo where the request data comes in directly as a dictionary. The body is also passed directly in as a dictionary when you're calling the Lambda function directly with the RIE, or when you send in a 'test' from the online Lambda console.
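A small guard makes the handler indifferent to which path the event took: if the body is a string it came through API Gateway, if it's already a dict it was a direct invoke or a console test. This is a sketch of that idea with a helper name of my own choosing:

```python
import json

def extract_body(event):
    """Return the request body as a dict whether the event came through
    API Gateway (body is a JSON string) or a direct invoke (body is a dict)."""
    body = event.get("body", event)
    if isinstance(body, str):
        body = json.loads(body)
    return body

# API Gateway style: body arrives as a UTF-8 JSON string
gw_event = {"body": json.dumps({"image": "abc"})}
# Direct invoke / console test: body is already a dict
direct_event = {"body": {"image": "abc"}}
```

Both shapes normalize to the same dict, so the rest of the handler never has to care which trigger fired it.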

Appendix

Extra Docker Commands

Authenticating Docker With ECR

aws ecr get-login-password --region ca-central-1 | docker login --username AWS --password-stdin <ECR URL>

Tagging and Pushing Your Built Image

docker tag <IMAGE ID> <ECR URL>
docker push <ECR URL>