<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Who Else But Me?]]></title><description><![CDATA[Who Else But Me?]]></description><link>https://blog.whoelsebut.me</link><generator>RSS for Node</generator><lastBuildDate>Tue, 07 Apr 2026 19:45:23 GMT</lastBuildDate><atom:link href="https://blog.whoelsebut.me/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Inexpensive Receipt Repository - OCR]]></title><description><![CDATA[Background
I have a LOT of old documents. In my brief time as a contractor, as part of reporting my taxes, the CRA (Canada's version of the IRS) required that I keep accurate records for any purchases that I wanted to write off as a business expense for 5 ...]]></description><link>https://blog.whoelsebut.me/inexpensive-receipt-repository-ocr</link><guid isPermaLink="true">https://blog.whoelsebut.me/inexpensive-receipt-repository-ocr</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[AWS]]></category><category><![CDATA[aws lambda]]></category><dc:creator><![CDATA[Simmo]]></dc:creator><pubDate>Mon, 21 Feb 2022 23:41:21 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background">Background</h2>
<p>I have a LOT of old documents. In my brief time as a contractor, as part of reporting my taxes, the CRA (Canada's version of the IRS) required that I keep accurate records for any purchases that I wanted to write off as a business expense for 5 years. Knowing this, I was quite neurotic about record keeping; a habit that stuck with me long after I moved on to full-time employment. Fuel receipts, dentist appointments, groceries; you name an expense and I probably have a receipt or some kind of paper record saved for it somewhere in my stash. This led to a rather large stack of various records, both personal and business, piling up in a corner of the room I use as a home office. Keeping so many records means that while I'm generally pretty sure that I HAVE a record of something, finding that particular slip of paper is usually pretty difficult. Digitizing all these records seemed like a fun project, but I was always worried that the CRA wouldn't like it if I couldn't produce the original document. I have only recently seen that the CRA has shifted its <a target="_blank" href="https://www.canada.ca/en/revenue-agency/services/tax/businesses/topics/keeping-records/acceptable-format-imaging-paper-documents-backing-electronic-files.html#mgr">stance on digitization</a>, allowing the use of digital records so long as:</p>
<ul>
<li>It is an accurate reproduction with the intention of it taking the place of the paper document</li>
<li>It gives the same information as the paper document</li>
<li>The significant details of the image are not obscured because of limitations in resolution, tonality, or hue.</li>
</ul>
<p>So, I thought now might be as good a time as any to embark on converting all my old paper records into their digital equivalents. The goal here is to build a personal OCR / Document search engine that I can use securely across devices.</p>
<h3 id="heading-disclaimers">Disclaimers</h3>
<ol>
<li>I have not done much, if any, research into existing solutions for this because I'm approaching it as a personal project more than a business necessity. If it turns out better than expected, I may put some more work into refining the components.</li>
<li>I'm building this on a budget that can only be described as next to nothing. This is mostly to make it more of a challenge; most people could very easily spin up a service like this with a big budget.</li>
</ol>
<h3 id="heading-requirements">Requirements</h3>
<h4 id="heading-security">Security</h4>
<p>These documents can vary in sensitivity from fuel receipts to medical records. For the most part, everything should be encrypted both in transit and at rest. No unencrypted data should ever be stored online, and the only way to get data from device to device is to authorize it via a previously used device. If any documents get leaked it might expose my crippling sugar addiction.</p>
<h4 id="heading-tech-reqs">Tech Reqs</h4>
<p>Documents will be coming through this service infrequently, so it's important to me, both for price and for overall system efficiency, that the service not be running while it's not in use. I also want to be able to test it locally in case of any weird document edge cases where the text being returned is not what I expected. Finally, while I do want an accurate OCR implementation, accuracy is the lowest-priority item right now because even decent-quality OCR is good enough for plain searching through documents if you add some kind of fuzzy matching.</p>
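<p>As a rough illustration of that last point, Python's standard library <code>difflib</code> is already enough for forgiving keyword search over noisy OCR output. This is just a sketch, not part of the service; the document dictionary and function name are made up for the example:</p>
<pre><code class="lang-python">import difflib

def fuzzy_search(query, documents, cutoff=0.6):
    """Return names of documents whose OCR'd text contains a word
    close enough to the query, tolerating OCR misreads."""
    hits = []
    for name, text in documents.items():
        words = text.lower().split()
        # get_close_matches absorbs small errors like '1' read in place of 'i'
        if difflib.get_close_matches(query.lower(), words, n=1, cutoff=cutoff):
            hits.append(name)
    return hits

# Simulated OCR output with a character-level mistake
docs = {
    "fuel.png": "esso fuel rece1pt total 45.20",
    "dentist.png": "dental cleaning invoice",
}
print(fuzzy_search("receipt", docs))  # still finds fuel.png
</code></pre>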
<h3 id="heading-ocr-engine">OCR Engine</h3>
<p>I'm a Python developer by day, and part of my job actually does happen to be interacting with different OCR engines. We've used some self-run solutions like <a target="_blank" href="https://github.com/tesseract-ocr/tesseract">Tesseract</a> and some hosted solutions like <a target="_blank" href="https://aws.amazon.com/textract/">Textract</a> for products where accuracy matters more. For this project I'd love to use Textract, but there's no free tier for it as far as I can see. So, I'll be rolling my own OCR API. As part of my job I did some analysis of different self-hosted OCR solutions, and as far as I could tell Tesseract was the only one with a startup time low enough for a transient service; the other solutions generally relied on a longer startup process that loaded their models into memory. To manipulate the images before sending them into Tesseract I'll be using <code>OpenCV</code>, and for the actual OCR I'll be using the Python Tesseract binding library <code>PyTesseract</code>.</p>
<h2 id="heading-implementation">Implementation</h2>
<h3 id="heading-containers">Containers</h3>
<p>Neither the image-processing library <code>OpenCV</code> nor the OCR library <code>PyTesseract</code> is among the <a target="_blank" href="https://insidelambda.com/">libraries</a> that come in the default Lambda environment. Complicating things even more, the OCR library I'm using is actually a binding around an executable, which means that the executable will need to be available to the function's runtime as well. So, my options were to either upload a zip archive with the code and executables I wanted to run, or make a container that gets pushed to ECR and run the Lambda from there. In the end, I went with the container because it seemed a little bit cleaner and would likely give me more useful experience.</p>
<p>Something that's important to note if you want to tackle building a container that will be run on Lambda: you either need to build off of a Lambda-specific container base image, or, if you want to build off a custom base container, you'll need to install a special "Lambda Runtime Interface Client (RIC)" that lets Lambda call the code in the container. There was precious little information regarding the RIC, so I didn't dig too far into it.</p>
<p>I won't detail the full process of creating the container; there was a lot of swearing and trying to figure out what the Alpine Linux equivalents of specific Pillow / Tesseract build requirements were. In the end, though, the <code>Dockerfile</code> looked like this:</p>
<pre><code>FROM python:3.9-alpine
# Install dependencies
# This should probably be moved to a builder image
RUN apk add --no-cache build-base \
    jpeg-dev \
    zlib-dev \
    tesseract-ocr \
    cmake \
    g++ \
    make \
    unzip \
    curl-dev \
    autoconf \
    automake \
    libtool \
    libexecinfo-dev

RUN pip install awslambdaric --no-cache-dir
RUN pip install pytesseract --no-cache-dir

# Copy in the app function code
COPY app.py /

# Here's where AWS will call the fn
ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]
CMD [ "app.handler" ]
</code></pre><p>All in, our container weighs a total of <code>438MB</code> on the building instance, which for me was a <code>t2.micro</code> that I got for free under the AWS free tier. Once <a target="_blank" href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html">pushed to ECR using docker</a> it appeared as only <code>159.5MB</code>, because ECR reports the compressed size of the image rather than its size on disk.</p>
<h3 id="heading-lambda">Lambda</h3>
<h4 id="heading-lambda-function">Lambda Function</h4>
<p>With the container ready to go, I began testing different functions contained in the <code>app.py</code> file. Including the RIC in your container allows you to test everything locally using a <a target="_blank" href="https://github.com/aws/aws-lambda-python-runtime-interface-client#local-testing"><code>Lambda Runtime Interface Emulator</code></a>. Here's how I was running the emulator during most of my testing:</p>
<pre><code>docker run -v ~/.aws-lambda-rie:/aws-lambda \
    --env AWS_LAMBDA_FUNCTION_MEMORY_SIZE=512 \
    -p 9000:8080 \
    --entrypoint /aws-lambda/aws-lambda-rie \
    &lt;container&gt;:latest \
    /usr/local/bin/python -m awslambdaric app.handler
</code></pre><p>This was super helpful for testing different memory allocation sizes, because memory would be my main factor in determining the OCR cost per page. Weirdly enough, the local URL you have to make the request to is constant, and seems to date from around the time AWS was first developing Lambda:</p>
<pre><code><span class="hljs-attribute">http</span>://localhost:<span class="hljs-number">9000</span>/<span class="hljs-number">2015</span>-<span class="hljs-number">03</span>-<span class="hljs-number">31</span>/functions/function/invocations
</code></pre><p>In the end, anything under a 512MB allocation would lead to an error when initializing whatever models Tesseract was using under the hood. Because we're using a relatively low amount of memory, AWS will also give us a proportionally lower share of the CPU's time. Still, the average of 5 seconds of processing time per document isn't the worst. You might be thinking "But what if I submit a really big image, won't that make it really slow?" Well yes, but there is a max request size for Lambda functions of around 6MB, so there is a hard upper limit on the size of image that you can extract from. Generally I've found good results for images that are at least <code>2000x2000</code>, which comes in well under the limit with any good image format.</p>
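<p>To get a feel for that limit, here's a back-of-envelope check (an illustrative sketch, not code from the service) of how big the JSON request body gets once an image is base64-encoded:</p>
<pre><code class="lang-python">import base64
import json

def request_size_mb(image_bytes):
    """Size in MB of the JSON request body once the image is
    base64-encoded. Base64 inflates the payload by roughly 4/3,
    so the effective image cap is well under 6MB."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    body = json.dumps({"image": encoded})
    return len(body.encode("utf-8")) / (1024 * 1024)

# A 3MB image turns into roughly 4MB of JSON on the wire
print(request_size_mb(bytes(3 * 1024 * 1024)))
</code></pre>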
<h4 id="heading-api-gateway">API Gateway</h4>
<p>Lambda functions only get you 90% of the way to a working API; a Lambda is just a piece of code that gets run on a 'trigger'. Part of the creation of a Lambda function is defining what that trigger actually is. In my case, I want my Lambda function to be triggered on an HTTP(S) request, so I need to create a new endpoint from which we can actually request the extraction of a document's information. The only way that I'm aware of to do this is through AWS' <code>API Gateway</code> service. To set this up I followed <a target="_blank" href="https://docs.aws.amazon.com/lambda/latest/dg/services-apigateway.html">their documentation</a> on setting up an API gateway with Lambda.</p>
<h4 id="heading-code">Code</h4>
<p>The code for the lambda function was quite simple, only made longer by a lot of error handling I had to put in when debugging things. You can find the full app code in <a target="_blank" href="https://gist.github.com/xmaayy/9fa310a2cff1ca832d3c052c148e4a28">this gist</a>, but the main handler looks like this:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handler</span>(<span class="hljs-params">event, context</span>):</span>
    <span class="hljs-keyword">try</span>:
        body = json.loads(event[<span class="hljs-string">'body'</span>])
        image_bytes = body[<span class="hljs-string">'image'</span>].encode(<span class="hljs-string">'utf-8'</span>)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">"error"</span>:<span class="hljs-string">"Could not get image from request"</span>,
            <span class="hljs-string">"exception"</span>:str(e),
            <span class="hljs-string">"event"</span>:event
        }
    <span class="hljs-keyword">try</span>:
        img_b64dec = base64.b64decode(image_bytes)
        img_byteIO = io.BytesIO(img_b64dec)
        image = Image.open(img_byteIO)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">"error"</span>:<span class="hljs-string">"Error decoding and opening image"</span>,
            <span class="hljs-string">"exception"</span>:str(e),
            <span class="hljs-string">"event"</span>:event
        }
    <span class="hljs-keyword">try</span>:
        result = pytesseract.image_to_data(image)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">"error"</span>:<span class="hljs-string">"Error in tesseract"</span>,
            <span class="hljs-string">"exception"</span>:str(e),
            <span class="hljs-string">"event"</span>:event
        }
    <span class="hljs-comment"># API Gateway's proxy integration expects a dict, not a JSON string</span>
    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'statusCode'</span>: <span class="hljs-number">200</span>,
        <span class="hljs-string">'headers'</span>: {<span class="hljs-string">'Content-Type'</span>: <span class="hljs-string">'application/json'</span>},
        <span class="hljs-string">'body'</span>: json.dumps(tsv2json(result))
    }
</code></pre>
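<p>The <code>tsv2json</code> helper lives in the gist; the real TSV from <code>image_to_data</code> has a dozen columns, but a stripped-down version of the idea (hypothetical code, simplified columns) looks like:</p>
<pre><code class="lang-python">import csv
import io

def tsv2json(tsv_text):
    """Turn Tesseract's tab-separated word data into a list of dicts,
    keeping only rows that actually contain recognized text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        if row["text"].strip():
            words.append({
                "text": row["text"],
                "conf": float(row["conf"]),
                "left": int(row["left"]),
                "top": int(row["top"]),
            })
    return words

# Two detected words from a simplified TSV payload
sample = "level\tleft\ttop\tconf\ttext\n5\t10\t12\t96.2\tTotal\n5\t80\t12\t91.0\t45.20\n"
print(tsv2json(sample))
</code></pre>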
<p><strong> Very Important Note </strong> When you're using the API Gateway to trigger a Lambda function, any JSON that you send to the endpoint gets turned into a <code>UTF-8</code> string and put in the <code>body</code> section of the event. You might have noticed in the code above that I had to call <code>json.loads()</code>. This was the source of a long debugging session, because I was used to a Flask&lt;-&gt;gunicorn combo where the request data comes in directly as a dictionary. The event is also passed in as a dictionary when you're calling the Lambda function directly with the RIE, or when you send in a 'test' from the online Lambda console. </p>
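<p>Because of that difference, a tiny helper can make the handler testable in both modes. This is a hypothetical convenience, not code from the gist:</p>
<pre><code class="lang-python">import json

def get_body(event):
    """Normalize the Lambda event: API Gateway delivers the request
    JSON as a string under 'body', while direct invocation and the
    RIE hand you a plain dict."""
    body = event.get("body", event)
    if isinstance(body, str):
        return json.loads(body)
    return body

# API Gateway style: the payload arrives as a JSON string
print(get_body({"body": '{"image": "aGk="}'}))
# Direct / RIE style: the payload is already a dict
print(get_body({"image": "aGk="}))
</code></pre>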
<h2 id="heading-appendix">Appendix</h2>
<h3 id="heading-extra-docker-commands">Extra Docker Commands</h3>
<p><strong> Authenticating Docker With ECR </strong></p>
<pre><code class="lang-bash">aws ecr get-login-password --region ca-central-1 | docker login --username AWS --password-stdin &lt;ECR URL&gt;
</code></pre>
<p><strong> Tagging and Pushing Your Built Image </strong></p>
<pre><code class="lang-bash">docker tag &lt;IMAGE ID&gt; &lt;ECR URL&gt;
docker push &lt;ECR URL&gt;
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Holy Crap, Systems Are Complicated]]></title><description><![CDATA[The problem
I've stumbled upon what I think could be a fun small business to run; essentially it's taking care of reminding people to send documents. I like the idea because it's something that I can very likely have ready to test with customers within...]]></description><link>https://blog.whoelsebut.me/holy-crap-systems-are-complicated</link><guid isPermaLink="true">https://blog.whoelsebut.me/holy-crap-systems-are-complicated</guid><category><![CDATA[AWS]]></category><category><![CDATA[Cloud]]></category><category><![CDATA[System Architecture]]></category><category><![CDATA[server hosting]]></category><category><![CDATA[hosting]]></category><dc:creator><![CDATA[Simmo]]></dc:creator><pubDate>Thu, 06 Jan 2022 14:00:16 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1641475035477/1tsLocksxd.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-the-problem">The problem</h1>
<p>I've stumbled upon what I think could be a fun small business to run; essentially it's taking care of reminding people to send documents. I like the idea because it's something that I can very likely have ready to test with customers within probably 1-2 weeks. With such a quick time to MVP, I can test it without being too attached to the outcome. The only problem is, I don't have much experience building a full system like this; it's been mostly CRUD web apps and backend services that don't need to deal with pesky user authentication thus far. My engineering degree is finally coming into use, because I need to know about security methodologies (RBAC, UBAC, etc.) and the very basics of systems design. </p>
<p>The very basics of what I'll need to get an MVP going are:</p>
<ul>
<li>Front End (ofc, this will take the most dev time from me)</li>
<li>User management </li>
<li>Authentication</li>
<li>Object / Blob Storage</li>
<li>A database for User &lt;=&gt; Object  mapping </li>
</ul>
<p>I drew up a little diagram for how I imagine the whole backend will work:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1641475035477/1tsLocksxd.png" alt="Untitled.png" /></p>
<p>This all needs to be hosted in Canada, because the target market is privacy-regulation conscious.</p>
<p>Now there is a decision to be made though -- do I go with a cloud provider for everything, or do I manage everything myself? We're going to assume 100 users for the first year, each user with an average of 5GB (some will have 10GB+, some will have &lt;1GB. It's text documents, so you'd have to be massive to hit 10GB). So a total of 500GB in storage. Additionally, let's assume an average of 100GB/month of traffic.</p>
<h2 id="heading-cloud">Cloud</h2>
<p>Everything here is going to be priced as if I was buying it from AWS, because that is likely where I would be buying from if I did end up actually using a cloud provider. The front end could be hosted on a nano instance, or just on Vercel for free. It can't be statically hosted because it's a web app that will be fetching things from the BE and populating pages, and will need some API routes of its own to avoid CORS, etc. I'll price in about 3$ for that. User management and authentication would be done with AWS Cognito, and with &lt;50k users it's free (though 5 cents per user above that, so it could end up at 5$ 😱). We'd need a database as well, and just looking at the pricing for any service with RDS in its name makes me sick (the lowest MySQL instance is 300$/month), so we're going to integrate the database with the backend server. Because the back end will be performing its normal functions plus the database, I went with 2 vCPUs and 8GiB of RAM. Without a reservation (I don't want to reserve without knowing that I'll continue the project) it comes to 57$ per month. Lastly, if we do some very approximate averaging of the amount of data stored and transferred with S3, we get around 13$/month. So the total charges for AWS end up at 🥁🥁🥁🥁🥁 :</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1641476398081/G6RnIrcz4.png" alt="image.png" />
<a target="_blank" href="https://calculator.aws/#/estimate?nc2=h_ql_pr_calc&amp;id=594370afec94ce75a56bc51f11ab64a9eba60b62">calculator</a>
The largest part of this is EC2 at 57$/month. </p>
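<p>Summing my rough figures (these are the ballpark numbers from above, not an official quote):</p>
<pre><code class="lang-python"># Ballpark monthly AWS estimate, in dollars, from the figures above
costs = {
    "front end (nano instance / Vercel)": 3,
    "Cognito (under 50k users)": 0,
    "EC2 backend plus database": 57,
    "S3 storage and transfer": 13,
}
total = sum(costs.values())
print(f"~{total}$/month")  # EC2 dominates the bill
</code></pre>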
<h2 id="heading-dedicated-virtual-server-self-managed">Dedicated / Virtual Server / Self-Managed</h2>
<p>I'd end up running K3s or something and hosting the services all on one machine. So I'll upgrade to a 4vCPU machine, which ends up at 30$ CAD per month. For block storage, assuming 300GiB of in/out/storage per month (which is just about the busiest month I can imagine this product having), it'll be 8$ CAD. So in total everything would be about half of the AWS bill at 38$/month, but I have to manage it myself.</p>
<h2 id="heading-third-hidden-option-self-host">Third Hidden Option -- Self-host</h2>
<p>I have a 16-thread, 24GiB-RAM machine collecting dust at home. I used to be an avid gamer, but sold my GPU in the hot market we have now because it was worth 200$ more than I paid for it 2 years ago. There's also approximately 20TiB worth of HDD sitting around in it as well, because I also have a lot of.... Linux ISOs. So if I use that (which would probably run me in the neighbourhood of 300$/month if I rented it from someone else in Canada) I can host all of this for ~ FREEEEEEEE ~. There is of course the possibility of my power going out, but I live close enough to the city core that I could probably offer a 99.9% SLA and not worry about it.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>I'll likely host everything for free on my own hardware until the idea is validated. If I can get 1-2 paying customers before the end of February, I'll upgrade to a virtual server from OVH. By then I'll know the real processing power requirements as well! Finally, if I can get a decent customer base, I'll switch to AWS so that I can do multi-region more easily.</p>
<p>Thanks for reading</p>
]]></content:encoded></item><item><title><![CDATA[Optimizing Against You]]></title><description><![CDATA[As it stands, social media companies seem to be acting like the cigarette companies of the 21st century. Millions in advertising trying to get as many people as possible to join the app they know to be harmful. While it's a company's responsibility t...]]></description><link>https://blog.whoelsebut.me/optimizing-against-you</link><guid isPermaLink="true">https://blog.whoelsebut.me/optimizing-against-you</guid><category><![CDATA[optimization]]></category><category><![CDATA[algorithms]]></category><category><![CDATA[social media]]></category><dc:creator><![CDATA[Simmo]]></dc:creator><pubDate>Tue, 04 Jan 2022 12:45:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1641299989333/mSr-0lyQT.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As it stands, social media companies seem to be acting like the cigarette companies of the 21st century. Millions in advertising trying to get as many people as possible to join the app they know to be harmful. While it's a company's responsibility to take care of its investors, who is responsible for making sure this is not at the expense of its users? I consider myself to be pretty cognizant of the addictive algorithms in use by corporations, and still got caught up in one unknowingly.</p>
<h2 id="heading-my-experience">My Experience</h2>
<p>When Canada's lockdown started and leaving the safety of your house felt like putting your life on the line, I started to go back through all the games I'd missed since I had quit before university. I played just about every type of game there is before finally landing on a game called Apex Legends. The most important thing to note about Apex Legends in this context is that it is free to play, the only method of supporting the game's continued development being cash shop cosmetic items and a once-a-month 'battle-pass' that allows you to unlock unique cosmetics by playing more.</p>
<p>Unfortunately, I didn't often have friends to play with, so I would play with other random teammates. This was fine while I was learning, but over time the inconsistency of my teammates' skill levels started to wear on me. Even in the Ranked mode, where you're supposed to be matched with people of equal skill levels, I found myself being matched with other players that were 5-6 ranks off of my own. When I posted screenshots of this happening online, it seemed like many other players were experiencing this weird matchmaking as well and were similarly displeased. Eventually, the time cost of playing the game was too much, and I had to quit. I was curious though: the strange matchmaking wasn't a mistake; they were intentionally matching players of unequal rank against each other. Why was it done this way when it seems to cause so much anger in the community? As it turns out, it seems to be another case of a "free" product taking advantage of its users.</p>
<h3 id="heading-skill-based-matchmaking-sbmm">Skill Based Matchmaking (SBMM)</h3>
<p>When you want to play something competitively, you likely want to be matched up with people that are within a similar skill bracket. Generally, you'd want a player that is a little bit above your rank to help you improve, or a player that is a little bit below your level to help you practice. The first popular system for ranking players in this way was created by a physics professor named Arpad Elo around 1960, initially to be used for ranking chess players.</p>
<p>The Elo rating measured the relative strength of a player in chess compared to other players in the league. Your ranking is inferred from your opponents and the results of the games you've had against them. Many modern rating systems in online games will, at their core, have a similar idea behind their ranking system: there is a variable assigned to every player, let's just call it Elo, that represents how skilled the player is in reference to the population. During regular play, players should generally be matched against those who have a similar ranking. When you win, your Elo goes up by an amount relative to the ranking of your opponent; when you lose, it goes down the same way. The bigger the disparity, the larger the change. So, over time, you should see your rank stabilize somewhere around your true current skill level. For most, this feels like a fair way to match players, but is it really the 'best'?</p>
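<p>The classic update rule fits in a few lines. This is illustrative code using the usual chess constants (a 400-point scale and K=32), not any specific game's implementation:</p>
<pre><code class="lang-python">def elo_update(ra, rb, score_a, k=32):
    """One Elo update for player A: score_a is 1 for a win,
    0.5 for a draw, 0 for a loss. The rating moves further
    when the result is more surprising."""
    expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
    return ra + k * (score_a - expected_a)

# An upset: the 1400 player beats the 1600 player and gains a lot
print(round(elo_update(1400, 1600, 1)))
# The favourite wins and gains only a little
print(round(elo_update(1600, 1400, 1)))
</code></pre>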
<h3 id="heading-engagement-optimized-matchmaking-eomm">Engagement Optimized Matchmaking (EOMM)</h3>
<p>In 2016 and 2017, EA filed patents for Dynamic Difficulty Adjustment (DDA) and Engagement Optimized Match Making (EOMM). Shortly thereafter, EA worked with a professor at UCLA to write and publish a paper that aimed to show the benefits of EOMM when compared with other matchmaking methods. The EOMM system they described is designed to try and learn enough about your playing habits that it can keep you engaged with the game as long as possible. The patent for it says "The longer a user is engaged with the software, the more likely that the software will be successful". They didn't give their definition of success in the patent, though in the paper they do say that the objective of an EOMM system can be optimized for both in-game time as well as real-money spending.</p>
<p>So, how do you keep your users engaged with the game for as long as possible? EA concluded that the best way to keep players engaged is to vary the difficulty of your game on the fly. It can do this through the use of what the patent calls knobs; controllable game parameters that will affect your player's experienced difficulty. The choice of knobs is one of the more important factors in DDA, because it needs to be something that will go unnoticed by the user. The patent uses the example of a race car — if you adjust the max speed of the car based on whether the user is winning or losing, that's going to be a very jarring experience for everyone involved. This limits the scope of what we can adjust in an online game; because most of the entities you engage with are other human players, it would be unfair to change an opponent's stats to affect the outcome of a duel. So, EA concluded that the fairest way to change the difficulty of a match is to vary the skill level of the players themselves.</p>
<p>But how does the system know who should be matched up with whom? Using machine learning, the game's operator can continually monitor each player's gaming habits and, after a certain threshold of time or matches, try to match you into a group of other similar players, called a cluster. Your habits can include anything from how often you quit after a win or loss, to when you spend money, to how quickly you start another match. The cluster definitions, i.e. the approximate description of everyone in a cluster, and your assignment to that cluster won't stay the same over time. The churn risk between you and a potential opponent is calculated based on your habits as well as the cluster details of you and your opponent(s). The ideal set of matches is determined by minimizing the total churn risk across all possible matches that can be made.</p>
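<p>That last step, minimizing total churn risk across all possible matches, can be sketched as a toy optimization. Everything here is made up for illustration; a real system would score millions of players with a learned model rather than a lookup table:</p>
<pre><code class="lang-python">def best_pairing(players, churn_risk):
    """Brute-force the two-match pairing of four players that
    minimizes total predicted churn risk. churn_risk(a, b) stands
    in for a model trained on each player's habits."""
    first, rest = players[0], players[1:]
    candidates = []
    for partner in rest:
        others = [p for p in rest if p != partner]
        total = churn_risk(first, partner) + churn_risk(others[0], others[1])
        candidates.append((total, [(first, partner), (others[0], others[1])]))
    return min(candidates)[1]

# Hypothetical model output: chance a match makes either player quit
risk = {
    frozenset(["p1", "p2"]): 0.10, frozenset(["p1", "p3"]): 0.40,
    frozenset(["p1", "p4"]): 0.30, frozenset(["p2", "p3"]): 0.25,
    frozenset(["p2", "p4"]): 0.35, frozenset(["p3", "p4"]): 0.15,
}
matches = best_pairing(["p1", "p2", "p3", "p4"], lambda a, b: risk[frozenset([a, b])])
print(matches)  # the pairing with the lowest combined churn risk
</code></pre>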
<p>In its paper, EA concludes, based on a simulation, that for games with a significantly large player population, EOMM will at least match, if not outright beat, the churn-avoiding performance of all other matchmaking methods by around 1% per game, which, compounded over a whole play session, ends up increasing retention by 10-15%! This was taken as conclusive proof that their EOMM system was the best for player engagement, but does that necessarily mean that it's the best for the players themselves? How does this affect the mentality of a player when the outcome of their matches seems to follow no obvious pattern?</p>
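<p>The compounding claim is just exponent arithmetic: a 1% per-game edge over a session of 10-14 games lands right in the quoted 10-15% range.</p>
<pre><code class="lang-python"># A 1% per-game retention edge compounds over a session of n games
for games in (10, 14):
    gain = (1.01 ** games - 1) * 100
    print(f"{games} games: {gain:.1f}% better retention")
</code></pre>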
<h2 id="heading-terms-of-engagement">Terms of Engagement</h2>
<p>Engagement Optimized Matchmaking is just one of many examples of time-optimizing algorithms in use on the web today. Many of the world's most popular websites (YouTube, Facebook, TikTok) have algorithms whose sole purpose is to make sure you spend as much time on their platform as possible. Eugene Wei has done an amazing series of articles on his blog about how the TikTok algorithm sees and interacts with the users of the platform. In it, he describes how quickly the algorithm can lock on to your particular content preferences, serving you up content that even you didn't know you'd enjoy. What happens when an algorithm gets to know your habits and weak points even better than you know yourself? Anecdotally, it means that people report long sessions with the platform without really noticing how much time has passed. What makes these apps even more dangerous is that for every hour you spend consuming media, at least another hour of content that the algorithm can recommend to you has been uploaded. You can never reach the end of your infinitely scrolling timeline.</p>
<p>What I want to ask is: <strong> Who has the responsibility of taking care of the users of these platforms? </strong> The companies are concerned mainly with their responsibility to their shareholders, and rightly so. In the past, when an industry has been created around something harmful (tobacco, alcohol, etc) there was government intervention in the form of heavy regulation. It seems now, though, that the problems we face from social media and the systems that create them are so complex that trying to regulate the allowable use of these algorithms in industry will be akin to playing "Whack-a-mole".</p>
<p>I think, in the end, it will come down to consumer education; it's the responsibility of those using these platforms to understand how they're being taken advantage of. But first, that information needs to be made public. We need to study how these algorithms work and make public the information about the effects they have on their users. We should implement a warning similar to what you see on North American cigarette boxes or South American sweets, so people know ahead of time which apps to watch out for.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1641300149741/piCQMN3NH.png" alt="TikTokWarning.png" /></p>
<h2 id="heading-tinfoil-hat">Tinfoil Hat</h2>
<p>Some people talk about 'the singularity'; a point at which AI becomes smarter than humans and we are immediately enslaved by the superior brainpower of an artificial master. What if, instead of all at once, it was a gradual erosion of our free will as we find ourselves under the control of algorithms meant to optimize how we act? It seems to me that these social media algorithms are already kind of doing that: they're taking control of how we spend our time by understanding us better than we know ourselves.</p>
]]></content:encoded></item></channel></rss>