
Building a personal one-time file-sharing service with AWS CDK and Python

It happens every now and then that I want to share a file which is big enough, or sensitive enough, that I cannot or am not happy to upload it to a free public service.

When this occurs I start considering building a private version of a file-sharing service, similar to one of the many free services out there, such as WeTransfer or transfer.sh.

From a user experience standpoint, I would enjoy a simple HTTP API for uploading a file and getting a shareable link in return.

As a bonus, I would love those links to self-destruct once the file has been successfully shared.

I’ve been working with the AWS Cloud Development Kit (CDK) lately, a wonderful infrastructure-as-code toolkit that makes you really productive in building and deploying AWS solutions.

With CDK in your toolbelt, a project of this complexity becomes a nice fit for a “weekend project”, one that could double as training for me and as a source of inspiration for my fellow teammates.

So I did it.

Design rationale

The main idea stressed here is to have the entire service run on a purely on-demand model: no component should sit idle while you’re not actually using the service, either uploading or downloading a file.

As always, there are many different ways to implement this kind of service, and whenever I faced a trade-off about correctness or edge-case handling, I favoured the “pragmatic hack” resulting in a cheaper solution.

I constrained myself to ship it within the natural deadline of a weekend. This both demonstrates how productive IaC can be with CDK and means my code clearly will not aim to be a masterpiece. So, no refunds.

Solution breakdown

[Figure: “once” architecture diagram]

On AWS we can use the following resources:

  • An S3 bucket to host the uploaded files
  • A Lambda function to implement the upload handler
  • Another Lambda function to implement a “smart” download handler, which deletes the file after the very first successful transfer
  • A DynamoDB table to store information about the entries, acting as the “long-term memory” of our system
  • API Gateway to automatically expose our Lambda functions as HTTP APIs
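
Before diving into the individual pieces, here is a rough CDK sketch of how the storage and the two handler functions could be wired together (construct names and asset paths are illustrative rather than the project’s exact code, and the API Gateway routes are omitted for brevity):

import os

from aws_cdk import core
from aws_cdk import aws_dynamodb as dynamodb
from aws_cdk import aws_lambda as lambda_
from aws_cdk import aws_s3 as s3


class OnceStack(core.Stack):
    def __init__(self, scope: core.Construct, id: str, **kwargs):
        super().__init__(scope, id, **kwargs)

        # Storage: the uploaded files and the bookkeeping table
        files_bucket = s3.Bucket(self, 'files-bucket')
        files_table = dynamodb.Table(self, 'files-table',
            partition_key=dynamodb.Attribute(
                name='id', type=dynamodb.AttributeType.STRING))

        # Upload handler: issues the pre-signed POST "tickets"
        upload_function = lambda_.Function(self, 'upload-function',
            runtime=lambda_.Runtime.PYTHON_3_7,
            code=lambda_.Code.from_asset(os.path.join(BASE_PATH, 'upload-handler')),
            handler='handler.on_event',
            environment={
                'FILES_BUCKET': files_bucket.bucket_name,
                'FILES_TABLE_NAME': files_table.table_name
            })
        files_bucket.grant_put(upload_function)
        files_table.grant_write_data(upload_function)

        # Download handler: redirects to a pre-signed GET and marks the entry
        download_function = lambda_.Function(self, 'download-function',
            runtime=lambda_.Runtime.PYTHON_3_7,
            code=lambda_.Code.from_asset(os.path.join(BASE_PATH, 'download-handler')),
            handler='handler.on_event',
            environment={
                'FILES_BUCKET': files_bucket.bucket_name,
                'FILES_TABLE_NAME': files_table.table_name
            })
        files_bucket.grant_read(download_function)
        files_table.grant_read_write_data(download_function)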

The upload “ticket”

The silliest thing you can do when handling S3 uploads is trying to manage the file transfer yourself. There are many good reasons not to handle it directly within a Lambda function, such as memory and execution time limits that would seriously constrain the maximum size of your uploads. Just don’t.

The suggested solution is also well known and it’s based on S3 pre-signed URLs.

I often find it useful to explain the concept behind pre-signed URLs using the “ticket” analogy. You basically expose an API that simply verifies a client is authorised to perform a specific upload operation, and returns a one-time ticket which can be immediately spent in a subsequent HTTP request to upload the file directly to the target S3 bucket.

The ticket comes in the form of a URL and a dictionary of request parameters to pass along with the file you want to upload. S3 verifies these tickets and rejects them once they expire, just like a ticket for a movie or a concert.

This way our function completely offloads the heavy lifting to S3, and that should make you as happy about it as I am.

To be fair, this approach adds extra complexity on the client side, since we now have to perform two requests for each file upload, but I consider this perfectly fine in this context, and I will address it by writing a smart client to upload files.
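
Just to make the flow concrete, here is a minimal sketch of those two requests from the client’s point of view, assuming the upload API replies with the shareable once_url and a presigned_post payload (a URL plus a dict of form fields), as the handler sketched below does; the endpoint address is a placeholder and the HMAC signing described later is omitted:

import requests

# 1. Ask the service for an upload "ticket" (endpoint is a placeholder)
ticket = requests.get('https://once.example.com/',
                      params={'f': 'report.pdf'}).json()

# 2. Spend the ticket: POST the file straight to S3 using the
#    pre-signed URL and the form fields contained in the ticket
post = ticket['presigned_post']
with open('report.pdf', 'rb') as f:
    upload = requests.post(post['url'], data=post['fields'],
                           files={'file': ('report.pdf', f)})
upload.raise_for_status()

print('Share this link:', ticket['once_url'])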

The Python code for the upload handler can be easily sketched:

import random
import string
from urllib.parse import quote

import boto3

# 'filename' comes from the request parameters, 'response' is the dict returned
# to the client; FILES_TABLE_NAME, FILES_BUCKET, APP_URL and EXPIRATION_TIMEOUT
# are configuration values read from the environment.

# Generate a random key
domain = string.ascii_uppercase + string.ascii_lowercase + string.digits
entry_id = ''.join(random.choice(domain) for _ in range(6))
object_name = f'{entry_id}/{filename}'
response['once_url'] = f'{APP_URL}{entry_id}/{quote(filename)}'

# Add an entry into the DynamoDB table
dynamodb = boto3.client('dynamodb')
dynamodb.put_item(
    TableName=FILES_TABLE_NAME,
    Item={
        'id': {'S': entry_id},
        'object_name': {'S': object_name}
    })

# Generate and return a pre-signed POST for the file
response['presigned_post'] = create_presigned_post(
    bucket_name=FILES_BUCKET,
    object_name=object_name,
    expiration=EXPIRATION_TIMEOUT)
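
The create_presigned_post helper isn’t shown above; as a rough sketch (not necessarily the project’s actual implementation), it can be a thin wrapper around boto3’s generate_presigned_post:

import boto3
from botocore.exceptions import ClientError


def create_presigned_post(bucket_name, object_name, expiration=3600):
    """Return the URL and form fields for a pre-signed S3 POST upload."""
    s3 = boto3.client('s3')
    try:
        # The returned dict has the shape {'url': ..., 'fields': {...}}
        return s3.generate_presigned_post(
            Bucket=bucket_name,
            Key=object_name,
            ExpiresIn=expiration)
    except ClientError:
        return None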

Downloading the file

For the Lambda function that handles downloads, we can use a similar approach, performing the following steps:

  1. Check whether the URL matches an entry in the DynamoDB table
  2. Generate a pre-signed URL for the corresponding S3 key
  3. Redirect the client with an HTTP 301 response (I ❤️ HTTP)

import os
import urllib.parse

import boto3

# Configuration read from environment variables set by the CDK stack
FILES_BUCKET = os.environ['FILES_BUCKET']
FILES_TABLE_NAME = os.environ['FILES_TABLE_NAME']
# Lifetime of the generated download link, in seconds (default here is an assumption)
PRESIGNED_URL_EXPIRES_IN = int(os.environ.get('PRESIGNED_URL_EXPIRES_IN', '300'))


def on_event(event, context):
    entry_id = event['pathParameters']['entry_id']
    filename = urllib.parse.unquote_plus(event['pathParameters']['filename'])
    object_name = f'{entry_id}/{filename}'

    dynamodb = boto3.client('dynamodb')
    entry = dynamodb.get_item(
        TableName=FILES_TABLE_NAME,
        Key={'id': {'S': entry_id}})

    if 'Item' not in entry or 'deleted' in entry['Item']:
        return {'statusCode': 404, 'body': f'Entry not found: {object_name}'}

    s3 = boto3.client('s3')
    download_url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': FILES_BUCKET, 'Key': object_name},
        ExpiresIn=PRESIGNED_URL_EXPIRES_IN)

    dynamodb.update_item(
        TableName=FILES_TABLE_NAME,
        Key={'id': {'S': entry_id}},
        UpdateExpression='SET deleted = :deleted',
        ExpressionAttributeValues={':deleted': {'BOOL': True}})

    return {
        'statusCode': 301,
        'headers': {
            'Location': download_url
        }
    }

As you may have noticed, this function doesn’t actually delete the file, as doing so would invalidate the download operation, which itself takes some time to complete. The function just marks the DynamoDB entry as deleted, procrastinating on the actual file disposal.

If you type or paste a link into a messaging service (Slack, Messenger, WhatsApp, Telegram and friends), it will visit the link, performing an actual HTTP GET request. There are legitimate reasons for that, like verifying link safety or simply building a nice preview.

The downside of this is that any link shared on these services would be automatically invalidated before the intended recipient actually gets the chance to download the file.

To prevent this from happening I’m employing another high-tech and innovative technique: user agent filtering 😂!

import logging
import re

log = logging.getLogger()
log.setLevel(logging.INFO)

MASKED_USER_AGENTS = [
    '^Facebook.*',
    '^Google.*',
    '^Instagram.*',
    '^LinkedIn.*',
    '^Outlook.*',
    '^Reddit.*',
    '^Slack.*',
    '^Skype.*',
    '^SnapChat.*',
    '^Telegram.*',
    '^Twitter.*',
    '^WhatsApp.*']

user_agent = event['headers'].get('user-agent', '')
is_masked_agent = any(re.match(agent, user_agent) for agent in MASKED_USER_AGENTS)
if is_masked_agent:
    log.info('Serving possible link preview. Download prevented.')
    return {
        'statusCode': 200,
        'headers': {}
    }

You get the idea: if the user agent matches one of these simple regular expressions, we don’t invalidate our entry and return a successful but empty response instead. Fair enough, right?

Scheduling a lambda function to delete the files marked as “served”

We are able to create links and have them automatically invalidated on the very first download. Neat! But files still accumulate in the S3 bucket, and we want to remove them once in a while using another Lambda function.

And this is a very good excuse to use EventBridge and another useful serverless pattern: schedule a lambda function to run periodically.

The CDK code for implementing this is straightforward:

cleanup_function = lambda_.Function(self, 'delete-served-files-function',
    function_name='once-delete-served-files',
    description='Deletes files from S3 once they have been marked as deleted in DynamoDB',
    runtime=lambda_.Runtime.PYTHON_3_7,
    code=lambda_.Code.from_asset(os.path.join(BASE_PATH, 'delete-served-files')),
    handler='handler.on_event',
    log_retention=LOG_RETENTION,
    environment={
        'FILES_BUCKET': files_bucket.bucket_name,
        'FILES_TABLE_NAME': files_table.table_name
    })

files_bucket.grant_delete(cleanup_function)
files_table.grant_read_write_data(cleanup_function)

events.Rule(self, 'once-delete-served-files-rule',
    schedule=events.Schedule.rate(core.Duration.hours(24)),
    targets=[targets.LambdaFunction(cleanup_function)])

The function will be automatically triggered every day and will delete any file whose entry in the DynamoDB table is marked as deleted.
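
The handler behind this function isn’t shown here; a minimal sketch of what it could look like, assuming the same attribute names used by the download handler above (pagination and error handling omitted):

import os

import boto3

FILES_BUCKET = os.environ['FILES_BUCKET']
FILES_TABLE_NAME = os.environ['FILES_TABLE_NAME']


def on_event(event, context):
    dynamodb = boto3.client('dynamodb')
    s3 = boto3.client('s3')

    # Find the entries that the download handler flagged as deleted
    entries = dynamodb.scan(
        TableName=FILES_TABLE_NAME,
        FilterExpression='deleted = :deleted',
        ExpressionAttributeValues={':deleted': {'BOOL': True}})

    for item in entries.get('Items', []):
        # Remove the file from S3, then drop the bookkeeping entry
        s3.delete_object(Bucket=FILES_BUCKET, Key=item['object_name']['S'])
        dynamodb.delete_item(
            TableName=FILES_TABLE_NAME,
            Key={'id': item['id']})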

Securing the upload API

There are many different ways to secure an API, and this specific one is meant to be used by a single tenant (that’s why I called this a personal service). So I rolled out a simple ad-hoc authentication scheme, based on HMAC and a symmetric key, that works as follows.

The client will calculate a signature before GETting the upload ticket, according to the following steps:

  1. builds a URL which includes the filename and the current timestamp, according to a pre-defined format (and timezone)
  2. computes an HMAC of this string using SHA-256 and a secret key
  3. sends the computed signature (conveniently encoded in base64) along with the request, using a custom HTTP header

import os
import base64
import hashlib
import hmac
from datetime import datetime
from urllib.parse import quote_plus

import requests


ONCE_SIGNATURE_HEADER = 'x-once-signature'
ONCE_TIMESTAMP_FORMAT = '%Y%m%d%H%M%S%f'

# The shared secret and the upload endpoint come from the local configuration
# ('ONCE_API_URL' is an illustrative name); 'file' is the open file being uploaded
secret_key = base64.b64decode(os.getenv('ONCE_SECRET_KEY'))
api_url = os.getenv('ONCE_API_URL')

# Prepare the request first, so we sign exactly what will be sent
req = requests.Request(method='GET', url=api_url, params={
        'f': quote_plus(os.path.basename(file.name)),
        't': datetime.utcnow().strftime(ONCE_TIMESTAMP_FORMAT)
    }).prepare()

# Sign the path and query string with HMAC-SHA256
plain_text = req.path_url.encode('utf-8')
hmac_obj = hmac.new(secret_key, msg=plain_text, digestmod=hashlib.sha256)

req.headers[ONCE_SIGNATURE_HEADER] = base64.b64encode(hmac_obj.digest()).decode('ascii')
response = requests.Session().send(req)
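
The upload handler, in turn, has to verify this header. Here is a minimal sketch of how it could do so, assuming the API Gateway HTTP API v2 event payload and the key being shared with the Lambda function through an environment variable (the variable name mirrors the client’s and is an assumption):

import base64
import hashlib
import hmac
import os

SECRET_KEY = base64.b64decode(os.environ['ONCE_SECRET_KEY'])


def is_signature_valid(event):
    # Rebuild exactly what the client signed: the path plus the query string
    path = event['rawPath']
    query = event.get('rawQueryString', '')
    plain_text = (f'{path}?{query}' if query else path).encode('utf-8')

    expected = hmac.new(SECRET_KEY, msg=plain_text, digestmod=hashlib.sha256).digest()
    received = base64.b64decode(event['headers'].get('x-once-signature', ''))

    # compare_digest avoids leaking information through timing differences;
    # a real handler would also check that the 't' timestamp is recent
    return hmac.compare_digest(expected, received)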

Obviously the key can be generated randomly and then written into some local configuration file.
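
For instance, a one-liner like this produces a 256-bit key in the base64 form the client expects:

import base64
import secrets

# Print a fresh random key, ready to be pasted into the configuration
print(base64.b64encode(secrets.token_bytes(32)).decode('ascii'))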

Vanity URLs with my custom domain

OK, we now have a working version that can easily be deployed on any AWS account by just running

cdk deploy

And it’s something that can be used as a simple example of a serverless architecture for beginners. But what is the point of building a personal service without using vanity URLs?

I was waiting for the right occasion to use my freshly registered domain (failing.app) to expose the service.

So I created another CDK stack to configure a custom domain within a hosted zone, a TLS certificate and the required API Gateway configuration.

class CustomDomainStack(cfn.NestedStack):
    def __init__(self, scope: core.Construct, id: str,
        hosted_zone_id: str,
        hosted_zone_name: str,
        domain_name: str,
        api: apigw.HttpApi):
        super().__init__(scope, id)

        hosted_zone = route53.HostedZone.from_hosted_zone_attributes(self, id='dns-hosted-zone',
            hosted_zone_id=hosted_zone_id,
            zone_name=hosted_zone_name)

        certificate = certmgr.DnsValidatedCertificate(self, 'tls-certificate',
            domain_name=domain_name,
            hosted_zone=hosted_zone,
            validation_method=certmgr.ValidationMethod.DNS)

        custom_domain = apigw.CfnDomainName(self, 'custom-domain',
            domain_name=domain_name,
            domain_name_configurations=[
                apigw.CfnDomainName.DomainNameConfigurationProperty(
                    certificate_arn=certificate.certificate_arn)])

        custom_domain.node.add_dependency(api)
        custom_domain.node.add_dependency(certificate)

        api_mapping = apigw.CfnApiMapping(self, 'custom-domain-mapping',
            api_id=api.http_api_id,
            domain_name=domain_name,
            stage='$default')

        api_mapping.node.add_dependency(custom_domain)

        route53.ARecord(self, 'custom-domain-record',
            target=route53.RecordTarget.from_alias(ApiGatewayV2Domain(custom_domain)),
            zone=hosted_zone,
            record_name=domain_name)

So, if you have a hosted zone on Route53, you can set a couple of environment variables and have the required resources hygienically deployed as a nested CloudFormation stack.
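
For example, the parent stack could instantiate it only when those variables are set (the variable names below are placeholders rather than the ones used in the repository, and http_api stands for the HttpApi construct created earlier):

import os

# Deploy the vanity-domain resources only when the DNS details are provided
if os.getenv('ONCE_HOSTED_ZONE_ID'):
    CustomDomainStack(self, 'custom-domain-stack',
        hosted_zone_id=os.environ['ONCE_HOSTED_ZONE_ID'],
        hosted_zone_name=os.environ['ONCE_HOSTED_ZONE_NAME'],
        domain_name=os.environ['ONCE_DOMAIN_NAME'],
        api=http_api)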

Final thoughts

Running this type of personal service on AWS basically for free is a wonderful thing.

Even though I had to tackle a couple of small issues with some missing L1 constructs, the experience with CDK has been a breeze so far.

Although almost nothing about this project is perfect (the code, my awful English), I enjoyed building it and sharing it here.

If you’re interested in building modern services or just want to learn more about how things work “in the cloud”, this is exactly what I encourage you to do. If you’re not doing it yet, get a free AWS account and start building your own thing: it has never been so easy and accessible.

I’ve released the full source code of this project under an open-source license, so it can be freely used and distributed.

Get it on GitHub.