
Challenges in the challenge: CTFs, K8S, SSRF and the cloud

Lessons learnt from 404 CTF 2023

2023/11/29

404 CTF logo

Hosting CTF challenges is never easy, and hosting them reliably and at scale is a challenge in itself. In June 2023, HackademINT held the second edition of the 404 CTF with its partners, and Soremo and I were in charge of overseeing the configuration of the infrastructure hosted on OVHcloud's public cloud services. This included the deployment of around 50 CTF challenges. We are quite proud of the result: the CTF was a real success, and the infrastructure was very reliable overall, suffering only about three minutes of downtime during the month-long CTF. Some specific challenges had to be taken down temporarily from time to time, but that came down to issues in the challenges themselves, not in the infrastructure.

Still, we did suffer one minor security incident that turned out to be very instructive for me, and I want to address it in this blog post, hoping it may be useful to others.

The infrastructure

For HackademINT, hosting CTFs is all about hosting many small services with high availability, and Kubernetes has been our preferred technology to do so for some years now. We use it internally to host our own services, as well as occasional events at Télécom SudParis like Hacktion. We also used it in 2022 to host the first edition of the 404 CTF. So naturally, we stuck with it in 2023 (as we say in French, on ne change pas une équipe qui gagne: you don't change a winning team), and used OVHcloud's managed Kubernetes solution.

Kubernetes nevertheless has its downsides: it cannot host kernel exploitation challenges out of the box, for example. In fact, since it relies on containers, any challenge that requires full-on virtualization is out of the question, unless we start looking at additional solutions like Kata Containers. Another downside is that Kubernetes is best suited for stateless services, which is something challenges have to account for.

Apart from the challenges, we also had to host the main CTF platform, which for us is CTFd. Other solutions exist, but we are used to it, and it is probably the most battle-tested open-source CTF platform out there. We also hosted CTFd on Kubernetes, leveraging OVHcloud's SaaS offerings for its stateful components (MariaDB, Redis, and S3). We wrote a Helm chart to manage the deployment, which may be useful to others.

Most challenges were simply deployed as Kubernetes deployments, with three to five replicas depending on the workload, and a service that load balances between the challenge's pods. That may be a bit overwhelming if you've never played with Kubernetes, but it really isn't that complicated. All you need to understand is that each challenge is replicated in different pods (which are analogous to Docker containers in our case), and that a service sits in front of the pods, forwarding the connections it receives from players to the pods and balancing the load.

Kubernetes deployment schema
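
If you are curious, here is a minimal sketch of what such a deployment and service pair could look like (the challenge name, image, and port are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-challenge  # Hypothetical challenge name.
spec:
  replicas: 3  # Three to five replicas depending on the workload.
  selector:
    matchLabels:
      app: example-challenge
  template:
    metadata:
      labels:
        app: example-challenge
    spec:
      containers:
        - name: challenge
          image: registry.example.org/example-challenge:latest  # Hypothetical image.
          ports:
            - containerPort: 1337
---
apiVersion: v1
kind: Service
metadata:
  name: example-challenge
spec:
  selector:
    app: example-challenge  # Matches the deployment's pods.
  ports:
    - port: 1337  # The service balances players' connections across the pods.
      targetPort: 1337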

Securing challenges

There are a few things we need to keep in mind when deploying challenges to ensure the CTF is secure and fair for all players:

  1. We need to protect the infrastructure against malicious players. We don't want it to get compromised, that would be bad for everyone.
  2. We need to isolate players from each other. For example, we don't want players to solve challenges simply by looking up what other players are doing using logs or listing processes. It is also important that players with malicious intentions cannot target other players.
  3. We need to prevent our infrastructure from being used nefariously by players with malicious intentions. If we provide users with a shell in a container, the last thing we want is to have them use it to pivot and attack external services.

These goals draw a clear line between challenges that grant remote code execution by design and challenges that do not.

You see my friend, there are two types of CTF challenges - Those with RCE, and those easy to host

Following in the footsteps of the 2022 edition, we stuck with nsjail for most RCE challenges (QEMU was also used in some cases). It has the good taste of disabling networking by default in the sandboxes it creates, but the defense-in-depth principle dictates that it is a good idea to have additional mechanisms in place to enforce objective 3. This is especially true considering that many challenges did not use nsjail.

Challenge networking

So, we considered our options to prevent challenge pods from doing any networking except responding to their service, all the while keeping things simple and manageable enough, so we could apply it to our 50 or so challenges. It turns out that Kubernetes has got it covered with network policies.

While they are quite powerful, they remain relatively simple. Most importantly, they are stateful filters, meaning they remember TCP connections and can tell which way they were opened, differentiating between ingress (incoming connections), and egress (outgoing connections). By default, all traffic is allowed, but as soon as a declared network policy matches a pod, the filtering switches to an allow-list paradigm where connections have to be explicitly allowed by the policy.

One limitation of network policies is that we cannot match services by name, so we couldn't easily write an ingress policy that would only allow connections to pods from their own service. We did not consider this a huge risk, and decided instead to limit what the challenge pods could do, focusing on egress policies. The policy itself was simple: deny all egress traffic for any pod in the challenge namespace.

# Kubernetes manifest for our default challenge network policy,
# which forbids all egress traffic from any pod.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default
spec:
  podSelector: {}  # An empty pod selector matches all pods.
  policyTypes:
  - Egress  # We declare the 'Egress' type so the policy applies an allow-list to all egress traffic.

# And we don't declare the allow-list,
# leaving it empty and effectively forbidding all egress traffic.

The problem & its (incomplete) solution

Now, if you played the CTF and attempted a specific web challenge, you may already have realized that this policy poses a problem. Some challenges do, in fact, need to communicate with the Internet. More precisely, one of the challenges, L'épistolaire moderne, requires players to exploit a Server Side Request Forgery (SSRF) vulnerability to steal cookies.

L'épistolaire moderne challenge screenshot

Players were given a website on which they could chat with La princesse de Clève, and the link preview feature could be used to make HTTP GET requests to arbitrary URLs from the challenge's pod. I have redeployed the challenge on a dedicated OVHcloud managed Kubernetes cluster for the purpose of this article, so the following screenshots of the challenge are not from the actual CTF.

Chat screenshot

So, in order for this challenge to work, we need to allow it to make GET requests to web servers controlled by players. In other words, it needs to be able to access the internet.

Now, we don't want to allow players to perform arbitrary SSRF from our pod, as it could potentially access anything in the cluster's network. That includes the Kubernetes API, other pods, etc.

The first limitations are put in place by the challenge itself: it only makes GET requests to HTTPS URLs, and it only returns a sanitized version of the page's title.

@app.route("/api/preview-title", methods=["GET"])
@token_required
def preview_title(id):
    # Get the 'url' query parameter.
    url = request.args.get("url")
    if (
        re.match(
            r"^https:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~\#\?&\/=]*)$",
            url,
        )
        == None
    ):
        return {"error": "URL invalide"}, 400
    r = requests.get(url)
    html = r.text
    title = html[html.find("<title>") + 7 : html.find("</title>")]
    title = sanitize(title)
    return {"title": title}, 200

The second limitation we put in place was to forbid the pod from accessing anything inside the cluster, using a second network policy applied on top of the first one. Notice that this policy only matches pods with the egress-allow-internet label set to true, which in practice meant only the pods of the Épistolaire moderne challenge.

Combined, our two policies allow the challenge to make connections to any IP address that is not in the 10.0.0.0/8 range, which corresponds to our cluster's internal network.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internet
spec:
  podSelector:
    matchLabels:
      egress-allow-internet: 'true'
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
  policyTypes:
  - Egress

And with that, we thought we were all good: our challenges were well protected, and our infrastructure was secure.

The flaw

One thing we wanted to do in 2023 was to get a chance to meet players of all levels, not just the top 100 who would get invited to the prize ceremony. This led us to organize a live event at CampusCyber halfway through the CTF, during which a second wave of challenges was released. L'épistolaire moderne was part of that second wave. The event was amazing, and besides the Kubernetes cluster deciding to upgrade itself right when we were about to release the new challenges, everything went fine, and we had a wonderful time meeting everyone.

But only a few minutes after we left CampusCyber, we received messages from one of the players about an SSRF vulnerability that was allegedly exposing our AWS infrastructure.

Discord screenshot

The first message puzzled us a bit, as the CTF was hosted by OVHcloud, not AWS. But the second one made it quite clear: there are apparently internal services outside the 10.0.0.0/8 network reachable from OVHcloud's managed Kubernetes. 169.254.169.254 is a well-known IPv4 address in the link-local reserved subnet, used by AWS to expose their metadata API to EC2 instances.

The thing is, AWS is not the only cloud provider to use this address: OpenStack, on which OVHcloud's public cloud is based, implements an AWS-compatible metadata service. Since our managed cluster's nodes were hosted on that OpenStack infrastructure, the service was reachable from the cluster.
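
Concretely, from a pod on such a cluster, reading the metadata comes down to a single unauthenticated request (a sketch; the exact fields returned depend on the OpenStack deployment):

import requests

# OpenStack serves instance metadata on the same link-local address as AWS,
# as a single JSON document.
r = requests.get("http://169.254.169.254/openstack/latest/meta_data.json", timeout=2)
print(r.json())  # Instance name, availability zone, public SSH keys, ...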

Luckily, the Parisian metro has good 5G coverage, and we were able to implement a fix in about five minutes (more on that later). Only later did we have time to take a closer look at the problem.

The exploit

So, how can we force the link-preview feature from above to SSRF to http://169.254.169.254?

There are, in fact, two things we need to circumvent:

  1. Making a request to an HTTP server when the challenge only accepts HTTPS URLs.
  2. Retrieving interesting data and not just the title of the page.

For the first one, we can send an HTTPS URL to a domain we control, and have it redirect to the metadata URL. There are many ways to achieve this, and I ended up using a simple NGINX configuration to redirect https://redirect.smyler.net to http://169.254.169.254/openstack/latest/meta_data.json, which returns all the metadata as JSON.

server {
  server_name redirect.smyler.net;
  return 301 http://169.254.169.254/openstack/latest/meta_data.json;
}
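
This works because the requests library follows redirects by default, so the challenge's HTTPS-only check only ever sees the URL the player submitted, never the final destination. A minimal sketch of what the challenge's requests.get() call ends up doing, as seen from a pod inside the cluster:

import requests

# requests.get() follows redirects by default (allow_redirects=True), so an
# https:// URL can silently land on a plain http:// endpoint.
r = requests.get("https://redirect.smyler.net")
print(r.history[0].status_code)  # 301: the NGINX redirect.
print(r.url)  # http://169.254.169.254/openstack/latest/meta_data.json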

As for the second one... Well, there isn't anything to do. The challenge's code assumes there is an HTML title somewhere in the response, and fails when there isn't one.

r = requests.get(url)
html = r.text
title = html[html.find("<title>") + 7 : html.find("</title>")]
title = sanitize(title)

Indeed, Python's str.find() returns -1 when it doesn't find anything, but -1 is a perfectly valid index in Python, referring to the last character of the string. So for any response that contains neither <title> nor </title>, that third line of code returns everything except the first six characters and the last one.
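
A quick illustration of the pitfall:

# Neither tag is present, so both find() calls return -1.
html = "metadata response with no title tags"
start = html.find("<title>")  # -1
end = html.find("</title>")   # -1
print(html[start + 7 : end])  # html[6:-1]: 'ta response with no title tag'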

At that point, we know we can send La princesse de Clève a message containing https://redirect.smyler.net to access the metadata service.

SSRF demonstration

The fix

The first fix, deployed as soon as we heard about the issue, was to modify our network policy. We wanted to allow the challenge access to the public internet, and the 169.254.0.0/16 network is, in fact, not part of that public internet. So we blocked all addresses from IANA's special-purpose address registry (including some that probably wouldn't have worked anyway).

Network policy diff view after fix
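
For reference, here is a sketch of what the updated allow-internet policy looks like (the except list below is abridged; the one we deployed covered all of the registry's IPv4 blocks):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internet
spec:
  podSelector:
    matchLabels:
      egress-allow-internet: 'true'
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8      # Private range, includes the cluster network.
              - 169.254.0.0/16  # Link-local, includes the metadata service.
              - 172.16.0.0/12   # Private range.
              - 192.168.0.0/16  # Private range.
              - 100.64.0.0/10   # Carrier-grade NAT.
  policyTypes:
  - Egress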

Note that Kubernetes network policies were our only option to prevent access to 169.254.169.254, as alternatives like OpenStack security groups would have acted at the cluster node level, and nodes need to be able to access the metadata service to function properly, according to OVHcloud's documentation.

The second fix we implemented was to cap the chatbot API's page titles at 20 characters, which would have severely limited any further attempts to leak data through the bot.

    html = r.text
    title = html[html.find("<title>") + 7 : html.find("</title>")]
-   title = sanitize(title)
+   title = sanitize(title)[:20]

Now, had we planned this better, there are additional measures we could have taken to further control the SSRF, such as routing the challenge's outbound requests through a dedicated filtering proxy, or making them from a separate, isolated pod so the challenge itself never needs internet access.

We might implement those in future CTFs.

Impact

So, how bad was it? Not that bad, as far as we could tell.

The metadata service seems quite limited, and the only real pieces of data that may have been exposed to players were the SSH public keys of three HackademINT staff members. Still, this is an issue we thought we had covered, and it shows how important it is to know and secure the entire infrastructure stack, especially in the context of a CTF, and especially when we start voluntarily poking holes like allowing SSRF.

It was a great lesson overall, and we even got a dedicated meme from the players at the end of the CTF, which is always cool.

Meme
