Finding Crown Jewels: Hunting Through 180,000 Ruby Gems

Overview

Anvil CTO, Vincent Berg, pointed a homemade scanner at nearly 180,000 Ruby gems and started pulling on threads when he found something funny. What started as a simple experiment quickly uncovered a surprising number of security issues, packaging mistakes, and exposed secrets hiding in public software packages. He recounts how a small side project turned into a deep dive through the Ruby ecosystem.


"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' (I found it!) but 'That's funny…'"

-Isaac Asimov…?

This post is a version of a talk I gave at Hammercon, Anvil's internal conference. In it, I explored what happens when you point a homemade scanner at every published Ruby gem and start pulling on threads. While a lot of companies try to sell complex dependency management and supply chain security analysis systems, I wanted to experiment and show that, with relative ease, you can still do this from the comfort of your own home. All you need is some time, crummy Python scripts, and a bit of data storage. I started with some home tool building, which led me to discover a permission issue by accident, and that then turned into a more structured analysis of Ruby gems.

If you just want the tool, it lives at github.com/anvilsecure/crownjewelscanner. The rest of this post is the journey.

The Thread

This started as a side-quest from an unrelated project. I'd built a small filesystem snapshotter: a simple tool that diffs what is on disk before and after installing a package. Think of it as a more generalized version of dawgmon. I mostly wanted to use it to easily analyze all the artifacts that crummy installers of commercial enterprise software scatter all over one's system. At some point, while scanning a VM on which I did random development work, my tool stumbled over a bunch of world-writable files.

-rwxrwxrwx 1 root root 95 Feb 6 20:55 lib/lightly/lightly.rb

A .rb source file, owned by root, world-writable, straight out of gem install. Any local user could rewrite that file, wait for the next process to require it, and execute code as whoever loaded the gem next. Not exactly the end of the world. And you need local access to exploit it. But unambiguously broken, and that was a "that's funny" moment. The next question was the obvious one: are there a lot of these types of gems out there with insecure file permissions? I decided to have a look.
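The snapshot-and-diff idea behind the tool that surfaced this file can be sketched in a few lines of Python. This is a minimal sketch and not the actual snapshotter: the function names snapshot and diff are made up for illustration, and it only compares permission bits, size, and modification time on a POSIX system.

```python
import os
import stat


def snapshot(root):
    """Map each path under root to (mode bits, size, mtime_ns), without following symlinks."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            rel = os.path.relpath(path, root)
            state[rel] = (stat.S_IMODE(st.st_mode), st.st_size, st.st_mtime_ns)
    return state


def diff(before, after):
    """Return (added, removed, changed) sets of relative paths between two snapshots."""
    added = set(after) - set(before)
    removed = set(before) - set(after)
    changed = {p for p in set(before) & set(after) if before[p] != after[p]}
    return added, removed, changed


def world_writable(state):
    """Relative paths in a snapshot whose mode has the 'other' write bit set."""
    return {p for p, (mode, _size, _mtime) in state.items() if mode & stat.S_IWOTH}
```

Taking one snapshot before an install and one after, then passing both to diff, gives the set of new paths; running world_writable on the second snapshot would flag a file like the one above.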

First Let's Mirror the World

Before scanning anything, I needed all the gems. Given that I did not have a massive NAS lying around, I decided to focus on the latest version of each gem. To download all these gems there is a gem called rubygems-mirror. Configuring it is as easy as doing something akin to:

$ gem install rubygems-mirror
$ cat ~/.gem/.mirrorrc
--- 
- from: http://rubygems.org 
  to: /gems 
  parallelism: 10 
  retries: 3 
  delete: false 
  skiperror: true 
  hashdir: false 
$ export RUBYGEMS_MIRROR_ONLY_LATEST=TRUE 
$ gem mirror 
Total gems: 181001 
$ du -hs /gems/ 
54G     /gems

After downloading roughly 54 GB and around 180,000 gems, I had the latest version of each gem.

What Even is a Gem?

A gem is just a Ruby package. But I mostly stayed away from the Ruby ecosystem as I'm a Pythonista at heart and, on principle, I refused to read any documentation on the gems. I just picked a random gem and started poking at it. First, I ran file and then just extracted its contents:

$ file zzheynow-0.1.0.gem 
zzheynow-0.1.0.gem: POSIX tar archive 
$ tar xvf zzheynow-0.1.0.gem 
metadata.gz 
data.tar.gz 
checksums.yaml.gz

Cool. A gem is a tarball with three files inside it:

  • metadata.gz: a gzipped YAML blob describing the gem (name, version, authors, dependencies, etc.). It's actually a serialized Gem::Specification Ruby object.
  • checksums.yaml.gz: the SHA-256 hashes of the other files and possibly also the SHA-512 hashes.
  • data.tar.gz: actual code/contents of the gem.

All this sounded reasonable. Now I sort of knew what gems look like. So I decided to write a script to start unpacking all the gems and analyzing what I found.
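The same poking around can be done from Python with the standard library. The following is a minimal sketch under the assumption that a gem is a plain (uncompressed) tar archive as shown above; inspect_gem and missing_members are hypothetical helper names, and the metadata is returned as raw text rather than parsed, since it is a serialized Ruby object.

```python
import gzip
import tarfile

# The three members described above.
EXPECTED = {"metadata.gz", "data.tar.gz", "checksums.yaml.gz"}


def inspect_gem(path):
    """Return (top-level member names, decompressed metadata text or None)."""
    with tarfile.open(path, "r:") as outer:  # "r:" means: no outer compression
        names = outer.getnames()
        metadata = None
        if "metadata.gz" in names:
            member = outer.extractfile("metadata.gz")
            if member is not None:  # None for non-regular members such as links
                metadata = gzip.decompress(member.read()).decode("utf-8", errors="replace")
    return names, metadata


def missing_members(names):
    """Which of the expected top-level members are absent."""
    return EXPECTED - set(names)
```

Nothing is extracted to disk here; the archive is only read, which sidesteps the extraction hazards that come up later in this post.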

Assert Everything!

Building the scanner became an exercise in challenging every rule and assumption that I would make about the gem ecosystem. I would write code that assumed something that obviously must be true, such as "every gem has a data.tar.gz" because otherwise a gem is worthless. That seems a fair assumption. However, this is not enforced by Ruby gems ingestors and the wider gem ecosystem. So I found gems that had no data.tar.gz whatsoever. I would run my tool and then watch it explode on a few hundred edge cases with each of them turning into a "that's funny!" moment.

Here's a short, non-exhaustive list of things I identified:

  • Some gems have no data.tar.gz at all. They just have metadata and are basically an empty package, technically valid, sitting out there on rubygems.org.
  • Roughly 50,000 gems have no checksums.yaml.gz. This seems to be because of checksum signing being introduced at a later stage to the gem format.
  • At least one gem has no metadata.gz. Don't ask me how it got published. There is obviously no strict check upon upload.
  • Some gems use a totally different layout. This is where the contents are for example directly in-lined under data/ rather than wrapped in data.tar.gz. A case of tarring them up and creating the gems by hand gone wrong I presume.
  • Some gems have signature files (metadata.gz.sig, data.tar.gz.sig, etc.) at the top level, bringing the file count to six. About 1.14% of gems are signed. I never bothered actually verifying any signatures, but it would be interesting to see if there are lots of mistakes there and how many are validly signed.
  • Some gems try to escape the extraction directory via symlinks pointing at things like /Users/<someone>/.streerc, /etc/host, or absolute paths into the original author's home directory. If you naively tar xf them as the wrong user, fun things might happen all over your file system. Beware when extracting them.
  • Versioning is a free-for-all. I assumed filenames were using regular semver type versioning akin to <name>-<major>.<minor>.<patch>.gem. Reality includes things like:
    NSecurityTest-111111111111111111111111111111111111111111111111111111111
    111111111111111111111111111111111111111111111112.gem

    Yes, that's the actual version "number." It's a 100+ digit integer.
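For the symlink and path-escape cases in the list above, a check can be run on the archive listing before anything touches the disk. This is a minimal sketch, assuming POSIX paths and a hypothetical destination directory; escaping_members is an illustrative name, and the check does not follow chains of links inside the archive, so it is a first filter rather than a complete defense.

```python
import os
import tarfile


def escaping_members(tar, dest="/tmp/extract"):
    """Yield (member name, reason) for members that would land outside dest."""
    dest = os.path.abspath(dest)

    def inside(path):
        # abspath collapses ".." components; an absolute member name replaces dest entirely
        return os.path.commonpath([dest, os.path.abspath(path)]) == dest

    for m in tar.getmembers():
        target = os.path.join(dest, m.name)
        if not inside(target):
            yield m.name, "path"
        elif m.issym() or m.islnk():
            # Symlink targets resolve relative to the link's directory,
            # hard link targets relative to the archive root.
            base = os.path.dirname(target) if m.issym() else dest
            if not inside(os.path.join(base, m.linkname)):
                yield m.name, "link -> " + m.linkname
```

Recent Python versions also support extraction filters in tarfile (for example filter="data" on extractall), which reject many of these cases at extraction time.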

The lesson kept repeating itself: if the format allowed it, somebody did it. As such I asserted every assumption I made about the gem format and then let the assertions fail as I scanned all the downloaded gems. This yielded the list above. As I went through this, I dumped all the metadata into a SQLite database. One row per gem with columns such as is_signed, has_datagz, has_checksum, has_metadata etc. I also stored the SHA-256 of the gem, parsed checksum results, and so on.
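A one-row-per-gem index along these lines could look as follows. This is a sketch only: the column names is_signed, has_datagz, has_checksum, and has_metadata come from the description above, while the table name, the filename and sha256 columns, and the record_gem helper are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS gems (
    filename     TEXT PRIMARY KEY,
    sha256       TEXT NOT NULL,
    is_signed    INTEGER NOT NULL,
    has_datagz   INTEGER NOT NULL,
    has_checksum INTEGER NOT NULL,
    has_metadata INTEGER NOT NULL
)
"""


def record_gem(conn, filename, sha256, names):
    """Store structural flags derived from a gem's top-level member names."""
    top = set(names)
    conn.execute(
        "INSERT OR REPLACE INTO gems VALUES (?, ?, ?, ?, ?, ?)",
        (
            filename,
            sha256,
            int(any(n.endswith(".sig") for n in top)),  # any signature file present
            int("data.tar.gz" in top),
            int("checksums.yaml.gz" in top),
            int("metadata.gz" in top),
        ),
    )
```

With the flags stored this way, each failed assumption becomes a simple query, for example SELECT filename FROM gems WHERE has_datagz = 0.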

Once the scanner could reliably open every gem and do some cursory indexing, the interesting question became: what's in there that shouldn't be? I scanned for a handful of things.

Further Analysis

First, I decided to look at the file permissions. That is what got me started on this project in the first place: the "that's funny" realization at the start of this blogpost. It turns out that that one specific gem had a lot of cousins out there. After indexing I found that:

  • ~2.6k gems contain world-writable .rb source files (~1.5% of the ecosystem).
  • 3.5k gems contain world-writable files of any kind.
  • 4 gems contain setuid executables.
  • 11 gems contain setgid executables.

Most of the setuid/setgid cases were unexploitable in practice. They were Ruby scripts with a #!/usr/bin/env ruby shebang, which on Linux means that the setuid bit gets ignored upon execution. Then there were text files with weird permissions, for example ./.idea/vcs.xml with -rwsrwsrwt, and a bunch of other non-ELF files like .travis.yml which are not executable at all. These are most likely the result of wrongly applied chmod calls, like a blanket chmod -R 777.
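The permission checks themselves come down to testing a few mode bits on each archive member. A minimal sketch, assuming the member list comes from the gem's inner data.tar.gz opened with Python's tarfile module; risky_modes is a hypothetical name and not necessarily how the scanner structures this.

```python
import stat
import tarfile


def risky_modes(tar):
    """Yield (member name, flags) for regular files with dangerous mode bits."""
    for m in tar.getmembers():
        if not m.isfile():
            continue
        flags = []
        if m.mode & stat.S_IWOTH:  # writable by "other"
            flags.append("world-writable")
        if m.mode & stat.S_ISUID:
            flags.append("setuid")
        if m.mode & stat.S_ISGID:
            flags.append("setgid")
        if flags:
            yield m.name, flags
```

Whether a flagged file is actually dangerous still needs a second look, such as checking whether it is a native binary or just a script or text file.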

There was one notable exception: a macOS-only gem shipping a setuid+setgid root Mach-O binary in lib/<name>/vendor/. Anyone running sudo gem install <name> could then end up with a privilege escalation if there are bugs in said binary. Of course, for all of these issues the fix is "don't preserve those bits during install," but the gem installer doesn't enforce that. Simply running gem install <gemfile> as root would result in issues.

The mirror's heaviest gems make for entertaining reading as well:

492M  finnhub_ruby-1.1.19.gem
455M  emojidex-rasters-1.0.34.gem
416M  my-wkhtmltopdf-binary-0.12.6.8.gem
416M  wkhtmltopdf-binary-0.12.6.8.gem
334M  rhodes-7.6.0.gem
307M  foundational_lib-1.0.1.gem
278M  wkhtmltopdf-binary-arm64-0.12.6.8.gem

Some of these are easily explained. For example, the gem wkhtmltopdf-binary ships the whole wkhtmltopdf binary which renders complex websites to a PDF output. Others are weirder however:

  • finnhub_ruby doubles in size each release. v1.1.7 is 125K. v1.1.19 is 492M. Why? It appears to ship every previous version of itself, recursively. There are more Ruby gems that have this issue.
  • One ~141 MB gem turned out to contain an 85 MB QuickTime walkthrough video of the gem's CLI. That's funny!
  • A ~132 MB gem contained four trained ONNX models for face detection (YuNet) and object detection (YOLOv5/YOLOv8 in two sizes each). It's a Ruby binding for those models, so it's not entirely unreasonable, but it does mean every CI install pulls 132 MB of weights down which seems like a fair amount.
  • One ~265 MB gem in-lined its entire vendor/ tree directly rather than declaring its dependencies. This reinvents static linking pretty badly, and you might end up with outdated, vulnerable dependencies in your code.
  • One ~200 MB gem was just a 200 MB file of null bytes. Inside the tarball, alongside the normal gem contents, was a file named big that consisted just of NULL-bytes for the entire 200 MB.

Once I had the scanner reading file paths, the obvious next move was looking for other interesting files. Initially I just flagged on filenames and extensions. A quick list:

  • Database files: .sqlite, .sql, .db
  • Environment files: .env, .env.production, etc.
  • Archives nested inside archives: .tar, .tgz, .gz, .zip
  • Backup/swap files: .bak, .swp, .swo
  • Sensitive directories anywhere in the path: .ssh/, .git/, .aws/, .vim/
  • Credential filenames: authorized_keys, passwd, creds, id_rsa, id_dsa
  • Office/document leaks: .xlsx, .docx, .pdf, .rtf
  • Code from other languages: .py, .java, .c, .rs, .jar, .php

The first pass had way too many false positives. For example, there were a lot of tests in all these gems that match on id_rsa if you don't filter properly. So, I added a heuristic exclusion for paths containing test/, tests/, spec/, templates/, docs/, examples/, samples/. That cut the noise dramatically.
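The filename pass with its exclusion heuristic can be sketched as below. The sets hold a subset of the names and extensions listed above, and is_interesting is a hypothetical helper. One assumption to flag: this version matches whole path components, whereas the filter described above may simply have been a substring check.

```python
import posixpath

SENSITIVE_DIRS = {".ssh", ".git", ".aws", ".vim"}
SENSITIVE_NAMES = {"authorized_keys", "passwd", "creds", "id_rsa", "id_dsa"}
SENSITIVE_EXTS = {".sqlite", ".sql", ".db", ".bak", ".swp", ".swo",
                  ".xlsx", ".docx", ".pdf", ".rtf"}
EXCLUDED_DIRS = {"test", "tests", "spec", "templates", "docs", "examples", "samples"}


def is_interesting(path):
    """True if an archive path looks sensitive and is not under an excluded directory."""
    parts = [p for p in path.split("/") if p not in ("", ".")]
    if not parts:
        return False
    *dirs, name = parts
    if EXCLUDED_DIRS & set(dirs):
        return False  # likely a fixture, template, or documentation sample
    if SENSITIVE_DIRS & set(dirs):
        return True
    if name in SENSITIVE_NAMES or name == ".env" or name.startswith(".env."):
        return True
    return posixpath.splitext(name)[1].lower() in SENSITIVE_EXTS
```

The exclusion runs first on purpose, so a key under spec/ is dropped before any of the positive rules get a chance to fire.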

After that filtering, I ended up with roughly 16k gems containing at least one "interesting" path. That's about 9% of the ecosystem.

A handful of standouts (gem names redacted where the issue may not yet be resolved):

  • A small gem with a credentials.tar.gz next to the normal data.tar.gz, at the top level of the gem itself. Inside was a single YAML file containing a working Ruby gems API key. Whoever published this gem accidentally tarred their entire packaging directory, including the credentials file the gem push command had just used. The key was active when I tested it; it was rotated shortly after disclosure.
  • A gem containing a .env file with a SafeCharge merchant API key, two database passwords, and a cPanel password. The .env even had helpful inline comments explaining which IP each value pointed at.
  • A gem containing the maintainer's full CI configuration, including an encrypted GPG private key, a passphrase to decrypt it, and a separate Ruby gems API credentials file. The keys appeared to have been rotated by the time I checked.
  • A gem shipping lib/<name>/keys/id_dsa and lib/<name>/keys/id_rsa. These turned out to be intentional test fixtures, not real keys. False positive, but a useful one: the scanner caught it and I did the disambiguation.
  • A ~135 MB gem containing decade-old MySQL dumps from what appeared to be a former employer's internal systems. Real user records, real Social Security numbers (hashed but stored alongside the originals), real salaries, real W-9 data. The gem was a maintenance tool that had been published with the developer's working directory still attached. It took a while to properly disclose this, but in the end the gems were pulled from rubygems.org.
  • A gem containing real customer-facing files: account spreadsheets, an "active employee report" XLSX, building access pass listings, user PDFs, and a CSV of users from an internal system of a branch of an internationally operating bank in New York City. None of it should have been anywhere near a public package registry. These packages were yanked from the Ruby gems repository after reporting through the appropriate channels.

Secret Scanning

The filename-based pass found a lot, but plenty of secrets don't live in conveniently-named files. I considered using something off the shelf like gitleaks or trufflehog, but in the end I wanted to build a little rule engine myself. For the rules I used the great Rexpository. This is a community-maintained collection of regular expressions for detecting credentials of various flavors (think of AWS credentials, Stripe API keys, GitHub tokens, JWT, private keys, etc.). I took a subset of these rules that I found interesting and wrote some Python code around it.

The first run took forever. A few practical lessons emerged immediately:

  • Some regexes are catastrophically backtracking. A few rules in the original collection suffer from ReDoS, so I ended up rewriting the code to simply bail out whenever a single-file scan took longer than 5 seconds.
  • Some rules match too much. The rules for UUIDs, generic API tokens, and SHA-512 hashes, for example, each accounted for tens of thousands of "hits". These rules are not completely useless, but they would need to be combined with another set of rules to properly triage them.
  • Bail out after N matches per rule per file. If a single log file has 5,000 instances of the same Blowfish hash match, I will not store all of them; I just bail out. My database will say "this file has a lot of instances of x" and I will move on. If needed, I can do a manual inspection later.
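The match cap and the time budget from the list above can be sketched like this. The two rules are illustrative and not taken from the rule collection mentioned earlier, and scan_text with its parameters is a hypothetical interface. The deadline is only checked between rules and between matches, so it cannot interrupt a single pathological regex call; that would need a subprocess with a timeout or a regex engine that supports timeouts.

```python
import re
import time

# Illustrative rules only.
RULES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text, rules=RULES, max_hits=20, budget_s=5.0):
    """Return ({rule: [matches]}, set of truncated rules, timed_out flag)."""
    deadline = time.monotonic() + budget_s
    hits, truncated = {}, set()
    for name, pattern in rules.items():
        if time.monotonic() > deadline:
            return hits, truncated, True
        for match in pattern.finditer(text):
            found = hits.setdefault(name, [])
            if len(found) >= max_hits:
                truncated.add(name)  # remember "a lot of these" and move on
                break
            found.append(match.group(0))
            if time.monotonic() > deadline:
                return hits, truncated, True
    return hits, truncated, False
```

Recording the truncated rule names instead of every match keeps the database small while still pointing at files that deserve a manual look.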

After some back and forth and further tuning, the database ended up with about 94,000 secret-rule hits across the corpus. The truly interesting subset was much smaller, but it had interesting things in it:

  • 8 active Ruby gems API keys were found embedded in published gems. Several were still valid at scan time. Given that with these Ruby gems API keys you can push new versions, this can lead to supply-chain attacks on any downstream user of these gems as an attacker can publish new, backdoored versions of these gems.
  • AWS credentials in at least one gem dating back roughly 14 years. The access key and secret were in a YAML config. I confirmed the key was still attached to a real (root!) account by calling aws sts get-caller-identity. The maintainer responded and rotated very quickly once I disclosed.
  • An OpenAI API key with the sk- prefix, sitting hard-coded in lib/<name>/gpt.rb. By the time I tested it the key had moved past the OpenAI deprecation cliff for text-davinci-003, but the key itself was still authenticating fine.
  • PostgreSQL connection strings with the same credentials reused across production, development, and test blocks of the same database.yml.
  • Trello API keys, Pastebin API keys, Drupal hashes (mostly false-positive-y test fixtures), and a long tail of Blowfish/bcrypt strings scraped from spec/dummy/log/ files that nobody intended to commit.

Summarizing

After all the iteration, the final stats from one full run looked roughly like this:

#gems: 179,686  #secrets: 93,788
Gems with invalid toplevel:                 4
Gems with binaries:                         1,245
Gems with world-writable .rb files:         2,606
Gems whose extraction escapes its path:     11
Gems containing setuid executables:         4
Gems containing setgid executables:         11
Gems with at least one "interesting" path:  16,357

That is a lot of potential secrets identified, but a good chunk will be false positives.
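To put the counts above in proportion to the corpus, the shares can be recomputed directly from the table. This is plain arithmetic on the numbers reported there; the share helper is just for illustration.

```python
TOTAL_GEMS = 179_686

# Counts copied from the summary table above.
COUNTS = {
    "world-writable .rb files": 2_606,
    'at least one "interesting" path': 16_357,
    "binaries": 1_245,
}


def share(count, total=TOTAL_GEMS):
    """Percentage of the corpus that a count represents."""
    return 100.0 * count / total


for label, count in COUNTS.items():
    print(f"{label}: {share(count):.2f}%")
```

The first two shares line up with the roughly 1.5% and 9% figures quoted earlier.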

If you're going to do something like this, my advice is to assert every assumption you make and let the data you run through correct you. Every wrong assumption I made about the gem format (top-level structure, version syntax, file presence, encoding) turned into an assert that fired on real input. Each fired assertion was a discovery and taught me something new about the Ruby gems ecosystem.

The scanner can be found at github.com/anvilsecure/crownjewelscanner. It is a quick and dirty research tool written in Python so don't expect anything polished, but it might help those interested in reproducing this work and/or figuring out a way to build upon it further.

The biggest thing I took away from this project is how much of what's "wrong" in a published package ecosystem is just human carelessness or lack of attention. Nothing in the gem format actively encourages people to ship .env files or world-writable source files or include API credentials, but nothing prevents it either. Once a gem goes out, it's out there, hence the crown jewels don't get stolen so much as accidentally left in the parking lot. That's funny.

About the Author

Vincent Berg is the Chief Technical Officer at Anvil Secure. Vincent's strong technical background and years of consulting experience drive his belief that technical excellence and professionalism should be at the core of everything we do at Anvil. As CTO, he guides research and technical content, while maintaining a client-focused approach.
