You’ve done everything right. You wrote the logic, crafted meticulous unit tests for every success and failure path, and watched your CI pipeline glow a satisfying green. The code is shipped. For weeks, everything works perfectly.
Then, the weird bug reports start trickling in. A duplicate record in the database. A corrupted file. An error that "should be impossible" because your code explicitly checks for that condition. You stare at your code, you stare at your tests, and you can’t see a single thing wrong.
If this sounds familiar, you might have fallen victim to one of the most subtle and frustrating bugs in software development: the time-of-check to time-of-use (TOCTOU) race condition.
When we developers hear “race condition,” our minds usually jump to the classic textbook example: multiple threads trying to increment the same counter.
    # A classic (and oversimplified) race condition
    shared_counter = 0

    def increment():
        global shared_counter  # needed to rebind the module-level name
        # Thread A reads shared_counter (0)
        # Thread B reads shared_counter (0)
        # Thread A calculates 0 + 1
        # Thread B calculates 0 + 1
        # Thread A writes 1 to shared_counter
        # Thread B writes 1 to shared_counter
        shared_counter += 1

    # After two concurrent calls. Expected result: 2, Actual result (with this interleaving): 1
This happens because the operation isn’t atomic. The read, modify, and write steps can be interleaved between threads, leading to incorrect state. While these are tricky, we have tools like mutexes and locks to manage them, and we can often simulate them in specialized multi-threaded tests.
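For comparison, here is a minimal sketch of the lock-based fix for the counter, using Python's standard threading module. The name counter_lock and the choice of 100 threads are assumptions made for illustration.

```python
import threading

shared_counter = 0
counter_lock = threading.Lock()  # hypothetical name; guards shared_counter


def increment():
    global shared_counter
    # Only one thread at a time can run the read-modify-write below.
    with counter_lock:
        shared_counter += 1


threads = [threading.Thread(target=increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_counter)  # 100
```

The lock turns three interleavable steps into one step that other threads cannot observe halfway through, which is the same idea the rest of this article applies to databases and files.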
But the race condition we’re talking about today is different. It’s sneakier because it doesn’t require complex multi-threading in your application code. It can happen in a standard web server handling two simple, simultaneous requests.
The silent race condition I'm talking about stems from a very logical, very common, and very flawed pattern: Check-Then-Act.
It has two steps: first check whether some condition holds, then act on the answer.
The fatal flaw is the tiny, imperceptible gap between the "check" and the "act." In that gap, the state of the world can change.
While your code is moving from the check to the act, another process, another thread, or another server instance running the same code could have already acted, invalidating your initial check.
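To make the gap visible, here is a small sketch that forces the unlucky interleaving on purpose. Everything in it is hypothetical: claimed stands in for any shared state, claim is a made-up function, and the barrier exists only to guarantee that both threads finish their check before either one acts.

```python
import threading

claimed = []                    # stands in for shared state (a table, a directory, ...)
barrier = threading.Barrier(2)  # releases only once both threads have done their check


def claim(job_id):
    already_claimed = job_id in claimed  # CHECK
    barrier.wait()                       # the gap: the other thread checks here too
    if not already_claimed:
        claimed.append(job_id)           # ACT on an answer that is now stale


threads = [threading.Thread(target=claim, args=("job-1",)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(claimed)  # ['job-1', 'job-1']
```

In real systems nothing forces this ordering, which is why the bug shows up rarely and only under load.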
Let's look at a typical user registration function in a web application. The requirement is simple: usernames must be unique.
Here’s the intuitive, but flawed, way to write it:
    // A standard Express.js route handler
    app.post('/register', async (req, res) => {
      const { username, password } = req.body;

      // 1. CHECK
      const existingUser = await db.users.findOne({ where: { username } });
      if (existingUser) {
        return res.status(409).send({ error: 'Username already taken.' });
      }

      // 2. ACT
      const newUser = await db.users.create({ username, password });
      return res.status(201).send(newUser);
    });
This code looks perfectly reasonable. It checks if the user exists and only creates one if it doesn't.
Now, imagine two users, Alice and Bob, trying to register with the exact same username, clever_dev, at nearly the same time.
1. Alice's request: the findOne query runs. No user named clever_dev is found. existingUser is null.
2. Bob's request: the findOne query runs. Alice's user hasn't been created yet, so no user named clever_dev is found. existingUser is null.
3. Alice's request: the code skips the if block and executes db.users.create(). Alice's account is created.
4. Bob's request: the code also skips the if block and executes db.users.create().
What happens next depends on your database.
* Best Case: You have a UNIQUE constraint on the username column. The database throws an integrity violation error on Bob's request, and your server crashes with an unhandled exception.
* Worst Case: You forgot to add a UNIQUE constraint. The database happily creates a second user with the same username. Your application now has corrupt data, leading to all sorts of future bugs, like "which clever_dev is trying to log in?"
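Both outcomes are easy to reproduce against an in-memory SQLite database. This is only a sketch: the one-column users table and the helper second_insert_outcome are assumptions for illustration, and other databases report the violation with their own error types.

```python
import sqlite3


def second_insert_outcome(schema):
    """Insert the same username twice and report what the database did."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.execute("INSERT INTO users (username) VALUES ('clever_dev')")  # Alice
    try:
        conn.execute("INSERT INTO users (username) VALUES ('clever_dev')")  # Bob
    except sqlite3.IntegrityError:
        return "rejected"
    count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    return f"accepted, {count} rows"


print(second_insert_outcome("CREATE TABLE users (username TEXT UNIQUE)"))  # rejected
print(second_insert_outcome("CREATE TABLE users (username TEXT)"))         # accepted, 2 rows
```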
This is the most insidious part. Your unit tests for this logic will pass with flying colors. Why?
* Your tests run one at a time. You have one test, test_registration_succeeds_for_new_user, and another, test_registration_fails_for_existing_user. They will never run concurrently to expose the race condition.
* Your mocks freeze the state. You tell the mock "when findOne is called, return null" for the success test, and "when findOne is called, return a user object" for the failure test. You are explicitly controlling the world, preventing the state from ever changing unexpectedly between the check and the act.
Your tests are validating the logic in an idealized, single-file line. The production environment is a chaotic crowd.
The solution is to stop separating the "check" and the "act." We need to combine them into a single, atomic operation and let the authoritative source of truth (the database, the filesystem) do the work of enforcing uniqueness.
Instead of checking first, just try to perform the action and gracefully handle the failure that occurs if the state isn't what you expected.
Here’s the refactored, robust version of our registration function:
    // The robust version
    app.post('/register', async (req, res) => {
      const { username, password } = req.body;

      try {
        // 1. ACT directly
        const newUser = await db.users.create({ username, password });
        return res.status(201).send(newUser);
      } catch (error) {
        // 2. The "check" is now handling the error from the Act
        if (error.name === 'SequelizeUniqueConstraintError') {
          return res.status(409).send({ error: 'Username already taken.' });
        }
        // For other unexpected errors
        return res.status(500).send({ error: 'Something went wrong.' });
      }
    });
Prerequisite: This code requires a UNIQUE constraint on the username column in your database schema.
Now, when Alice and Bob's requests come in, the first one to execute the create call will succeed. The second one will attempt to create a user with a username that now exists, violating the UNIQUE constraint. The database will reject the operation and throw an error, which our catch block correctly interprets as a "username already taken" conflict.
The check and act are now one atomic database operation.
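A property like "exactly one of N simultaneous registrations wins" can also be tested directly, which is what the sequential unit tests never did. The sketch below is one possible shape for such a test. UserStore is a hypothetical in-memory stand-in for a table with a UNIQUE username column, and register and run_concurrently are made-up helpers, not part of any framework.

```python
import threading


class UserStore:
    """Hypothetical stand-in for a table with a UNIQUE username column."""

    def __init__(self):
        self._names = set()
        self._lock = threading.Lock()

    def create(self, name):
        # The lock makes "insert unless present" a single atomic step.
        with self._lock:
            if name in self._names:
                raise KeyError(name)
            self._names.add(name)


def register(store, name):
    """Act first, then translate a uniqueness failure into a 409."""
    try:
        store.create(name)
        return 201
    except KeyError:
        return 409


def run_concurrently(fn, n):
    """Call fn() from n threads released at the same moment; return all results."""
    barrier = threading.Barrier(n)
    results = [None] * n

    def worker(i):
        barrier.wait()  # hold every thread until all n are ready
        results[i] = fn()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def test_only_one_concurrent_registration_wins():
    store = UserStore()
    statuses = run_concurrently(lambda: register(store, "clever_dev"), 8)
    assert sorted(statuses) == [201] + [409] * 7


test_only_one_concurrent_registration_wins()
```

Pointed at a check-then-act handler, the same assertion would be expected to fail only some of the time, so a harness like this is usually run in a loop or against a real database under load.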
This isn't just about databases. The same principle applies to other systems:
* Files: instead of if (!fileExists(path)) { createFile(path); }, use file open flags like O_CREAT | O_EXCL, which atomically create a file and fail if it already exists.
* Conditional updates: to claim a resource such as a seat, run UPDATE seats SET owner_id = ? WHERE seat_id = ? AND owner_id IS NULL. Then check how many rows were affected. If 0, the seat was already taken. If 1, you got it.
The gap between checking a state and acting on it is a minefield for concurrency bugs. While it seems logical, the "Check-Then-Act" pattern is an anti-pattern in any system that handles more than one request at a time.
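The file and seat examples mentioned above can be sketched in a few lines of Python. The function names create_exclusively and claim_seat, the file name report.lock, and the two-column seats table are assumptions for illustration. SQLite is used here only because it ships with Python.

```python
import os
import sqlite3
import tempfile


def create_exclusively(path):
    """Create path atomically; return False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def claim_seat(conn, seat_id, owner_id):
    """Claim a seat only if nobody owns it; the WHERE clause is the check."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE seats SET owner_id = ? WHERE seat_id = ? AND owner_id IS NULL",
            (owner_id, seat_id),
        )
    return cur.rowcount == 1


path = os.path.join(tempfile.mkdtemp(), "report.lock")
print(create_exclusively(path))  # True
print(create_exclusively(path))  # False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat_id INTEGER PRIMARY KEY, owner_id INTEGER)")
conn.execute("INSERT INTO seats (seat_id, owner_id) VALUES (12, NULL)")
print(claim_seat(conn, 12, 1))  # True
print(claim_seat(conn, 12, 2))  # False
```

In both functions the caller never asks "is it free?" as a separate step. It attempts the action and reads the answer from the result.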
Key takeaway: shifting your mindset from pre-checking to handling failures will not only make your code more robust but will also save you from those head-scratching, "impossible" production bugs that your tests could never find.
If you found this deep dive helpful, please follow for more practical insights into building reliable and scalable software.