Two String Anagram
Introduction
Two words are anagrams when they share the same multiset of letters — triangle and integral, Madam Curie and Radium came. The concept powers Scrabble dictionaries, word games, and a surprisingly useful family of interview questions that all boil down to the same move: stop thinking about the order of the characters and start thinking about how often each one appears.
Two String Anagram is the gateway to that family. It is usually labeled “Easy,” but the gap between a candidate who writes an O(n log n) sort and one who writes a single O(n) pass with a twenty-six-entry array is enormous, and the gap is revealed in about three lines of code. It also hides a surprisingly stubborn trap: a perfectly reasonable-looking “just sort both strings” approach that is tempting, compiles, passes the obvious examples, and is wrong.
When a problem asks about anagrams, stop comparing strings and start comparing histograms.
This chapter walks all three approaches the book covers — brute force, sorting, and histograms — and uses the middle one to make a point that outlives the problem: a structural insight (“sorting normalizes”) is not automatically a correct algorithm. Spotting that gap is exactly the kind of reasoning senior interviewers are scoring you on.
What the interviewer is actually testing
The rubric on a typical Two String Anagram interview scores four things. Each section below is explicitly organized to deliver them:
- Did you reframe the problem as a frequency question? Candidates who keep thinking in terms of permutations get stuck in O(n!). Candidates who say the word “histogram” or “frequency array” within the first minute have already sent a major signal.
- Did you exploit the small alphabet? Twenty-six lowercase letters is O(1) space. A solution that uses a `std::unordered_map<char, int>` is correct but reveals you have not internalized that the alphabet is bounded.
- Did you produce the correct answer — not a plausible one? The sorting approach below is an interview trap. If you propose it, you must also notice that it does not solve the stated problem.
- Did you handle unequal lengths cleanly? A one-line guard, stated up front.
Problem statement
You are given two strings a and b of lengths n and m. Return the minimum number of single-character substitutions on a that turn it into an anagram of b. If no such sequence of substitutions exists, return -1.
A substitution replaces the i-th character of a with any character c. No insertions, no deletions. Two strings are anagrams when one can be rearranged to equal the other.
Examples:

- a = "aaa", b = "bbb" → 3
- a = "tear", b = "fear" → 1
- a = "Protectional", b = "Lactoprotein" → 0

Clarifying questions — and why each one matters
Two String Anagram is short enough that interviewers will push on your clarification discipline just to see whether you have it. Ask these three in your first sixty seconds.
Clarification questions
- Q Is the alphabet restricted? Lowercase English only, Unicode, anything?
- A The small-alphabet answer is the optimal solution. If the interviewer says lowercase English, you know you can use a 26-entry integer array — O(1) space. If they say Unicode, you need `unordered_map<char32_t, int>` and your space claim weakens to O(k) where k is the number of distinct code points seen. The answer shapes the code you write, not just the commentary.
- Q Is the comparison case-sensitive?
- A Almost always no; the canonical phrasing assumes lowercase input. If the interviewer says case-insensitive, lowercase both strings up front so the rest of the algorithm does not need to think about case.
- Q What should I return when the strings have different lengths?
- A Return `-1`. No number of substitutions can change a string's length, so unequal lengths make the problem infeasible. State this guard out loud before you write the algorithm — it is a free signal that you thought about feasibility.
A mental framework for anagram-shape problems
Whenever a problem says “anagram,” “permutation,” “rearrange,” or “same characters in any order,” resist the urge to enumerate orderings. Instead, walk this three-step ladder:
- Translate to multisets. Anagrams are equal multisets of characters. Any claim about an anagram is a claim about frequencies, not positions.
- Pick the right frequency container. A bounded alphabet (ASCII, lowercase English, digits) means an array. An unbounded alphabet (Unicode code points, arbitrary tokens) means a hash map. The first gives O(1) space; the second gives O(k).
- Turn the question into arithmetic on the frequency vector. “Are these anagrams?” becomes “are the frequency vectors equal?” “How many edits?” becomes “what is the sum of the positive components of their difference?” This step is where the actual interview move happens.
Every approach below is a variation on how literally it obeys this ladder. Brute force ignores it. Sorting obeys step 1 but stops short of step 3. Histograms execute all three.
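As a concrete instance of the ladder, here is a minimal sketch for the lowercase-English case. The helper names `freq` and `min_substitutions` are illustrative, not the book's solution code:

```cpp
#include <array>
#include <string>

// Step 2: bounded alphabet, so the frequency container is a 26-entry array.
std::array<int, 26> freq(const std::string& s) {
    std::array<int, 26> f{};
    for (char c : s) {
        ++f[c - 'a'];
    }
    return f;
}

// Step 3: "are these anagrams?" is equality of frequency vectors, and
// "how many substitutions?" is the sum of positive components of the difference.
int min_substitutions(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;  // feasibility guard
    const auto fa = freq(a);
    const auto fb = freq(b);
    int sum_positive = 0;
    for (int c = 0; c < 26; ++c) {
        if (fa[c] > fb[c]) sum_positive += fa[c] - fb[c];
    }
    return sum_positive;
}
```

The two functions map one-to-one onto steps 2 and 3 of the ladder; step 1 is the decision to write `freq` at all.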
Discussion
The feasibility guard, up front
All three approaches share a single line:
```cpp
if (a.length() != b.length()) return -1;
```

State this out loud before you start comparing characters. It eliminates the only infeasible branch and lets the rest of the solution assume equal length.
Brute force — try every permutation of a
Enumerate every arrangement of a, count the character-by-character difference against b, take the minimum.
The interview narration
“A first, obviously correct idea: generate every permutation of a, count positions where it differs from b, and return the smallest. I would not ship this — O(n! · n) is untenable even for n = 13 — but I’m writing it to anchor what the optimal must achieve.”
Naming brute force as the baseline and immediately disqualifying it is a deliberate interview move: you prove you see the state space before you prune it.
The code
```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>

int count_different_letters(const std::string& a_perm, const std::string& b) {
  assert(a_perm.size() == b.size());
  int count = 0;
  for (size_t i = 0; i < a_perm.length(); ++i) {
    if (a_perm[i] != b[i]) {
      ++count;
    }
  }
  return count;
}

int solution_brute_force(const std::string& a, const std::string& b) {
  if (a.length() != b.length()) {
    return -1;
  }

  std::string a_perm(a);
  std::sort(a_perm.begin(), a_perm.end());

  int ans = std::numeric_limits<int>::max();
  do {
    ans = std::min(ans, count_different_letters(a_perm, b));
    if (ans == 0) {
      break;
    }
  } while (std::next_permutation(a_perm.begin(), a_perm.end()));

  return ans;
}
```

Generate permutations of `a` in sorted order via `std::next_permutation`; for each, count mismatches against `b`; early-exit if we hit zero.
Why it is O(n! · n)
There are n! permutations of an n-character string (fewer with repeats, but the upper bound is n!), and each permutation costs O(n) to compare. The early exit on a perfect match saves real work only when a and b are already anagrams, so the worst case is genuinely O(n! · n). For n = 13 that is already more than six billion permutations — well past any interview’s tolerance.
What this approach is good for
Nothing in production. In the interview, it exists to be named, dismissed, and used as a foil: the optimal must not need to enumerate orderings.
Sort both strings and count positional mismatches
Sort a and b, then count positions where the sorted versions disagree — a tempting reframing that produces the wrong answer on the original problem.
The tempting reframing
“If anagrams are equal under sorting, then once I sort both strings, I can just count mismatched positions — that is how many characters I have to change.”
This is half right. Sorting is the normalization trick for detecting anagrams: sort(a) == sort(b) is a correct anagram test. But the book’s stated problem is not detection, it is minimum substitutions, and those two questions come apart on inputs with cascading mismatches.
The code, as the book presents it
```cpp
#include <algorithm>
#include <string>

int count_different_letters(const std::string& a_perm, const std::string& b);

int solution_sorting(const std::string& a, const std::string& b) {
  if (a.length() != b.length()) {
    return -1;
  }

  std::string aa(a);
  std::string bb(b);

  std::sort(aa.begin(), aa.end());
  std::sort(bb.begin(), bb.end());

  return count_different_letters(aa, bb);
}
```

Sort `aa` and `bb`, then reuse `count_different_letters` from the brute-force approach to tally positional mismatches on the sorted strings.
What the sorting approach does correctly solve
Three useful variants:
- Is a an anagram of b? (Yes iff `sort(a) == sort(b)`.) Tight O(n log n).
- How many substitutions if deletions and insertions are free? The sorted positional-mismatch count does equal this alternative metric — but it is not the one this problem asks for.
- Listing the characters in common. A merge-like pass over the two sorted strings extracts them without building a histogram.
None of those is the stated problem. So we keep moving.
Complexity, precisely
- Time: O(n log n) for the two sorts plus O(n) for the scan.
- Space: O(n) for the copies of the input (the inputs are immutable by clarification).
Histogram — count frequencies, sum the positive differences (Recommended)
Walk both strings once, maintaining a 26-entry signed difference vector. The answer is the sum of its positive entries.
The insight, stated cleanly
Let F_a[c] and F_b[c] be the number of occurrences of character c in a and b respectively. Define the difference vector D[c] = F_a[c] − F_b[c].
Two facts:
- Because |a| = |b|, the entries of D sum to zero: every surplus in a is matched by an equal deficit. So Σ_c D[c] = 0, and the sum of the positive entries equals the absolute value of the sum of the negative entries.
- A substitution on a changes exactly two entries of D — decrementing one by 1 (the character we removed) and incrementing another by 1 (the character we added). One well-chosen substitution knocks one unit off a surplus and one unit off a deficit simultaneously.
Put those together: the minimum number of substitutions is exactly the sum of the positive entries of D — call it S⁺.
D[c] = F_a[c] − F_b[c] is the only data the algorithm needs; the answer is the sum of its positive entries, S⁺. A worked example makes the algebra concrete. Take a = "tear", b = "fear": each string contributes one e, one a, and one r, which cancel, so the only nonzero entries are D[t] = +1 and D[f] = −1. S⁺ = 1, and the single substitution t → f does the job. Sorting instead gives "aert" versus "aefr", which disagree in two positions, so the positional-mismatch count reports 2. This is the famous sorting trap: the histogram says 1; sorting says 2. On a = "aaa", b = "bbb" the two methods happen to agree (both say 3), which is exactly why the trap survives casual testing.
The code
```cpp
#include <array>
#include <string>

int solution_histogram(const std::string& a, const std::string& b) {
  if (a.length() != b.length()) {
    return -1;
  }

  std::array<int, 26> D = {0};
  for (size_t i = 0; i < a.size(); ++i) {
    D[a[i] - 'a']++;
    D[b[i] - 'a']--;
  }

  int ans = 0;
  for (const int x : D) {
    if (x > 0) {
      ans += x;
    }
  }
  return ans;
}
```

Walk both strings in a single pass, incrementing for `a` and decrementing for `b`. Return the sum of the positive entries of the resulting signed difference vector.
Notice the book-author’s micro-optimization: we never need F_a and F_b separately. A single signed array plays both roles. The increment for a’s characters and the decrement for b’s characters are doing bookkeeping for both histograms in one pass.
Correctness argument
We show that the answer is exactly S⁺, the sum of the positive entries of D.
(≥) Every substitution reduces S⁺ by at most 1. A substitution replaces some character x with some character y in a. How D changes:
- D[x] decreases by 1 and D[y] increases by 1.
- If x was a surplus (D[x] > 0) and y was a deficit (D[y] < 0), then S⁺ drops by exactly 1 (the x entry loses a unit of surplus; the y entry still contributes 0 to S⁺ if it is now ≤ 0).
- Any other choice changes S⁺ by 0 or +1 — it wastes an operation or makes things worse.
So at best each substitution reduces S⁺ by 1. Since we must drive S⁺ to 0 to become an anagram, we need at least S⁺ substitutions.
(≤) S⁺ substitutions suffice. Pair each unit of surplus with one unit of deficit and execute the substitution that exchanges them. Every pairing is legal because Σ_c D[c] = 0 guarantees that for every unit of positive surplus there is a unit of negative deficit to absorb it. After S⁺ such substitutions, D is identically zero, so a and b share a multiset — an anagram.
Combining the two directions, S⁺ is both a lower bound and achievable, so it is the answer.
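If you want confidence beyond the proof, a brute-force cross-check on short strings is cheap. This sketch (the function names are mine, not the book's) compares the histogram answer against exhaustive permutation search:

```cpp
#include <algorithm>
#include <array>
#include <string>

// Histogram answer: sum of positive entries of the signed difference vector.
int histogram_answer(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return -1;
    std::array<int, 26> D{};
    for (size_t i = 0; i < a.size(); ++i) {
        ++D[a[i] - 'a'];
        --D[b[i] - 'a'];
    }
    int s = 0;
    for (int x : D) {
        if (x > 0) s += x;
    }
    return s;
}

// Exhaustive answer: best positional mismatch count over all permutations of a.
// Only viable for tiny strings, which is exactly what a cross-check needs.
int exhaustive_answer(std::string a, const std::string& b) {
    if (a.size() != b.size()) return -1;
    std::sort(a.begin(), a.end());
    int best = static_cast<int>(a.size());
    do {
        int mismatches = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            if (a[i] != b[i]) ++mismatches;
        }
        best = std::min(best, mismatches);
    } while (std::next_permutation(a.begin(), a.end()));
    return best;
}
```

Running both on a handful of short inputs and asserting agreement is a thirty-second sanity check that the sorting approach would fail.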
Edge cases to call out explicitly
| a | b | Expected | Why it works |
|---|---|---|---|
| `""` | `""` | 0 | Both histograms are empty; nothing to reconcile. |
| `"a"` | `"ab"` | -1 | Length mismatch — caught by the guard before the histogram runs. |
| `"abc"` | `"cba"` | 0 | Already an anagram; D is the zero vector. |
| `"aaa"` | `"bbb"` | 3 | D[a] = +3, D[b] = −3; sum of positives is 3. |
| `"tear"` | `"fear"` | 1 | D[t] = +1, D[f] = −1; histogram gets this right where sorting overcounts. |
| `"aabb"` | `"bbcc"` | 2 | Two surplus a's trade for two deficit c's. |
Complexity, precisely
- Time: O(n). One pass over each string. There is no sort, no permutation enumeration, nothing you could remove and still be correct.
- Space: O(1). A 26-entry integer array is a constant regardless of input length. If the alphabet is Unicode, swap the array for an `unordered_map<char32_t, int>` and the bound becomes O(k), where k is the number of distinct code points observed. The algorithm is unchanged.
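The Unicode swap can be sketched as follows, assuming the inputs arrive as sequences of `char32_t` code points (decoding UTF-8 into code points is out of scope here):

```cpp
#include <string>
#include <unordered_map>

// Same delta-vector algorithm; the 26-entry array becomes a hash map.
// Space is O(k) in distinct code points seen; time stays O(n) amortized.
int solution_histogram_unicode(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return -1;
    std::unordered_map<char32_t, int> D;
    for (size_t i = 0; i < a.size(); ++i) {
        ++D[a[i]];  // operator[] default-constructs missing entries to 0
        --D[b[i]];
    }
    int ans = 0;
    for (const auto& [cp, x] : D) {
        if (x > 0) ans += x;
    }
    return ans;
}
```

Nothing about the correctness argument changes; only the container holding D does.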
Approach comparison
| Approach | Time | Space | When would you ship it? |
|---|---|---|---|
| Brute force | O(n! · n) | O(n) | Never. It exists as an anchor to prove you see the permutation space. |
| Sorting | O(n log n) | O(n) | Only if the problem is anagram detection, not substitution count. |
| Histogram | O(n) | O(1) | Always, when the alphabet is bounded. This is the answer. |
One axis on which you might stay with sorting: if the interviewer changes the problem to “are these anagrams, yes or no?” and you want one line of code (return sort(a) == sort(b);), the sorted compare is perfectly idiomatic. As soon as the question is about how many edits, go back to the histogram.
Follow-ups to anticipate
| The interviewer asks… | What to do |
|---|---|
| “What if insertions and deletions are allowed too?” | The problem becomes an edit-distance variant. The tight answer is now the sum of the absolute entries of D, divided by 2 — and you should argue why, since each insertion-deletion pair is symmetric. |
| “What if the alphabet is Unicode?” | Swap the `std::array<int, 26>` for an `unordered_map<char32_t, int>`. Walk both strings once, increment for a, decrement for b. Sum positive values. Space becomes O(k) in distinct code points; time is still O(n) amortized. |
| “Can you return which substitutions, not just the count?” | Walk twice: collect surplus characters (with multiplicity) into a list S, deficit characters into a list T. Scan left to right; whenever you see a character in S that still has capacity, emit a substitution to the next T entry. |
| “What if the input is streamed and you cannot re-read it?” | Keep two 26-entry counters, one per stream, and compute D only at the end. Still O(1) space, still O(n) time, but you cannot use the “single-array delta” trick unless you have both streams interleaved character-for-character. |
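The “which substitutions” follow-up can be sketched like this; returning (position, replacement) pairs is my choice of output format, and `substitutions` is an illustrative name:

```cpp
#include <array>
#include <string>
#include <utility>
#include <vector>

// Emit one (position, replacement) pair per needed substitution.
// A position i needs editing when a[i] is still a surplus character;
// it is paired with the next character still in deficit.
std::vector<std::pair<size_t, char>> substitutions(const std::string& a,
                                                   const std::string& b) {
    std::vector<std::pair<size_t, char>> edits;
    if (a.size() != b.size()) return edits;  // infeasible: empty plan

    std::array<int, 26> D{};
    for (size_t i = 0; i < a.size(); ++i) {
        ++D[a[i] - 'a'];
        --D[b[i] - 'a'];
    }

    // Collect deficit characters with multiplicity (the list T from the table).
    std::string deficits;
    for (int c = 0; c < 26; ++c) {
        for (int k = D[c]; k < 0; ++k) {
            deficits.push_back(static_cast<char>('a' + c));
        }
    }

    size_t next = 0;
    for (size_t i = 0; i < a.size() && next < deficits.size(); ++i) {
        const int c = a[i] - 'a';
        if (D[c] > 0) {  // a[i] still has surplus capacity
            edits.emplace_back(i, deficits[next++]);
            --D[c];
        }
    }
    return edits;
}
```

The plan has exactly S⁺ entries, and applying it yields an anagram of b, matching the correctness argument.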
Pattern recognition — the real lesson
The transferable pattern is character-frequency as a first-class data structure. It appears whenever the problem asks about equality-up-to-reordering, and the way to surface it in an interview is to say, out loud, “the order of the characters does not affect the answer, only the multiset, so let me count.” Other chapters in this book that live on the same pattern:
- Two Sum — not a frequency problem, but the same move: replace a search with a lookup.
- 3Sum — the deduplication step is a frequency argument in disguise (count how many of each triplet type you have emitted).
- Coin Change — the DP state is a frequency-like bookkeeping over which amounts you have reached.
- Majority Element (forthcoming) — the entire problem is “which element has the highest frequency.”
- Longest Consecutive Sequence (forthcoming) — uses a hash set as a frequency indicator (1 if present, 0 if not).
The common thread: when the problem is invariant under reordering, the right representation is not the sequence — it is the histogram.
What to say out loud — the delivery script
The interviewer expects a structured narration, not stream of consciousness. Here is a script for a 45-minute interview:
- “Before I write code, a few clarifications. Alphabet? Case sensitivity? And what should I return when the strings have different lengths?”
- “A string can only become an anagram of another with substitutions if both have the same length, so I will start with a length guard that returns -1 otherwise.”
- “The obvious brute force is to permute a and count mismatches to b. That is O(n! · n) — I would not ship it, but I want to state the baseline before pruning it.”
- “A natural refinement is to sort both and count positional mismatches. I want to pause on this: it is a famous trap. Let me show you with `tear` and `fear` why the sorting count overestimates, and why the real problem is not about positions, but about frequencies.”
- “The correct formulation is to count how many of each character each string has, then measure the difference. That takes me to the histogram.”
- “Here’s the key observation: because the strings are the same length, the signed difference vector sums to zero, and each substitution reduces the sum of its positive entries by at most one. So the answer is exactly that sum.”
- “The code is an O(n) pass incrementing a 26-entry array for a and decrementing for b, plus a second pass over the array to sum positives. Total time O(n), space O(1).”
- “Let me test it: on `aaa` / `bbb` I get 3, on `tear` / `fear` I get 1, on `Protectional` / `Lactoprotein` lowercased I get 0. All three match the specification.”
- “Edge cases I explicitly handle: empty strings, unequal lengths, already-anagram inputs. I am confident in the complexity and the correctness argument.”
- “If the alphabet changes to Unicode, I swap the array for a hash map; the algorithm is identical. Otherwise, this is what I would ship.”
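The `Protectional` / `Lactoprotein` test in the script assumes case-insensitive comparison, which the main solution does not handle. A wrapper sketch (the name `solution_ci` and the lowercasing front door are mine, not the book's):

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Case-insensitive front door: lowercase both inputs, then run the
// histogram algorithm unchanged on the normalized strings.
int solution_ci(std::string a, std::string b) {
    auto lower = [](std::string& s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return std::tolower(c); });
    };
    lower(a);
    lower(b);

    if (a.size() != b.size()) return -1;
    std::array<int, 26> D{};
    for (size_t i = 0; i < a.size(); ++i) {
        ++D[a[i] - 'a'];
        --D[b[i] - 'a'];
    }
    int ans = 0;
    for (int x : D) {
        if (x > 0) ans += x;
    }
    return ans;
}
```

The `unsigned char` cast before `std::tolower` avoids undefined behavior on negative `char` values; the core algorithm never needs to know normalization happened.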
Summary
Two String Anagram is an anagram problem masquerading as a string problem. The only representation that matters is the 26-entry frequency vector, and the only piece of arithmetic that matters is the sum of its positive entries. The sorting approach that looks plausible is subtly wrong — and noticing that, out loud, is exactly the kind of reasoning that separates a hire from a pass.
Once you have internalized “sort gives you detection; histogram gives you the count,” the move transfers to every other anagram-family problem you will see this week.