By Fahmida Y. Rashid
June 14, 2017
Ransomware is particularly well suited to linguistic analysis because the attack relies on delivering ransom notes that victims can read and understand, says Jon Condra, director of Asia-Pacific Research at Flashpoint. Most other malware campaigns, even spear phishing, don’t lend themselves to this kind of scrutiny because the lure is carefully crafted to look legitimate and to resemble something else.
The analysis starts by collecting all available text. Restricting the data set can lead the analysis down an unexpected path, so it is important to include everything available. For example, the team at Taia looked at 20 messages reported in the media and posted to Pastebin, allegedly from the Sony hackers. Even then, the team noted in the report that the amount of data (fewer than 2,000 words in total) was small.
Mistakes in grammar, spelling, punctuation, verb tense and even word usage can all provide clues. In the case of English, there are certain grammatical errors that native (US) English speakers typically don’t make, such as leaving out the articles “a” and “the,” or omitting words such as “to,” “should,” “must” or “will” in sentences. Another clue is dropping the “-ing” ending where it is needed, as in “they are go” instead of “they are going.” With these clues, the analyst can generate a suspect list of five possible languages, and then check each “oddity” to see which language it aligns with most closely. For example, if “the” is being dropped, that is a good indicator the person is a native speaker of Russian or some other Slavic language. Argamon said the fact that Guccifer 2.0 kept dropping articles during the Twitter interview was evidence the speaker was more likely Russian than Romanian, since Romanian has definite and indefinite articles and Russian does not.
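The article-dropping clue can be illustrated with a rough sketch. This is not the method used by Argamon or by the Sony report's authors. It is a toy example that covers a single feature and only the three languages named in this article. The function names and the 0.04 cutoff are hypothetical choices made for illustration.

```python
import re

# Whether each candidate native language has articles. Only languages
# mentioned in the article are listed; a real analysis would weigh many
# more languages and many more features than this single one.
HAS_ARTICLES = {"Russian": False, "Romanian": True, "Mandarin Chinese": False}

ARTICLES = {"a", "an", "the"}


def article_rate(text):
    """Return the fraction of word tokens that are English articles."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in ARTICLES for token in tokens) / len(tokens)


def consistent_languages(text, threshold=0.04):
    """List candidate languages consistent with the observed article use.

    `threshold` is a hypothetical cutoff, not a calibrated value.
    Normal article use is treated as no evidence either way, so every
    candidate stays on the list in that case.
    """
    if article_rate(text) < threshold:
        # Articles look dropped: keep only languages that lack them.
        return sorted(lang for lang, has in HAS_ARTICLES.items() if not has)
    return sorted(HAS_ARTICLES)


note = "We have your data. Pay money or we release files to public."
print(consistent_languages(note))
```

On a sample this short the result means very little, which is the same limitation the Sony analysts noted about their own small data set.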
The more errors or language features that can be identified, the more thorough the analysis. The Sony report showed 25 different elements.
However, analyzing language isn’t all that straightforward, since people can speak multiple languages with different levels of proficiency. In the case of someone who is a native speaker of Mandarin Chinese but learned to hack from the Russians, learning Russian along the way, and carried out an attack in English, there may be “more features from the L2 [the second language learned] than the L1 [the native language] leaking through in writing the L3 [the third language being used],” Argamon says.
Context matters. The clues may point to a Russian speaker, but if there’s reason to believe the attackers are Chinese, then it could very well be a Chinese attacker who was trained by the Russians. Linguistic analysis helps triangulate the evidence from other research paths, such as source code evidence and network forensics.