Malicious Web & Document Files: Phishing & Drive-By Downloads
The phishing and exploit-kit landscape. Drive-by download mechanics. Weaponised PDFs (/OpenAction, /JS), RTF (Equation Editor), Office macros (VBA, P-Code), and JavaScript de-obfuscation techniques.
Why documents and websites became the dominant delivery vector
Antivirus engines became efficient at scanning executables. Application whitelisting blocked unknown EXEs. Network filters caught suspicious downloads. So attackers shifted: instead of making the victim run an executable, they make the victim open a document or visit a page. The document or page contains code that exploits a vulnerability or social-engineers the user into authorising the next stage.
Today, a large share of initial-access compromises begin with one of three vectors: a malicious document attached to a phishing email, a drive-by download from a compromised website, or a malicious link delivered via collaboration tools. This lesson covers all three.
Drive-by downloads
A drive-by download is an attack in which visiting a web page is sufficient to infect the visitor's machine. The user does not click anything or download a file manually. The page contains code (typically in a hidden <iframe> or injected <script> tag) that exploits a browser or plugin vulnerability to execute a payload.
The typical infection chain has three stages:
- The victim visits a legitimate website that has been compromised through a stolen FTP credential, a vulnerable CMS plugin, or a malicious advertisement injected into the ad network. The attacker has placed a hidden <iframe> pointing to an exploit kit (EK) server.
- The victim's browser silently loads the iframe content from the EK server. The EK fingerprints the browser version, installed plugins (Flash, Java, Silverlight, in their day), and OS version.
- If a matching exploit exists in the EK's arsenal, the EK delivers a payload (typically a dropper or RAT) that executes without user interaction.
Exploit kits were server-side frameworks that automated this — fingerprinting visitors, selecting an appropriate exploit, delivering the payload, tracking infection statistics. Notable examples included Angler (peaked 2015), RIG, and Magnitude. The EK era declined as Flash and Java were retired from browsers, but the architecture survives in modern browser-based attacks via JavaScript and WebAssembly.
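The hidden <iframe> from the first stage is often visible in the static HTML of a compromised page. The following is a minimal triage sketch, assuming the page source has already been saved as text; the function name, the "tiny or hidden" criteria, and the example hosts are illustrative assumptions, not a standard detection rule. Frames injected at runtime by script will not appear in static HTML and need a browser-based inspection instead.

```python
from html.parser import HTMLParser
import re

# Inline styles that hide an element (illustrative, not exhaustive).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class HiddenIframeFinder(HTMLParser):
    """Collects the src of iframes that are zero-sized or styled as hidden."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        # HTMLParser lowercases attribute names; valueless attributes arrive as None.
        a = {name: (value or "") for name, value in attrs}
        tiny = any(
            a.get(dim, "").strip().rstrip("px") in {"0", "1"}
            for dim in ("width", "height")
        )
        hidden = bool(HIDDEN_STYLE.search(a.get("style", ""))) or "hidden" in a
        if tiny or hidden:
            self.hits.append(a.get("src", ""))


def find_hidden_iframes(html: str) -> list[str]:
    finder = HiddenIframeFinder()
    finder.feed(html)
    return finder.hits


if __name__ == "__main__":
    # Hypothetical page fragment; ek.example is a placeholder host.
    page = '<iframe src="http://ek.example/gate" width="0" height="0"></iframe>'
    print(find_hidden_iframes(page))
```

A hit is only a lead: legitimate analytics and advertising code also uses invisible frames, so the src host still has to be investigated.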
Malicious URL techniques
Attackers craft deceptive URLs to trick users into visiting malicious infrastructure:
- Typosquatting: the attacker registers domains with minor spelling differences, such as gogle.com or amaz0n.com. Compare character by character; check the WHOIS registration date.
- Homograph (IDN) attacks: Latin characters are replaced with visually identical Cyrillic or Greek characters. The underlying hostname is the Punycode form (xn--...), which browsers show in the address bar when they judge a label to be deceptive; any xn-- URL warrants scrutiny.
- Subdomain abuse: the real domain is the rightmost portion of the hostname. paypal.com.evil-site.net is owned by evil-site.net, not PayPal.
- URL shorteners: these hide the final destination. Use urlexpand.com or unfurl.io to resolve before clicking.
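The first three checks can be roughed out in code. This is a sketch under stated assumptions: the brand watch list is hypothetical, the 0.8 similarity threshold is an arbitrary starting point, and the registrable domain is approximated as the last two labels, which is wrong for suffixes such as co.uk (a real implementation would consult the Public Suffix List).

```python
import difflib
from urllib.parse import urlsplit

# Hypothetical watch list: brand label -> its genuine registrable domain.
BRANDS = {"google": "google.com", "amazon": "amazon.com", "paypal": "paypal.com"}


def naive_registrable_domain(host: str) -> str:
    """Last two labels of the hostname (approximation, see the lead-in)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def url_flags(url: str) -> list[str]:
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    registrable = naive_registrable_domain(host)
    sld = labels[-2] if len(labels) >= 2 else ""
    flags = []

    # Homograph hint: any Punycode-encoded label.
    if any(label.startswith("xn--") for label in labels):
        flags.append("punycode")

    for brand, real_domain in BRANDS.items():
        if registrable == real_domain:
            continue  # genuinely the brand's own domain
        # Subdomain abuse: brand name appears left of the registrable domain.
        if brand in labels[:-2]:
            flags.append(f"subdomain-abuse:{brand}")
        # Typosquatting hint: second-level label is close to the brand name.
        similarity = difflib.SequenceMatcher(None, sld, brand).ratio()
        if sld != brand and similarity >= 0.8:
            flags.append(f"lookalike:{brand}")
    return flags


if __name__ == "__main__":
    for candidate in ("https://gogle.com", "https://paypal.com.evil-site.net/login"):
        print(candidate, url_flags(candidate))
```

An empty result does not mean a URL is safe; the flags only prioritise which links deserve a manual look.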
Weaponised PDFs
PDF files are deceptively powerful. They support embedded JavaScript, embedded files, automatic action triggers, and rich interactive forms — all useful for legitimate workflows, all abusable.
The two PDF-specific abuse vectors:
- /OpenAction entry: a key in the document catalog whose action fires automatically when the document is opened. An attacker places JavaScript or an embedded launch action here. The user opens the PDF; the action runs without further interaction.
- /JS (JavaScript) blocks: Adobe Reader and many other PDF readers execute JavaScript embedded in PDFs. Combined with an /OpenAction trigger, the script runs as soon as the document opens, which is the usual entry point for an exploit against the reader.
A PDF polyglot is a file that is simultaneously a valid PDF and a valid file of another format, typically a JAR or ZIP. The PDF reader sees a PDF; a second parser, such as the Java runtime or an archive tool, sees its own format in the same bytes. Polyglots evade format-based filtering.
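A rough way to spot the PDF/ZIP case is to ask two parsers about the same bytes. The sketch below assumes the file has been read into memory. Its PDF test is deliberately loose (a %PDF- marker in the first 1024 bytes, since some readers tolerate leading junk), and its ZIP test relies on the standard library finding a ZIP end-of-central-directory record. The function names are illustrative.

```python
import io
import zipfile


def looks_like_pdf(data: bytes) -> bool:
    # Loose check: header marker anywhere in the first 1024 bytes.
    return b"%PDF-" in data[:1024]


def looks_like_zip(data: bytes) -> bool:
    # is_zipfile looks for the end-of-central-directory record, so it still
    # succeeds when other data has been prepended to the archive.
    return zipfile.is_zipfile(io.BytesIO(data))


def is_pdf_zip_polyglot(data: bytes) -> bool:
    return looks_like_pdf(data) and looks_like_zip(data)


if __name__ == "__main__":
    # Build a harmless demonstration file: a PDF stub followed by a tiny ZIP.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hi")
    stub = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n%%EOF\n"
    print(is_pdf_zip_polyglot(stub + buf.getvalue()))
```

A JAR is a ZIP archive with a manifest, so the same check covers the JAR case; other format pairings need their own second parser.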
Tools: pdfid and pdf-parser (Didier Stevens) for static structural analysis; peepdf for interactive PDF exploration.
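To make the idea behind keyword triage concrete, here is a minimal sketch that counts risky names in the raw bytes of a PDF. It is not pdfid and does not reproduce its output; the keyword list and function names are assumptions chosen for illustration. It undoes the #xx hex escapes that PDF names allow (so /J#53 is counted as /JS), but it does not look inside compressed object streams, where keywords can also hide.

```python
import re

# Names worth a second look (illustrative subset).
KEYWORDS = [b"/OpenAction", b"/AA", b"/JS", b"/JavaScript", b"/Launch", b"/EmbeddedFile"]


def decode_name_escapes(data: bytes) -> bytes:
    """Replace PDF name hex escapes such as #53 with the byte they encode.

    Applied to the whole file for simplicity, so binary streams may be
    altered too; acceptable for counting, not for parsing.
    """
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def count_keywords(data: bytes) -> dict[str, int]:
    normalised = decode_name_escapes(data)
    counts = {}
    for keyword in KEYWORDS:
        # The lookahead stops /AA from matching a longer name such as /AAPL.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode()] = len(re.findall(pattern, normalised))
    return counts


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
        b"2 0 obj\n<< /S /JavaScript /J#53 (app.alert(1);) >>\nendobj\n"
    )
    for name, n in count_keywords(sample).items():
        print(f"{name:15} {n}")
```

A non-zero /OpenAction count together with /JS or /JavaScript is the combination described above and is the cue to extract the script for closer reading.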
Weaponised RTF
RTF (Rich Text Format) was once a benign Microsoft text format. It became a major attack vector because of Equation Editor — an old Microsoft component (EQNEDT32.EXE) bundled with Office. CVE-2017-11882 and CVE-2018-0802 were memory corruption vulnerabilities in Equation Editor; an RTF document containing a crafted equation object would crash Equation Editor in a way that yielded code execution.
Microsoft eventually removed Equation Editor from Office through its January 2018 security updates. But RTF remains an attractive vector because:
- Many security tools treat RTF as low-risk text.
- RTF can embed OLE objects, including ActiveX controls.
- Antivirus parsers struggle with RTF's permissive syntax — multiple ways exist to encode the same content, defeating signatures.
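The points above suggest a few cheap indicators to check before reaching for a full parser. The sketch below is a heuristic under stated assumptions: it looks for the \object and \objdata control words and for the hex-encoded class name Equation.3 inside the file, ignoring whitespace that RTF allows between hex digits. The function name and the choice of indicators are illustrative, and heavier obfuscation will defeat it, which is the very weakness described in the last bullet.

```python
import re

# Hex form of the OLE class name, as it would appear inside \objdata.
EQUATION_HEX = b"Equation.3".hex().encode()


def rtf_indicators(data: bytes) -> dict[str, bool]:
    lowered = data.lower()
    # Keep only hex digits so line breaks or spaces inside the object data
    # cannot split the class name.
    hex_only = re.sub(rb"[^0-9a-f]", b"", lowered)
    return {
        # Lenient prefix check rather than the full {\rtf1 header.
        "rtf_header": lowered.lstrip().startswith(b"{\\rt"),
        "embedded_object": b"\\object" in lowered,
        "objdata": b"\\objdata" in lowered,
        "equation_class": EQUATION_HEX in hex_only,
    }


if __name__ == "__main__":
    benign = b"{\\rtf1\\ansi Hello, world.}"
    print(rtf_indicators(benign))
```

An embedded object is not malicious in itself; the value of the check is in refusing to treat an RTF file as plain text when it carries one.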
Microsoft Office macros
The textbook vector. Office documents support VBA macros — full programs that run inside Word, Excel, or PowerPoint. Malicious documents persuade the user (via a fake "this document was created in a newer version, click Enable Content to view") to enable macros. The macro then executes whatever the attacker wrote.
Common macro patterns:
- AutoOpen / Document_Open / Workbook_Open: VBA procedures that fire automatically when the document opens (after the user enables macros).
- Shell(): VBA function that spawns a process. Often used to invoke powershell.exe or cmd.exe with arguments to download a stage-2 payload.
- VBA Stomping (P-Code attacks): the source code in the file may be benign decoy text, while the compiled P-Code is the actual malicious logic. The mismatch defeats source-code-only analysis.
Tools: oletools suite (Philippe Lagadec) — olevba extracts macros, oleid summarises document risk, mraptor flags suspicious patterns.
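Once macro source has been extracted to text, the patterns above can be triaged with a few regular expressions. The sketch below is loosely modelled on the idea of combining an auto-execution trigger with a write or execute capability; it is not mraptor's implementation, and the keyword lists and function name are illustrative assumptions. Because it reads source text only, it says nothing about stomped documents whose P-Code differs from the source.

```python
import re

# Entry points that run without an explicit call (illustrative subset).
AUTOEXEC = re.compile(
    r"\b(AutoOpen|AutoExec|AutoClose|Auto_Open|Document_Open|Workbook_Open)\b", re.I
)
# Ways of starting a process or reaching one through COM.
EXECUTE = re.compile(r"\b(Shell|CreateObject|powershell|cmd\.exe)\b", re.I)
# Ways of writing or downloading a file.
WRITE = re.compile(
    r"\b(URLDownloadToFile|SaveToFile|FileCopy)\b|\bFor\s+(Output|Binary|Append)\b",
    re.I,
)


def triage_macro(vba_source: str) -> dict[str, bool]:
    autoexec = bool(AUTOEXEC.search(vba_source))
    execute = bool(EXECUTE.search(vba_source))
    write = bool(WRITE.search(vba_source))
    return {
        "autoexec": autoexec,
        "execute": execute,
        "write": write,
        # Flag only the combination: runs by itself AND can act on the system.
        "suspicious": autoexec and (execute or write),
    }


if __name__ == "__main__":
    sample = 'Sub AutoOpen()\n    Shell "powershell.exe -nop -w hidden", 0\nEnd Sub\n'
    print(triage_macro(sample))
```

String obfuscation (Chr() concatenation, reversed strings) hides these keywords, so a clean result on obfuscated source should not be read as benign.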
JavaScript de-obfuscation
Both browser exploits and embedded document scripts arrive heavily obfuscated. Two practical techniques:
The eval-interception principle
Most obfuscated JavaScript ends with eval(decoded) or a functionally equivalent dynamic execution call. Replace eval with console.log or document.write and run the obfuscated code in an isolated analysis environment. The decoding routine still executes, but the decoded payload is printed to the console instead of being executed.
// Original obfuscated:
eval(function(p,a,c,k,e,d){...}(...));
// Modified for analysis:
console.log(function(p,a,c,k,e,d){...}(...));
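When many samples need the same treatment, the substitution can be scripted before the file is handed to an isolated JavaScript runtime. The helper below is a minimal sketch with a hypothetical name; it performs only the text replacement and does not run any JavaScript. It handles direct eval( calls only. Indirect forms such as window.eval, the Function constructor, or setTimeout with a string argument are left untouched and need manual handling.

```python
import re

# Direct calls only: eval must not be preceded by an identifier character,
# '$', or '.', so names like medieval( and obj.eval( are not rewritten.
DIRECT_EVAL = re.compile(r"(?<![\w$.])eval\s*\(")


def neutralise_eval(js_source: str, sink: str = "console.log") -> str:
    """Return the script with direct eval( calls redirected to a print sink."""
    return DIRECT_EVAL.sub(sink + "(", js_source)


if __name__ == "__main__":
    packed = "eval(function(p,a,c,k,e,d){return p}('x',1,1,[],0,{}));"
    print(neutralise_eval(packed))
```

If the printed output is itself another obfuscated layer ending in eval, the same rewrite can be applied again to that output.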
Browser DevTools breakpoints
For inline obfuscation that does not pass through eval, set a breakpoint at the top of the obfuscation routine in the Sources panel of Chrome/Edge DevTools. Step through; inspect variables; the deobfuscated content sits in memory by the time the script reaches its final action.
What you should be comfortable with after this lesson
- Explaining the three-stage drive-by chain
- Spotting deceptive URL patterns (typo, homograph, subdomain abuse) at a glance
- Listing PDF-specific abuse vectors and the tools to find them
- Extracting and reading a VBA macro with olevba
- De-obfuscating a JavaScript payload using the eval-interception technique
References
pdfid, pdf-parser: the canonical PDF analysis suite.
olevba, mraptor, oleid: the standard kit for malicious Office document analysis.
Excellent practical reference for de-obfuscation patterns.
Submit suspicious documents; VT detonates them in sandboxes and reports observed behaviour.
Exercises
Identify the URL trick
For each URL, name the trick: (1) https://www.gogIe.com (capital I), (2) https://paypal.com.security-check.tk, (3) https://xn--ggle-0nda.com. Verify your answers using a Punycode converter.
Run pdfid on a PDF
Take any sample PDF (Didier Stevens publishes test files). Run pdfid. Identify whether it contains /OpenAction, /JS, or /JavaScript. If yes, extract the JS with pdf-parser.
De-obfuscate a real macro
Pick a malicious-document sample from MalwareBazaar (filter by tag = doc). Run olevba. Find the AutoOpen entry point. Deobfuscate the payload manually (or via olevba's --reveal flag). Identify the C2 URL.
