Malicious Web & Document Files: Phishing & Drive-By Downloads

Why documents and websites became the dominant delivery vector

Antivirus engines became effective at scanning executables. Application whitelisting blocked unknown EXEs. Network filters caught suspicious downloads. So attackers shifted: instead of making the victim run an executable, they make the victim open a document or visit a page. The document or page contains code that either exploits a vulnerability or social-engineers the user into authorising the next stage.

Today, a large share of initial-access compromises begin with one of three vectors: a malicious document attached to a phishing email, a drive-by download from a compromised website, or a malicious link delivered via collaboration tools. This lesson covers all three.

Drive-by downloads

A drive-by download is an attack in which visiting a web page is sufficient to infect the visitor's machine. The user does not click anything or download a file manually. The page contains code (typically in a hidden <iframe> or injected <script> tag) that exploits a browser or plugin vulnerability to execute a payload.

The typical infection chain has three stages:

The victim visits a legitimate website that has been compromised, for example through a stolen FTP credential, a vulnerable CMS plugin, or a malicious advertisement served through an ad network. The attacker has placed a hidden <iframe> pointing to an exploit kit (EK) server.
The victim's browser silently loads the iframe content from the EK server. The EK fingerprints the browser version, installed plugins (Flash, Java, Silverlight, in their day), and OS version.
If a matching exploit exists in the EK's arsenal, the EK delivers a payload (typically a dropper or RAT) that executes without user interaction.
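
From the defender's side, the first stage of this chain leaves a visible artefact: the hidden <iframe> in the compromised page. The following is a minimal triage sketch, not a complete detector. It assumes the page source is already available as a string, and the function name findHiddenIframes and the "hidden" heuristics (zero or one pixel dimensions, display:none, visibility:hidden) are illustrative choices rather than anything defined in this lesson. Regex-based HTML scanning is an approximation; a real tool would parse the DOM.

```javascript
// Triage sketch: list <iframe> tags that look deliberately hidden.
// Assumption: `html` is page source already fetched in a safe environment.
function findHiddenIframes(html) {
  const findings = [];
  for (const match of html.matchAll(/<iframe\b[^>]*>/gi)) {
    const tag = match[0];

    // Read one attribute value (double-quoted, single-quoted, or bare).
    const attr = (name) => {
      const pattern = new RegExp(
        String.raw`\s${name}\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))`, "i");
      const m = tag.match(pattern);
      return m ? (m[1] ?? m[2] ?? m[3] ?? "") : null;
    };

    // Normalise the inline style so "display: none" and "display:none" compare equal.
    const style = (attr("style") ?? "").toLowerCase().replace(/\s+/g, "");
    const reasons = [];

    for (const dim of ["width", "height"]) {
      const value = attr(dim);
      if (value !== null && parseInt(value, 10) <= 1) reasons.push(`${dim}<=1`);
      if (new RegExp(`(?:^|;)${dim}:[01](?:px)?(?:;|$)`).test(style)) {
        reasons.push(`style ${dim}<=1`);
      }
    }
    if (style.includes("display:none")) reasons.push("display:none");
    if (style.includes("visibility:hidden")) reasons.push("visibility:hidden");

    if (reasons.length > 0) findings.push({ src: attr("src"), reasons });
  }
  return findings;
}
```

A hit is only a lead: legitimate pages also use invisible iframes (for analytics or authentication), so the src value still has to be checked against known infrastructure.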

Exploit kits were server-side frameworks that automated this — fingerprinting visitors, selecting an appropriate exploit, delivering the payload, tracking infection statistics. Notable examples included Angler (peaked 2015), RIG, and Magnitude. The EK era declined as Flash and Java were retired from browsers, but the architecture survives in modern browser-based attacks via JavaScript and WebAssembly.

Malicious URL techniques

Attackers craft deceptive URLs to trick users into visiting malicious infrastructure:

Typosquatting — registers domains with minor spelling differences: gogle.com, amaz0n.com. Compare character-by-character; check WHOIS registration date.
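Homograph (IDN) attacks — substitutes Latin characters with visually identical Cyrillic or Greek characters. On the wire the domain is encoded as Punycode (xn--...). Browsers display the Punycode form only when their IDN policy flags the label (for example, mixed scripts), so a lookalike can still render in Unicode; any xn-- URL warrants scrutiny.
Subdomain abuse — the registrable domain is the rightmost part of the hostname, read back from the first slash after the scheme. paypal.com.evil-site.net is owned by evil-site.net, not PayPal.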
URL shorteners — hide the final destination. Use urlexpand.com or unfurl.io to resolve before clicking.
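
The first three checks in this list can be sketched in code. This is an illustrative sketch under stated assumptions: WATCHLIST is a hypothetical set of brands, the edit-distance threshold of 2 is arbitrary, and registrableDomain naively takes the last two labels, which is wrong for suffixes such as co.uk (a real implementation would consult the Public Suffix List). The only platform API used is the standard URL class.

```javascript
// Hypothetical watchlist of domains to protect (not from the lesson text).
const WATCHLIST = ["google.com", "amazon.com", "paypal.com"];

// Levenshtein distance: minimum single-character edits to turn a into b.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Naive assumption: the registrable domain is the last two labels.
function registrableDomain(hostname) {
  return hostname.split(".").slice(-2).join(".");
}

function inspectUrl(rawUrl) {
  const { hostname } = new URL(rawUrl);
  const domain = registrableDomain(hostname);
  const flags = [];

  // Homograph hint: any label stored in Punycode form.
  if (hostname.split(".").some((label) => label.startsWith("xn--"))) {
    flags.push("punycode-label");
  }
  for (const brand of WATCHLIST) {
    // Typosquatting hint: close to, but not equal to, a watched domain.
    const distance = editDistance(domain, brand);
    if (distance > 0 && distance <= 2) flags.push(`lookalike-of:${brand}`);
    // Subdomain abuse hint: the brand appears to the left of another domain.
    if (domain !== brand && hostname.includes(brand + ".")) {
      flags.push(`brand-in-subdomain:${brand}`);
    }
  }
  return { hostname, domain, flags };
}
```

For example, inspectUrl("https://paypal.com.evil-site.net/login") reports evil-site.net as the domain and raises the brand-in-subdomain flag. Shortened URLs are outside the scope of this sketch, since resolving them requires a network request.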

Weaponised PDFs

PDF files are deceptively powerful. They support embedded JavaScript, embedded files, automatic action triggers, and rich interactive forms — all useful for legitimate workflows, all abusable.

The two PDF-specific abuse vectors:

/OpenAction tag — an entry in the document catalog naming an action that fires automatically when the document is opened. An attacker points it at a JavaScript action or a /Launch action. The user opens the PDF; the action runs without further interaction, subject to whatever prompts the reader enforces.
/JS (JavaScript) blocks — Adobe Reader and some other PDF readers execute JavaScript embedded in PDFs. Combined with an /OpenAction trigger, the script runs on document open; paired with an exploit for a flaw in the reader's JavaScript engine, that becomes arbitrary code execution.
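
A first-pass check for these vectors is simply to count the relevant names in the raw file. The sketch below is in the spirit of keyword-counting triage but is not a reimplementation of any existing tool. The name list and the function names are assumptions made for illustration, and the scan does not look inside compressed object streams, so a zero count is not proof of a clean file.

```javascript
// Names worth a closer look; /AA, /Launch and /EmbeddedFile are additions
// beyond the two vectors discussed above.
const SUSPICIOUS_NAMES = ["/OpenAction", "/AA", "/JS", "/JavaScript", "/Launch", "/EmbeddedFile"];

// PDF names may hex-escape any character as #xx, so /J#53 is the same name as /JS.
function decodeNameEscapes(text) {
  return text.replace(/#([0-9A-Fa-f]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

// Count occurrences of each suspicious name in the raw bytes of a PDF.
function countPdfKeywords(pdfBytes) {
  // latin1 maps every byte to one character, so binary streams cannot break decoding.
  const text = decodeNameEscapes(Buffer.from(pdfBytes).toString("latin1"));
  const counts = {};
  for (const name of SUSPICIOUS_NAMES) {
    // The lookahead stops /JS from matching a longer name such as /JSON.
    const pattern = new RegExp(name + "(?![A-Za-z0-9])", "g");
    counts[name] = (text.match(pattern) || []).length;
  }
  return counts;
}
```

In use, the bytes would come from something like fs.readFileSync(path) inside an analysis VM. A document with both /OpenAction and /JS or /JavaScript present deserves object-level inspection.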

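A PDF polyglot is a file that is simultaneously a valid PDF and a valid file of another format, typically a JAR or ZIP. The PDF reader sees a PDF; a different parser, such as a Java runtime or an archive tool, sees a JAR or ZIP. This works because many PDF readers tolerate extra data around the PDF structure, while ZIP-based formats are located from the end of the file. Polyglots evade format-based filtering.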
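
A PDF/ZIP polyglot can often be spotted by checking for both formats' markers in one file. The sketch below assumes two things that should be verified against the parsers in question: that the reader accepts a %PDF- header anywhere in the first 1024 bytes, and that the ZIP parser finds the end-of-central-directory record by scanning back from the end of the file. The function name classifyPolyglot is illustrative.

```javascript
const PDF_MAGIC = Buffer.from("%PDF-", "latin1");
const ZIP_EOCD = Buffer.from([0x50, 0x4b, 0x05, 0x06]); // "PK\x05\x06"
// The EOCD record is 22 bytes plus an optional comment of up to 65535 bytes.
const MAX_EOCD_SPAN = 22 + 0xffff;

function classifyPolyglot(bytes) {
  const buf = Buffer.from(bytes);
  // Assumption: a PDF header near the start is enough for a lenient reader.
  const isPdf = buf.subarray(0, 1024).indexOf(PDF_MAGIC) !== -1;
  // Search only the tail of the file, where a ZIP parser would look.
  const tail = buf.subarray(Math.max(0, buf.length - MAX_EOCD_SPAN));
  const isZip = tail.lastIndexOf(ZIP_EOCD) !== -1;
  return { isPdf, isZip, polyglot: isPdf && isZip };
}
```

This is a heuristic: the four EOCD bytes can occur by chance inside a binary PDF stream, so a positive result should be confirmed by actually listing the archive in a sandbox.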

Tools: pdfid and pdf-parser (Didier Stevens) for static structural analysis; peepdf for interactive PDF exploration.

Weaponised RTF

RTF (Rich Text Format) was once a benign Microsoft text format. It became a major attack vector because of Equation Editor, an old Microsoft component (EQNEDT32.EXE) bundled with Office. CVE-2017-11882 and CVE-2018-0802 were memory corruption vulnerabilities in Equation Editor; an RTF document containing a crafted equation object corrupts Equation Editor's memory in a way that yields code execution.

Microsoft eventually removed Equation Editor from supported Office versions with the January 2018 security updates. But RTF remains an attractive vector because:

Many security tools treat RTF as low-risk text.
RTF can embed OLE objects, including ActiveX controls.
Antivirus parsers struggle with RTF's permissive syntax — multiple ways exist to encode the same content, defeating signatures.
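
The encoding problem in the last point is easy to see with embedded objects: the hex data after \objdata may be broken up by whitespace, so a byte-level signature for the object's class name never matches the file as stored. The sketch below normalises only that one trick. It assumes the class name appears as plain ASCII in the decoded object header, the function name inspectRtf is illustrative, and real samples use further obfuscation (nested groups, stray control words) that this does not handle.

```javascript
// Sketch: locate \objdata blobs in RTF text, undo whitespace splitting,
// and report the embedded object's class name.
function inspectRtf(text) {
  // Lenient header check on the assumption that Word accepts less than the full {\rtf1.
  const result = { isRtf: /^\s*\{\\rt/.test(text), objects: [] };

  for (const match of text.matchAll(/\\objdata\s*([0-9A-Fa-f\s]+)/g)) {
    // Whitespace inside the hex run is ignored by the RTF parser; strip it.
    let hex = match[1].replace(/\s+/g, "");
    if (hex.length % 2 === 1) hex = hex.slice(0, -1); // drop a dangling nibble

    const decoded = Buffer.from(hex, "hex").toString("latin1");
    // First printable run that looks like a class name (heuristic).
    const className = decoded.match(/[A-Za-z][A-Za-z0-9_.]{3,}/);

    result.objects.push({
      bytes: hex.length / 2,
      className: className ? className[0] : null,
      equationEditor: /Equation\.\d/i.test(decoded),
    });
  }
  return result;
}
```

An object whose class name is Equation.3 in a document received by email is a strong reason to continue with a dedicated RTF object extractor rather than this sketch.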

Microsoft Office macros

The textbook vector. Office documents support VBA macros — full programs that run inside Word, Excel, or PowerPoint. Malicious documents persuade the user (via a fake "this document was created in a newer version, click Enable Content to view") to enable macros. The macro then executes whatever the attacker wrote.

Common macro patterns:

AutoOpen / Document_Open / Workbook_Open — VBA procedures that Office runs automatically when the document opens (after the user enables macros).
Shell() — VBA function that spawns a process. Often used to invoke powershell.exe or cmd.exe with arguments to download a stage-2 payload.
VBA Stomping (P-Code attacks) — the source code in the file may be benign decoy text, while the compiled P-Code is the actual malicious logic. The mismatch defeats source-code-only analysis.
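
Once the macro source has been extracted, the first two patterns can be checked mechanically. The sketch below is a simplified heuristic, not the logic of any particular tool: it treats a macro as suspicious when an auto-execution trigger and an execution keyword appear together. The keyword lists and the function name triageMacro are assumptions for illustration. Because it reads source text only, it is blind to VBA stomping and to strings assembled at run time.

```javascript
// Procedure names Office runs automatically (illustrative, not exhaustive).
const AUTO_EXEC = /\b(?:Auto_?Open|Document_Open|Workbook_Open|Auto_?Close|Document_Close)\b/gi;
// Keywords associated with spawning processes or fetching files (illustrative).
const EXECUTION = /\b(?:Shell|CreateObject|WScript\.Shell|powershell|cmd\.exe|URLDownloadToFile)\b/gi;

// Collect unique, lower-cased matches of a global regex in the source text.
function uniqueMatches(source, pattern) {
  const hits = new Set();
  for (const match of source.matchAll(pattern)) hits.add(match[0].toLowerCase());
  return [...hits];
}

// Input: VBA source text as extracted from a document by a macro extractor.
function triageMacro(vbaSource) {
  const autoExec = uniqueMatches(vbaSource, AUTO_EXEC);
  const execution = uniqueMatches(vbaSource, EXECUTION);
  return {
    autoExec,
    execution,
    // Either signal alone is common in benign macros; together they merit review.
    suspicious: autoExec.length > 0 && execution.length > 0,
  };
}
```

A result with suspicious set to false means only that this source text lacks the listed keywords; the compiled P-Code still has to be compared against it.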

Tools: oletools suite (Philippe Lagadec) — olevba extracts macros, oleid summarises document risk, mraptor flags suspicious patterns.

JavaScript de-obfuscation

Both browser exploits and embedded document scripts arrive heavily obfuscated. Two practical techniques:

The eval-interception principle

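Most obfuscated JavaScript ends with eval(decoded) or a functionally equivalent dynamic execution call such as new Function(decoded)(). Replace the final eval with console.log and run the obfuscated code in an isolated analysis environment: the decoding layers still execute, but the decoded payload prints to the console instead of running. Avoid document.write for this purpose, because writing the payload into a page lets any <script> it contains run. Payloads are often nested, so repeat the step on each printed layer.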

// Original obfuscated:
eval(function(p,a,c,k,e,d){...}(...));

// Modified for analysis:
console.log(function(p,a,c,k,e,d){...}(...));

Browser DevTools breakpoints

For inline obfuscation that does not pass through eval, set a breakpoint at the top of the obfuscation routine in the Sources panel of Chrome/Edge DevTools. Step through; inspect variables; the deobfuscated content sits in memory by the time the script reaches its final action.

What you should be comfortable with after this lesson

Explaining the three-stage drive-by chain
Spotting deceptive URL patterns (typo, homograph, subdomain abuse) at a glance
Listing PDF-specific abuse vectors and the tools to find them
Extracting and reading a VBA macro with olevba
De-obfuscating a JavaScript payload using the eval-interception technique

