PHP Regular Expressions
Introduction to Regular Expressions
Regular expressions (regex or regexp) are powerful pattern-matching tools that allow you to search, match, and manipulate text using specific patterns. In PHP, regular expressions are implemented through PCRE (Perl Compatible Regular Expressions), providing a robust and flexible system for text processing tasks.
Understanding regular expressions is essential for PHP developers, as they enable sophisticated data validation, text parsing, content filtering, and string manipulation that would be complex or impossible with basic string functions.
Why Regular Expressions Matter
Data Validation: Regular expressions excel at validating complex data formats like emails, phone numbers, credit cards, and custom business rules with precise pattern matching.
Text Processing: They enable sophisticated text manipulation including search and replace operations, content extraction, and data parsing from various sources.
Input Sanitization: Regex patterns help identify and filter potentially harmful input, contributing to application security and data integrity.
Content Analysis: Regular expressions can extract specific information from large text blocks, parse logs, and analyze user-generated content.
Performance: When used correctly, regex can be more efficient than multiple string function calls for complex pattern matching tasks.
Regular Expression Components
Understanding the building blocks of regular expressions is crucial for writing effective patterns:
Patterns: The core regex pattern that defines what to match, enclosed in delimiters (typically `/`). The delimiters separate the pattern from modifiers and prevent ambiguity.
Modifiers: Flags that change how the pattern matching behaves, such as case-insensitive matching (`i`) or multiline mode (`m`). These appear after the closing delimiter.
Metacharacters: Special characters with specific meanings in regex patterns, like `.` (any character except a newline, by default) or `*` (zero or more). These must be escaped with a backslash when matching literally.
Character Classes: Groups of characters defined within square brackets, like `[0-9]` for digits. You can also use predefined classes like `\d` for digits or `\w` for word characters.
Quantifiers: Specify how many times a pattern should match, such as `+` (one or more) or `{3,5}` (between 3 and 5 times). Quantifiers are greedy by default (they match as much as possible); appending `?` makes them lazy.
Anchors: Position markers like `^` (start of string) and `$` (end of string) that specify where matches should occur. These don't consume characters but assert positions.
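The components listed above can be seen side by side in a short sketch. The sample strings and variable names here are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$line = "Order ID: ab-1234\nstatus: shipped";

// Delimiters '/' plus the 'i' modifier: case-insensitive match.
$hasOrder = preg_match('/order id/i', $line);

// Character class + quantifiers: 2 lowercase letters, a hyphen, 3 to 5 digits.
preg_match('/[a-z]{2}-\d{3,5}/', $line, $id);

// Metacharacter vs. escaped metacharacter: '.' matches any character,
// '\.' matches only a literal dot.
$anyChar    = preg_match('/1.5/', '1x5');
$literalDot = preg_match('/1\.5/', '1x5');

// Anchor '^' with and without the 'm' modifier: with 'm', '^' also
// matches at the start of each line.
$withoutM = preg_match('/^status/', $line);
$withM    = preg_match('/^status/m', $line);

// Greedy vs. lazy quantifier.
preg_match('/<.+>/', '<b>bold</b>', $greedy);
preg_match('/<.+?>/', '<b>bold</b>', $lazy);

echo $id[0], "\n";     // ab-1234
echo $greedy[0], "\n"; // <b>bold</b>
echo $lazy[0], "\n";   // <b>
```

The greedy pattern consumes as much as it can before backtracking to the last `>`, while the lazy version stops at the first one.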
PCRE vs. Other Regex Flavors
PHP uses PCRE, which is largely compatible with Perl regex but includes some differences from other implementations like JavaScript or Python regex. Understanding PCRE-specific features helps you write more effective patterns and avoid compatibility issues.
Key PCRE features include:
- Named capture groups: `(?P<name>pattern)` or `(?<name>pattern)`
- Atomic groups: `(?>pattern)` for preventing backtracking
- Possessive quantifiers: `++`, `*+`, `?+` for performance optimization
- Unicode support with the `u` modifier
- Extended features like conditional patterns and recursion
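A few of these features are easier to judge once you see how they change a match result. The following is a minimal sketch with hypothetical inputs; it assumes PHP 7 or later for the `\u{...}` string escape.

```php
<?php
// Named capture group: the date parts are available by name.
preg_match('/(?<year>\d{4})-(?<month>\d{2})/', 'Released 2024-03', $date);

// Possessive quantifier: '\d++' never gives digits back, so the
// trailing '\d' has nothing left to match.
$greedy     = preg_match('/^\d+\d$/', '12345');   // backtracks one digit
$possessive = preg_match('/^\d++\d$/', '12345');  // no backtracking

// Atomic group: same effect as the possessive form above.
$atomic = preg_match('/^(?>\d+)\d$/', '12345');

// 'u' modifier: pattern and subject are treated as UTF-8, so '.'
// matches one character instead of one byte.
$eAcute = "\u{00E9}";                    // 'e' with acute accent, two bytes in UTF-8
$bytes  = preg_match('/^.$/', $eAcute);
$chars  = preg_match('/^.$/u', $eAcute);

echo "{$date['year']}-{$date['month']}\n";         // 2024-03
echo "$greedy $possessive $atomic $bytes $chars\n"; // 1 0 0 0 1
```

Possessive quantifiers and atomic groups change what can match, not only how fast, so it is worth testing a pattern after converting it.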
Basic Pattern Matching
Fundamental PCRE Functions
PHP provides several functions for working with regular expressions, each serving different purposes. Understanding when to use each function is key to effective regex programming.
```php
<?php
/**
 * PHP Regular Expression Basics
 *
 * PCRE functions provide comprehensive pattern matching capabilities
 * for searching, validating, and manipulating text data.
 */

/**
 * Basic pattern matching with preg_match()
 *
 * preg_match() performs a single match and returns 1 if found,
 * 0 if not found, or FALSE on error.
 */
function demonstrateBasicMatching(): void
{
    $text = "The quick brown fox jumps over the lazy dog";

    // Simple word matching
    if (preg_match('/fox/', $text)) {
        echo "Found 'fox' in the text\n";
    }

    // Case-insensitive matching with 'i' modifier
    if (preg_match('/FOX/i', $text)) {
        echo "Found 'FOX' (case-insensitive)\n";
    }
}
```
**Understanding preg_match()**:
The `preg_match()` function is the workhorse of PHP regex, used for:
- **Single Match Detection**: Returns after finding the first match
- **Validation**: Perfect for checking if input matches a pattern
- **Data Extraction**: Can capture matched groups for further processing
- **Performance**: Stops searching after first match, making it efficient for existence checks
Return values have specific meanings:
- `1`: Pattern matched successfully
- `0`: No match found
- `FALSE`: An error occurred (invalid pattern, etc.)
Always check for `FALSE` explicitly with `===` to distinguish from `0`.
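One way to apply the strict `===` check is to wrap `preg_match()` in a small helper that turns the three-way result into a boolean and raises an exception on error. This is a sketch under assumptions: `matchesPattern()` is a hypothetical name, not a PHP built-in, and throwing `InvalidArgumentException` is just one possible error-handling choice.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): returns true on a match,
 * false on no match, and throws when preg_match() reports an error.
 */
function matchesPattern(string $pattern, string $subject): bool
{
    // '@' suppresses the warning PHP emits for an invalid pattern,
    // so the failure is reported through the exception instead.
    $result = @preg_match($pattern, $subject);

    if ($result === false) {
        throw new InvalidArgumentException("Regex error for pattern: $pattern");
    }

    return $result === 1;
}

var_dump(matchesPattern('/fox/', 'the quick fox')); // bool(true)
var_dump(matchesPattern('/cat/', 'the quick fox')); // bool(false)

try {
    matchesPattern('/[a-z/', 'abc'); // unterminated character class
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

A loose check such as `if (!preg_match(...))` would treat the error case the same as "no match", which can hide a broken pattern.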
```php
// Pattern with capture groups
$text = "The quick brown fox jumps over the lazy dog";
$pattern = '/(\w+)\s+brown\s+(\w+)/';

if (preg_match($pattern, $text, $matches)) {
    echo "Full match: {$matches[0]}\n";          // quick brown fox
    echo "Word before 'brown': {$matches[1]}\n"; // quick
    echo "Word after 'brown': {$matches[2]}\n";  // fox
}
```
**Capture Groups Explained**:
Parentheses create capture groups that extract parts of the match:
- `$matches[0]`: Always contains the full match
- `$matches[1]`, `$matches[2]`, etc.: Contain captured subpatterns
- Groups are numbered left-to-right by opening parenthesis position
- Non-capturing groups `(?:...)` match without storing results
This is invaluable for parsing structured data where you need specific parts.
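The numbering rule is easiest to see with nested groups. The sketch below uses a hypothetical log-style line; the pattern and variable names are illustrative assumptions, not taken from a real log format.

```php
<?php
// Hypothetical log-style input, chosen only for illustration.
$entry = '2024-03-15 ERROR disk full';

// Group 1 wraps the whole date; groups 2 to 4 are nested inside it.
// '(?:ERROR|WARN)' groups the alternatives without taking a number,
// so the message becomes group 5.
$pattern = '/((\d{4})-(\d{2})-(\d{2}))\s+(?:ERROR|WARN)\s+(.+)/';

if (preg_match($pattern, $entry, $m) === 1) {
    echo "Date:    {$m[1]}\n"; // 2024-03-15
    echo "Year:    {$m[2]}\n"; // 2024
    echo "Month:   {$m[3]}\n"; // 03
    echo "Day:     {$m[4]}\n"; // 15
    echo "Message: {$m[5]}\n"; // disk full
}
```

Counting opening parentheses from the left, and skipping any that start with `(?:`, gives the index of each group.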
```php
// Using anchors
$email = "user@example.com";

// Match entire string (anchored)
if (preg_match('/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/', $email)) {
    echo "Valid email format\n";
}
```
**Anchors and Full String Matching**:
The `^` and `$` anchors are crucial for validation:
- Without anchors: Pattern can match anywhere in the string
- With anchors: Entire string must match the pattern
- This prevents partial matches, such as a valid-looking address embedded in a longer, invalid string
The email pattern breakdown:
- `[a-zA-Z0-9._%+-]+`: Local part (before @)
- `@`: Literal @ symbol
- `[a-zA-Z0-9.-]+`: Domain name
- `\.`: Escaped dot (literal period)
- `[a-zA-Z]{2,}`: Top-level domain (at least 2 letters)
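The effect of anchoring can be checked directly by running the same core pattern with and without `^` and `$`. This is a minimal sketch with hypothetical inputs. It also illustrates one PCRE detail worth knowing: by default `$` also matches just before a final newline, whereas `\z` matches only at the very end of the subject.

```php
<?php
// Same email pattern as above, without delimiters or anchors.
$core = '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}';

// Hypothetical inputs, chosen only for illustration.
$embedded = 'see user@example.com <script>';
$trailing = "user@example.com\n";

// Unanchored: finds the address inside the surrounding text.
$unanchored = preg_match('/' . $core . '/', $embedded);

// Anchored: the whole string must match, so this fails.
$anchored = preg_match('/^' . $core . '$/', $embedded);

// '$' tolerates one trailing newline; '\z' does not.
$dollar = preg_match('/^' . $core . '$/', $trailing);
$strict = preg_match('/^' . $core . '\z/', $trailing);

echo "$unanchored $anchored $dollar $strict\n"; // 1 0 1 0
```

For strict validation, ending the pattern with `\z` (or adding the `D` modifier) is a reasonable default.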
```php
// Pattern with named capture groups (PHP 5.2.2+)
$email = "user@example.com";
$pattern = '/(?P<username>\w+)@(?P<domain>[a-zA-Z0-9.-]+)/';

if (preg_match($pattern, $email, $matches)) {
    echo "Username: {$matches['username']}\n";
    echo "Domain: {$matches['domain']}\n";
}
```
**Named Capture Groups**:
Named groups improve code readability and maintenance:
- Access by name: `$matches['username']` instead of `$matches[1]`
- Self-documenting: Pattern shows what each group captures
- Position-independent: Can reorder groups without breaking code
- Both numeric and named access work: `$matches[1]` equals `$matches['username']`
```php
/**
 * Finding all matches with preg_match_all()
 *
 * preg_match_all() finds all matches in a string and can return
 * results in different formats based on flags.
 */
function demonstrateMatchAll(): void
{
    $text = "Contact us at support@example.com or sales@example.com for assistance";

    // Find all email addresses
    $emailPattern = '/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/';

    if (preg_match_all($emailPattern, $text, $matches)) {
        echo "Found " . count($matches[0]) . " email addresses:\n";
        foreach ($matches[0] as $email) {
            echo "- $email\n";
        }
    }
}
```
**preg_match_all() vs preg_match()**:
Key differences:
- **Multiple Matches**: Finds all occurrences, not just the first
- **Result Structure**: Fills `$matches` with an array of arrays and returns the number of full matches
- **Use Cases**: Data extraction, counting occurrences, parsing multiple items
- **Performance**: Scans entire string, potentially slower on large texts
The default result format groups by capture group:
- `$matches[0]`: Array of all full matches
- `$matches[1]`: Array of all first capture group matches
- And so on...
```php
// Using capture groups with PREG_SET_ORDER flag
$phoneText = "Call (555) 123-4567 or (555) 987-6543 for more information";
$phonePattern = '/\((\d{3})\)\s*(\d{3})-(\d{4})/';

if (preg_match_all($phonePattern, $phoneText, $matches, PREG_SET_ORDER)) {
    echo "\nPhone numbers found:\n";
    foreach ($matches as $match) {
        echo "Full: {$match[0]}, Area: {$match[1]}, Exchange: {$match[2]}, Number: {$match[3]}\n";
    }
}
```
**Result Ordering with Flags**:
`PREG_SET_ORDER` changes how results are organized:
- Default: Groups by capture group number
- `PREG_SET_ORDER`: Groups by match occurrence
- Makes iteration over complete matches easier
- Better for processing each match as a unit
Phone pattern breakdown:
- `\(`: Escaped opening parenthesis (literal)
- `(\d{3})`: Capture exactly 3 digits (area code)
- `\)\s*`: Closing parenthesis followed by optional whitespace
- `(\d{3})`: Capture 3 digits (exchange)
- `-`: Literal hyphen
- `(\d{4})`: Capture 4 digits (number)
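To make the difference between the two orderings concrete, the sketch below runs the same phone pattern twice and shows how the result array is shaped in each case. The shortened sample text and the variable names `$byGroup` and `$byMatch` are illustrative.

```php
<?php
$text = 'Call (555) 123-4567 or (555) 987-6543';
$pattern = '/\((\d{3})\)\s*(\d{3})-(\d{4})/';

// Default ordering (PREG_PATTERN_ORDER): one sub-array per capture group.
$count = preg_match_all($pattern, $text, $byGroup);
// $byGroup[0] holds both full matches, $byGroup[2] holds both exchanges.

// PREG_SET_ORDER: one sub-array per match occurrence.
preg_match_all($pattern, $text, $byMatch, PREG_SET_ORDER);
// $byMatch[0] holds the full first match followed by its three groups.

echo "Matches: $count\n";                       // Matches: 2
echo implode(', ', $byGroup[2]), "\n";          // 123, 987
echo implode(' | ', $byMatch[1]), "\n";         // (555) 987-6543 | 555 | 987 | 6543
```

The default ordering suits cases where you want one column of values (for example, every area code), while `PREG_SET_ORDER` suits row-by-row processing.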
```php
/**
 * Pattern validation examples
 */
function demonstrateValidation(): void
{
    // Common validation patterns
    $patterns = [
        'email' => '/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/',
        'phone' => '/^\+?1?[-.\s]?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})$/',
        'zip_code' => '/^\d{5}(-\d{4})?$/',
        'credit_card' => '/^\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}$/',
        'username' => '/^[a-zA-Z0-9_]{3,20}$/',
        'password' => '/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/',
        'url' => '/^https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)$/',
        'ipv4' => '/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/'
    ];

    $testData = [
        'email' => ['user@example.com', 'invalid.email', 'first.last@example.org'],
        'phone' => ['(555) 123-4567', '555-123-4567', '+1-555-123-4567', '123-456'],
        'zip_code' => ['12345', '12345-6789', '1234', 'ABCDE'],
        'username' => ['john_doe', 'user123', 'ab', 'very_long_username_that_exceeds_limit'],
        'url' => ['https://example.com', 'http://www.example.com/path', 'not-a-url', 'ftp://example.com']
    ];

    foreach ($testData as $type => $values) {
        echo "\nValidating $type:\n";
        foreach ($values as $value) {
            // Compare against 1 so that an error (FALSE) is not mistaken for a match
            $isValid = preg_match($patterns[$type], $value) === 1;
            $status = $isValid ? 'VALID' : 'INVALID';
            echo " $value: $status\n";
        }
    }
}

// Run demonstrations
demonstrateBasicMatching();
echo "\n" . str_repeat('-', 50) . "\n";
demonstrateMatchAll();
echo "\n" . str_repeat('-', 50) . "\n";
demonstrateValidation();
?>
```
Advanced Pattern Techniques
Advanced regex techniques allow for more sophisticated pattern matching and better performance. Understanding these concepts elevates your regex skills from basic to professional level.
```php
<?php
/**
 * Advanced Regular Expression Techniques
 *
 * These examples demonstrate more sophisticated regex patterns
 * including lookaheads, lookbehinds, and complex matching scenarios.
 */

/**
 * Lookaheads and lookbehinds for context-sensitive matching
 */
function demonstrateLookarounds(): void
{
    // Positive lookahead: Match 'Java' only if followed by 'Script'
    $text = "I know Java, JavaScript, and JavaBeans";
    $pattern = '/Java(?=Script)/';

    if (preg_match_all($pattern, $text, $matches)) {
        echo "Found 'Java' followed by 'Script': " . count($matches[0]) . " times\n";
    }
}
```
**Lookaround Assertions Explained**:
Lookarounds are zero-width assertions that check context without consuming characters:
**Positive Lookahead `(?=...)`**:
- Matches if pattern ahead exists
- Doesn't include lookahead in match
- Useful for: Password validation, conditional matching
- Example: Match 'Java' only in 'JavaScript'
**Negative Lookahead `(?!...)`**:
- Matches if pattern ahead doesn't exist
- Excludes unwanted contexts
- Example: Match 'Java' but not in 'JavaScript'
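The negative lookahead and the password use case mentioned above can be sketched as follows. The password rule here (lowercase, uppercase, digit, at least 8 characters) is an illustrative assumption, not a recommended policy.

```php
<?php
$text = 'I know Java, JavaScript, and JavaBeans';

// Negative lookahead: 'Java' NOT followed by 'Script'.
// PREG_OFFSET_CAPTURE records where each match starts.
$count = preg_match_all('/Java(?!Script)/', $text, $m, PREG_OFFSET_CAPTURE);
$offsets = array_column($m[0], 1);

echo "Matches: $count\n"; // Matches: 2 ('Java,' and 'JavaBeans')

// Stacked positive lookaheads: each one checks a separate requirement
// from the start of the string without consuming any characters.
$rule = '/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/';

$strong = preg_match($rule, 'Passw0rdX'); // has lower, upper, digit, 9 chars
$weak   = preg_match($rule, 'password');  // no uppercase letter, no digit

echo "$strong $weak\n"; // 1 0
```

Because lookaheads are zero-width, every check in the password rule starts from the same position, which is why the requirements can appear in any order in the input.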
```php
// Positive lookbehind: Match numbers preceded by '$'
// Single quotes keep the '$' signs literal
$priceText = 'Items cost $19.99, €25.50, and $45.00';
$pattern = '/(?<=\$)\d+\.\d{2}/';

if (preg_match_all($pattern, $priceText, $matches)) {
    echo "USD prices found:\n";
    foreach ($matches[0] as $price) {
        echo "- \$$price\n";
    }
}
```
**Lookbehind Assertions**:
**Positive Lookbehind `(?<=...)`**:
- Matches if pattern behind exists
- Perfect for currency, units, prefixes
- Must be fixed-width in PCRE (no `*` or `+`), though each top-level alternative may have its own fixed length
- Example: Extract prices after dollar signs
**Negative Lookbehind `(?<!...)`**:
- Matches if pattern behind doesn't exist
- Useful for excluding specific contexts
- Example: Match words not preceded by specific characters
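A negative lookbehind can be sketched with the same price text: instead of extracting amounts after a dollar sign, it extracts amounts that are not in dollars. The `\b` word boundaries are an addition of this sketch. Without them, the engine could start a match in the middle of a number (for example at the second digit of `19.99`), where the preceding character is no longer `$`.

```php
<?php
// Single quotes keep the '$' signs literal in the sample text.
$priceText = 'Items cost $19.99, €25.50, and $45.00';

// '(?<!\$)' rejects a number directly preceded by '$';
// '\b' prevents a match from starting or ending inside a number.
$pattern = '/(?<!\$)\b\d+\.\d{2}\b/';

preg_match_all($pattern, $priceText, $matches);

foreach ($matches[0] as $amount) {
    echo "Non-USD amount: $amount\n"; // Non-USD amount: 25.50
}
```

This kind of boundary guard is a common companion to negative lookarounds, since the assertion only inspects the position where the match attempt starts.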
```php
// Complex example: Extract function names from PHP code
$phpCode = '
function getUserData($id) { }
private function validateInput($data) { }
public static function formatDate($date) { }
';

// Match function names with visibility and static modifiers
$pattern = '/(?:(?:public|private|protected)?\s*(?:static\s+)?function\s+)(\w+)/';

if (preg_match_all($pattern, $phpCode, $matches)) {
    echo "Function names found:\n";
    foreach ($matches[1] as $functionName) {
        echo "- $functionName\n";
    }
}
```
**Non-Capturing Groups `(?:...)`**:
Benefits of non-capturing groups:
- **Performance**: Slightly faster (no storage overhead)
- **Cleaner Results**: Don't clutter the `$matches` array
- **Logical Grouping**: Group patterns without side effects
- **Optional Sections**: Make entire sections optional
The PHP function pattern uses several techniques:
- Optional visibility modifiers
- Optional static keyword
- Required 'function' keyword
- Captures only the function name
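The "cleaner results" and "optional sections" points can be checked by running a capturing and a non-capturing version of the same pattern. The name and the simplified function-signature strings below are hypothetical inputs for illustration.

```php
<?php
$name = 'Dr. Smith';

// Capturing group: the title takes slot 1, pushing the surname to slot 2.
preg_match('/(Mr|Ms|Dr)\.\s+(\w+)/', $name, $capturing);

// Non-capturing group: same alternation, but only the surname is stored.
preg_match('/(?:Mr|Ms|Dr)\.\s+(\w+)/', $name, $nonCapturing);

echo count($capturing), ' vs ', count($nonCapturing), "\n"; // 3 vs 2

// Optional section: the whole 'static ' prefix may be present or absent,
// and the function name stays in slot 1 either way.
$signature = '/^(?:static\s+)?function\s+(\w+)/';
preg_match($signature, 'static function run', $withStatic);
preg_match($signature, 'function run', $withoutStatic);

echo $withStatic[1], ' ', $withoutStatic[1], "\n"; // run run
```

Keeping the index of the group you care about stable is often the more practical benefit; the performance difference is usually small.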
// ... existing code ...