PHP의 정규식 파서?

PHP의 정규 표현식을 구문 분석해야합니다. 정규 표현식을 작성하거나 실행하는 데 아무런 문제가 없지만 정규 표현식에 대한 정보를 표시하려고합니다 (예 : 캡처 그룹 나열, 반복 문자 첨부). 전체 프로젝트는 WordPress 용 플러그인으로, 대체 패턴이있는 정규식 인 다시 쓰기 규칙에 대한 정보를 제공하며 이해하기가 쉽지 않습니다.PHP의 정규식 파서?

나는 자신이 작성한 간단한 정규 표현식을 처리하는 것으로 보이는 문구를 a simple implementation이라고 쓰고 있습니다. 이 예제를 확장하여 정규식 구문을 더 많이 지원하기 전에 내가 볼 수있는 다른 좋은 구현이 있는지 알고 싶습니다. 구현 언어는 별 문제가되지 않습니다. 필자는 대부분의 파서가 일치 속도를 최적화하기 위해 작성되었다고 가정하지만 이는 중요하지 않으며 명확성을 저해 할 수도 있습니다.

출처

2011-01-04 Jan Fabry

정규식을 사용해 보셨습니까? 오, 아니, 이미 12 가지 문제가 있습니다. O –

@Ivo : 사실, 처음 구현 한 것은 정규 표현식을 기반으로했지만 너무 복잡해져서 간단한 문자 기반 루프로 전환하지 않았습니다. –

당신은 이것을 http://xenon.stanford.edu/~xusch/regexp/analyzer.html 과 같이 구현하려고합니다. –

글쎄, 당신은 PHP에서 정규식 함수의 구현을 살펴볼 수 있습니다. PHP는 공개 소스 프로젝트이기 때문에 모든 소스와 문서를 공개 할 수 있습니다.

출처

2011-01-04 13:39:14 Ivan

감사합니다. 그러나 PCRE 라이브러리 (PHP가 사용하는)는 속도가 매우 최적화되어 있으므로 내 요구에 적합하지 않습니다. . –

필요한 것은 문법과 파서를 생성하는 방법입니다. 파서를 만드는 가장 쉬운 방법은 대상 언어 (예 : PHP)에서 직접 재귀 적 강하를 코딩하는 것입니다.이 문법에서는 구문 분석기와 똑같은 모양의 깨끗한 파서를 만들 수 있습니다 (파서의 유지 관리도 가능함). 당신이 문법을 일단이에 할 방법에 대한 자세한 내용의

로트, 내 SO description of how to build recursive descent parsers 및

, 간단한 문법 (당신이 생각했던 아마 사람) 인 정규식 문법에 관해서는 additional theory details here에 제공됩니다

REGEX = ALTERNATIVES ; 
ALTERNATIVES = TERM ('|' TERM)* ; 
TERM = '(' ALTERNATIVES ')' | CHARACTER | SET | TERM ('*' | '+' | '?') ; 
SET = '~' ? '[' (CHARACTER | CHARACTER '-' CHARACTER)* ']' ; 
CHARACTER = 'A' | 'B' | ... | '0' ... '9' | ... ;

이 문법을 처리하기 위해 PHP로 작성된 재귀 적 파서는 몇백 줄 정도되어야합니다.

이 장소를 시작으로 PHP Regexes의 기능을 추가 할 수 있어야합니다.

해피 파싱!

출처

2011-03-25 02:37:19

내가 지난 여름에 한 프로젝트에 관심이있을 수 있습니다. 그것은 PCRE 호환 정규 표현식의 동적 구문 강조 기능을 제공하는 자바 스크립트 프로그램입니다 :

참조 : Dynamic (?:Regex Highlighting)++ with Javascript!
과 associated tester page
과 GitHub project page

코드 사용 (자바 스크립트)을 (PCRE를 따로 선택하는 정규 표현식) regex를 다양한 부분에 적용하고 마크 업을 적용하여 사용자가 다양한 구성 요소에 마우스를 올리고 대괄호를보고 그룹 번호를 캡처 할 수 있도록합니다.

는 (나는 내가 어떤 더 잘 알고하지 않았기 때문에 정규 표현식을 사용하여 작성! 8 ^)

출처

2011-03-25 21:11:47 ridgerunner

내가 PHP에 액션 1/2 정규 표현식 라이브러리를 번역하려고합니다. 이전 버전의 Flash에는 원시 정규 표현식 지원이 없었기 때문에 AS에 몇 개의 라이브러리가 작성되었습니다. 하나의 동적 언어에서 다른 동적 언어로 번역하는 것은 C를 해독하려고 시도하는 것보다 훨씬 쉬워야합니다.내가 누구의 요구 사항에 당신과 매우 유사 Debuggex의 창조자, http://www.jurjans.lv/flash/RegExp.html

출처

2013-03-26 14:02:04 cleong

해요 : :

여기에서 찾고 하나 개의 링크 아마도 가치가 표시 될 수있는 정보의 양을 최적화.

다음은 Debuggex가 사용하는 파서에서 크게 수정 된 (미리보기 쉬운) 코드 단편입니다. 있는 그대로 작동하지는 않지만 코드 구성을 보여주기위한 것입니다. 대부분의 오류 처리가 제거되었습니다. 따라서 많은 논리들이 간단하지만 장황했다.

recursive descent이 사용됩니다. 이것은 파서에서 한 것입니다. 단 하나의 함수로 병합됩니다. 나는 나의 대략이 문법을 사용 :

Regex -> Alt 
Alt -> Cat ('|' Cat)* 
Cat -> Empty | (Repeat)+ 
Repeat -> Base (('*' | '+' | '?' | CustomRepeatAmount) '?'?) 
Base -> '(' Alt ')' | Charset | Literal 
Charset -> '[' (Char | Range | EscapeSeq)* ']' 
Literal -> Char | EscapeSeq 
CustomRepeatAmount -> '{' Number (',' Number)? '}'

당신은 그냥 정규 표현식에의 자바 스크립트 맛의 특수성에 대해 다루고 있습니다 내 코드를 많이 알 수 있습니다. 자세한 내용은 this reference에서 확인할 수 있습니다. PHP의 경우 this에 필요한 모든 정보가 있습니다. 나는 당신이 파서와 함께 잘 지내고 있다고 생각합니다. 남아있는 모든 것은 나머지 사업자를 구현하고 핵심 사례를 올바르게 구현하는 것입니다.

는 :) 즐기십시오 :

var Parser = function(s) { 
    this.s = s; // This is the regex string. 
    this.k = 0; // This is the index of the character being parsed. 
    this.group = 1; // This is a counter for assigning to capturing groups. 
}; 

// These are convenience methods to make reading and maintaining the code 
// easier. 
// Returns true if there is more string left, false otherwise. 
Parser.prototype.more = function() { 
    return this.k < this.s.length; 
}; 
// Returns the char at the current index. 
Parser.prototype.peek = function() { // exercise 
}; 
// Returns the char at the current index, then advances the index. 
Parser.prototype.next = function() { // exercise 
}; 
// Ensures c is the char at the current index, then advances the index. 
Parser.prototype.eat = function(c) { // exercise 
}; 

// We use a recursive descent parser. 
// This returns the root node of our tree. 
Parser.prototype.parseRe = function() { 
    // It has exactly one child. 
    return new ReTree(this.parseAlt()); 
    // We expect that to be at the end of the string when we finish parsing. 
    // If not, something went wrong. 
    if (this.more()) { 
    throw new Error(); 
    } 
}; 

// This parses several subexpressions divided by |s, and returns a tree 
// with the corresponding trees as children. 
Parser.prototype.parseAlt = function() { 
    var alts = [this.parseCat()]; 
    // Keep parsing as long as a we have more pipes. 
    while (this.more() && this.peek() === '|') { 
    this.next(); 
    // Recursive descent happens here. 
    alts.push(this.parseCat()); 
    } 
    // Here, we allow an AltTree with single children. 
    // Alternatively, we can return the child if there is only one. 
    return new AltTree(alts); 
}; 

// This parses several concatenated repeat-subexpressions, and returns 
// a tree with the corresponding trees as children. 
Parser.prototype.parseCat = function() { 
    var cats = []; 
    // If we reach a pipe or close paren, we stop. This is because that 
    // means we are in a subexpression, and the subexpression is over. 
    while (this.more() && ')|'.indexOf(this.peek()) === -1) { 
    // Recursive descent happens here. 
    cats.push(this.parseRepeat()); 
    } 
    // This is where we choose to handle the empty string case. 
    // It's easiest to handle it here because of the implicit concatenation 
    // operator in our grammar. 
    return (cats.length >= 1) ? new CatTree(cats) : new EmptyTree(); 
}; 

// This parses a single repeat-subexpression, and returns a tree 
// with the child that is being repeated. 
Parser.prototype.parseRepeat = function() { 
    // Recursive descent happens here. 
    var repeat = this.parseBase(); 
    // If we reached the end after parsing the base expression, we just return 
    // it. Likewise if we don't have a repeat operator that follows. 
    if (!this.more() || '*?+{'.indexOf(this.peek()) === -1) { 
    return repeat; 
    } 

    // These are properties that vary with the different repeat operators. 
    // They aren't necessary for parsing, but are used to give meaning to 
    // what was parsed. 
    var min = 0; var max = Infinity; var greedy = true; 
    if (this.peek() === '*') { // exercise 
    } else if (this.peek() === '?') { // exercise 
    } else if (this.peek() === '+') { 
    // For +, we advance the index, and set the minimum to 1, because 
    // a + means we repeat the previous subexpression between 1 and infinity 
    // times. 
    this.next(); min = 1; 
    } else if (this.peek() === '{') { /* challenging exercise */ } 

    if (this.more() && this.peek() === '?') { 
    // By default (in Javascript at least), repetition is greedy. Appending 
    // a ? to a repeat operator makes it reluctant. 
    this.next(); greedy = false; 
    } 
    return new RepeatTree(repeat, {min:min, max:max, greedy:greedy}); 
}; 

// This parses a "base" subexpression. We defined this as being a 
// literal, a character set, or a parnthesized subexpression. 
Parser.prototype.parseBase = function() { 
    var c = this.peek(); 
    // If any of these characters are spotted, something went wrong. 
    // The) should have been eaten by a previous call to parseBase(). 
    // The *, ?, or + should have been eaten by a previous call to parseRepeat(). 
    if (c === ')' || '*?+'.indexOf(c) !== -1) { 
    throw new Error(); 
    } 
    if (c === '(') { 
    // Parse a parenthesized subexpression. This is either a lookahead, 
    // a capturing group, or a non-capturing group. 
    this.next(); // Eat the (. 
    var ret = null; 
    if (this.peek() === '?') { // excercise 
     // Parse lookaheads and non-capturing groups. 
    } else { 
     // This is why the group counter exists. We use it to enumerate the 
     // group appropriately. 
     var group = this.group++; 
     // Recursive descent happens here. Note that this calls parseAlt(), 
     // which is what was initially called by parseRe(), creating 
     // a mutual recursion. This is where the name recursive descent 
     // comes from. 
     ret = new MatchTree(this.parseAlt(), group); 
    } 
    // This MUST be a) or something went wrong. 
    this.eat(')'); 
    return ret; 
    } else if (c === '[') { 
    this.next(); // Eat the [. 
    // Parse a charset. A CharsetTree has no children, but it does contain 
    // (pseudo)chars and ranges, and possibly a negation flag. These are 
    // collectively returned by parseCharset(). 
    // This piece can be structured differently depending on your 
    // implementation of parseCharset() 
    var opts = this.parseCharset(); 
    // This MUST be a ] or something went wrong. 
    this.eat(']'); 
    return new CharsetTree(opts); 
    } else { 
    // Parse a literal. Like a CharsetTree, a LiteralTree doesn't have 
    // children. Instead, it contains a single (pseudo)char. 
    var literal = this.parseLiteral(); 
    return new LiteralTree(literal); 
    } 
}; 

// This parses the inside of a charset and returns all the information 
// necessary to describe that charset. This includes the literals and 
// ranges that are accepted, as well as whether the charset is negated. 
Parser.prototype.parseCharset = function() { 
    // challenging exercise 
}; 

// This parses a single (pseudo)char and returns it for use in a LiteralTree. 
Parser.prototype.parseLiteral = function() { 
    var c = this.next(); 
    if (c === '.' || c === '^' || c === '$') { 
    // These are special chars. Their meaning is different than their 
    // literal symbol, so we set the 'special' flag. 
    return new CharInfo(c, true); 
    } else if (c === '\\') { 
    // If we come across a \, we need to parse the escaped character. 
    // Since parsing escaped characters is similar between literals and 
    // charsets, we extracted it to a separate function. The reason we 
    // pass a flag is because \b has different meanings inside charsets 
    // vs outside them. 
    return this.parseEscaped({inCharset: false}); 
    } 
    // If neither case above was hit, we just return the exact char. 
    return new CharInfo(c); 
}; 

// This parses a single escaped (pseudo)char and returns it for use in 
// either a LiteralTree or a CharsetTree. 
Parser.prototype.parseEscaped = function(opts) { 
    // Here we instantiate some default options 
    opts = opts || {}; 
    inCharset = opts.inCharset || false; 

    var c = peek(); 
    // Here are a bunch of escape sequences that require reading further 
    // into the string. They are all fairly similar. 
    if (c === 'c') { // exercises 
    } else if (c === '0') { 
    } else if (isDigit(c)) { 
    } else if (c === 'x') { 
    } else if (c === 'u') { 
    // Use this as an example for implementing the ones above. 
    // A regex may be used for this portion, but I think this is clearer. 
    // We make sure that there are exactly four hexadecimal digits after 
    // the u. Modify this for the escape sequences that your regex flavor 
    // uses. 
    var r = ''; 
    this.next(); 
    for (var i = 0; i < 4; ++i) { 
     c = peek(); 
     if (!isHexa(c)) { 
     throw new Error(); 
     } 
     r += c; 
     this.next(); 
    } 
    // Return a single CharInfo desite having read multiple characters. 
    // This is why I used "pseudo" previously. 
    return new CharInfo(String.fromCharCode(parseInt(r, 16))); 
    } else { // No special parsing required after the first escaped char. 
    this.next(); 
    if (inCharset && c === 'b') { 
     // Within a charset, \b means backspace 
     return new CharInfo('\b'); 
    } else if (!inCharset && (c === 'b' || c === 'B')) { 
     // Outside a charset, \b is a word boundary (and \B is the complement 
     // of that). We mark it one as special since the character is not 
     // to be taken literally. 
     return new CharInfo('\\' + c, true); 
    } else if (c === 'f') { // these are left as exercises 
    } else if (c === 'n') { 
    } else if (c === 'r') { 
    } else if (c === 't') { 
    } else if (c === 'v') { 
    } else if ('dDsSwW'.indexOf(c) !== -1) { 
    } else { 
     // If we got to here, the character after \ should be taken literally, 
     // so we don't mark it as special. 
     return new CharInfo(c); 
    } 
    } 
}; 

// This represents the smallest meaningful character unit, or pseudochar. 
// For example, an escaped sequence with multiple physical characters is 
// exactly one character when used in CharInfo. 
var CharInfo = function(c, special) { 
    this.c = c; 
    this.special = special || false; 
}; 

// Calling this will return the parse tree for the regex string s. 
var parse = function(s) { return (new Parser(s)).parseRe(); };

출처

2013-03-27 06:17:22

이것은 비슷한 도구를 생각 나게합니다 : [Regexper] (http://www.regexper.com/) – zessx

펄 모듈 YAPE::Regex::Explain 모듈이 아마 아주 쉽게 PHP로 포팅 할 수 있습니다. 여기에 출력

C:\>perl -e "use YAPE::Regex::Explain;print YAPE::Regex::Explain->new(qr/['-])->explain;" 
The regular expression: 

(?-imsx:['-]) 

matches as follows: 

NODE      EXPLANATION 
---------------------------------------------------------------------- 
(?-imsx:     group, but do not capture (case-sensitive) 
         (with^and $ matching normally) (with . not 
         matching \n) (matching whitespace and # 
         normally): 
---------------------------------------------------------------------- 
    ['-]      any character of: ''', '-' 
---------------------------------------------------------------------- 
)      end of grouping 
---------------------------------------------------------------------- 



C:\>perl -e "use YAPE::Regex::Explain; print YAPE::Regex::Explain->new(qr/(\w+), ?(.)/)->explain;" 
The regular expression: 

(?-imsx:(\w+), ?(.)) 

matches as follows: 

NODE      EXPLANATION 
---------------------------------------------------------------------- 
(?-imsx:     group, but do not capture (case-sensitive) 
         (with^and $ matching normally) (with . not 
         matching \n) (matching whitespace and # 
         normally): 
---------------------------------------------------------------------- 
    (      group and capture to \1: 
---------------------------------------------------------------------- 
    \w+      word characters (a-z, A-Z, 0-9, _) (1 or 
          more times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)      end of \1 
---------------------------------------------------------------------- 
    ,      ',' 
---------------------------------------------------------------------- 
    ?      ' ' (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    (      group and capture to \2: 
---------------------------------------------------------------------- 
    .      any character except \n 
---------------------------------------------------------------------- 
)      end of \2 
---------------------------------------------------------------------- 
)      end of grouping 
---------------------------------------------------------------------- 

C:\>

당신은 the source code을보고 빠르게 구현을 볼 수의 예입니다.

출처

2013-03-28 19:17:39 bubba

답변

관련 문제