How do I write this regular expression?


<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>

I want to extract "ADHE 313" and "Organization of Adult Basic Education Programs".


18 replies 2015-05-29 10:45:43 +08:00

asj

May 28, 2015

Shouldn't this be done with a CSS/jQuery selector or XPath instead?

phx13ye

May 28, 2015

<\/a>(.*)(.*?)<\/b>

sicongliu

May 28, 2015

XPath is simpler, but I'd like to learn how to do it with a regex.

shoumu

May 28, 2015

Take a look at pyquery; it supports jQuery syntax.

professorz

May 28, 2015

.+<\\/a>(.+)(6)(.+)<\\/b>.+
That's a regex for Java.

sicongliu

May 28, 2015

How would I write it in Python?

yiyiwa

May 28, 2015

I tested this in Python. It isn't perfect: it also returns some empty matches.

'\>([^\<]*)\<'
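The empty matches come from the `*` quantifier, which also matches the zero-length gap between adjacent tags such as `<dt><a`. A minimal sketch, assuming the HTML line from the question is held in a variable named `s` (a name chosen here for illustration), compares `*` with `+`:

```python
import re

# The line from the question, stored in a hypothetical variable `s`.
s = '<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>'

# `*` allows zero characters between ">" and "<", so adjacent tags
# produce empty strings in the result.
with_star = re.findall(r'>([^<]*)<', s)

# `+` requires at least one character, which drops the empty matches.
with_plus = re.findall(r'>([^<]+)<', s)

print(with_star)
print(with_plus)
```

This still returns the whole text between the tags as one string, so the course code and the title would need to be separated in a further step.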

sicongliu

May 28, 2015

m=re.search("</a>(.*?)\s(",text)
print (m.group(1))

m=re.search("(.*?)(",text)
print (m.group(1))

sicongliu

May 28, 2015

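What if I want to get "ADHE 313"?
How do I locate the second space? Of course this is easy with string searching and slicing; I just want to know how to do it with a regex.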

sicongliu

May 28, 2015

m=re.search("</a>(.*?)\s+\(",text)
print (m.group(1))

Admittedly this approach is clumsy: if the second space is not followed by "(", it won't work.
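One way to stop at the second space without relying on the "(" that follows it is to describe the code itself as two whitespace-separated tokens. This is only a sketch, assuming the course code always has the shape "word, whitespace, word" and that the line is stored in `text` as in the snippet above:

```python
import re

text = '<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>'

# \S+ matches a run of non-whitespace characters, so "\S+\s+\S+" ends
# right before the second run of whitespace, whatever follows it.
m = re.search(r"</a>(\S+\s+\S+)", text)
code = m.group(1) if m else None
print(code)
```

If some course codes contain more or fewer than two tokens, this pattern would need to be adjusted.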

asj

May 28, 2015

I wrote a quick one; it's still far from complete.
(?:<dt.*?>)(?:.*?\/.*?>)([\w ]*)(?:.*?)(?:<\/dt>)

http://regexr.com/3b3bs

May 28, 2015

This task is much simpler without a regex.

page.xpath("//dt/text()") -> ADHE 313 (6)
page.xpath("//dt/b/text()") -> Organization of Adult Basic Education Programs
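The `page` object above is not defined in the thread; it presumably comes from a library such as lxml. As a rough standard-library sketch of the same parser-based idea, `xml.etree.ElementTree` can read the line exactly as it appears in the question, assuming the fragment is well-formed XML (real HTML pages often are not and would need an HTML parser). The variable names are chosen here for illustration:

```python
import xml.etree.ElementTree as ET

snippet = '<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>'

dt = ET.fromstring(snippet)

# The visible text follows the empty <a> element, so ElementTree stores it
# as that element's "tail"; itertext() collects it along with any other text.
full_text = "".join(dt.itertext()).strip()

# Split around the parenthesised number using plain string methods.
code, _, rest = full_text.partition(" (")
title = rest.partition(") ")[2]

print(code)
print(title)
```

The parser handles the tags, and the remaining split is a plain string operation rather than a regex.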

picasso250

May 28, 2015

/a>([\w ()]+)([\w ]+)
This is the simplest way to solve your current problem.

picasso250

May 28, 2015

Sorry, the previous one was wrong; it also captured the (6).

/a>(\w+ \d+).+?([\w ]+)

leozy2014

May 28, 2015

print re.findall('</a>(.*?) \(6\) (.*?)</dt>', s)
#[('ADHE 313', 'Organization of Adult Basic Education Programs')]

wmttom

May 28, 2015

Python regex: (?<=>)[\w, ,\(,\)]+?(?= \(|<)

re.findall("(?<=>)[\w, ,\(,\)]+?(?= \(|<)", '<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>')

['ADHE 313', 'Organization of Adult Basic Education Programs']

sicongliu

May 29, 2015

Neither of the two replies above seems to work.

sicongliu

May 29, 2015

Sorry, this one works:

print re.findall('</a>(.*?) \(6\) (.*?)</dt>', s)
#[('ADHE 313', 'Organization of Adult Basic Education Programs')]
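The pattern above hard-codes `\(6\)`, so it only matches entries whose parenthesised number is 6. A hedged Python 3 variant, assuming that number is always one or more digits, replaces it with `\((\d+)\)` and captures it as well:

```python
import re

s = '<dt><a name="313"></a>ADHE 313 (6) Organization of Adult Basic Education Programs</dt>'

# Group 1: course code, group 2: the number in parentheses, group 3: title.
pattern = re.compile(r"</a>(.*?) \((\d+)\) (.*?)</dt>")

matches = pattern.findall(s)
print(matches)
```

Run over a whole page, `findall` would return one tuple per `<dt>` line that follows this layout.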