Justin's Words

Zero-Width Assertions in Regular Expressions

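Today I downloaded a subtitle file, but it had far too many line breaks: even a short, simple sentence was split across two lines. So I decided to remove all of those line breaks. Here is a fragment of the subtitle file: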

2
00:01:15,452 --> 00:01:18,843
This is Zero One Alpha. We
have secured the Falcon.
3
00:01:18,868 --> 00:01:21,476
I say again,We have
secured the Falcon.

4
00:01:22,952 --> 00:01:26,452
I will count from one to ten. Within
that you'll tell me what I wanna know.

5
00:01:27,052 --> 00:01:28,752
Otherwise,

6
00:01:29,177 --> 00:01:32,152
<span style="color: #00ff00;">the number ten is
the last thing you'll hear.</span>

The result I want looks like this:

2
00:01:15,452 --> 00:01:18,843
This is Zero One Alpha. We have secured the Falcon.

3
00:01:18,868 --> 00:01:21,476
I say again,We have secured the Falcon.

4
00:01:22,952 --> 00:01:26,452
I will count from one to ten. Within that you'll tell me what I wanna know.

5
00:01:27,052 --> 00:01:28,752
Otherwise,

6
00:01:29,177 --> 00:01:32,152
<span style="color: #00ff00;">the number ten is the last thing you'll hear.</span>

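The subtitle file has more than 6,000 lines, so fixing it by hand is out of the question. I decided to solve it with a program, using Python.

First, define a string variable: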

str = '''
2
00:01:15,452 --> 00:01:18,843
This is Zero One Alpha. We
have secured the Falcon.

3
00:01:18,868 --> 00:01:21,476
I say again,We have
secured the Falcon.

4
00:01:22,952 --> 00:01:26,452
I will count from one to ten. Within
that you'll tell me what I wanna know.

5
00:01:27,052 --> 00:01:28,752
Otherwise,

6
00:01:29,177 --> 00:01:32,152
<font color="#00ff00">the number ten is
the last thing you'll hear.</font>
'''

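The question now is how to match a newline that sits between two letters while matching only the \n itself, not the letters on either side of it.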
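To see why the letters must stay out of the match, here is a minimal sketch of the naive approach. The `sample` string is a made-up fragment for illustration, not taken from the real file.

```python
import re

# Hypothetical two-line fragment, similar to the subtitle text above.
sample = "We\nhave secured"

# Naive pattern: the letters on both sides are part of the match,
# so they get replaced together with the newline.
naive = re.sub(r'[a-zA-Z]\n[a-zA-Z]', ' ', sample)
print(repr(naive))  # 'W ave secured'
```

The "e" and the "h" are swallowed along with the newline, which is exactly what we want to avoid.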


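Zero-width assertions (lookarounds) only check what surrounds a position and do not consume any characters. There are four of them:

  • (?=exp) positive lookahead: matches a position that is followed by exp
  • (?<=exp) positive lookbehind: matches a position that is preceded by exp
  • (?!exp) negative lookahead: matches a position that is not followed by exp
  • (?<!exp) negative lookbehind: matches a position that is not preceded by exp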
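As a rough illustration of the four assertions, the sketch below runs each one against a small made-up string with Python's `re.findall`. The `text` value and the variable names are my own, chosen only for this example.

```python
import re

# Hypothetical sample text for illustration.
text = "$100, 200, $300"

# (?=exp) positive lookahead: digits that are followed by a comma
followed_by_comma = re.findall(r'\d+(?=,)', text)

# (?<=exp) positive lookbehind: digits that are preceded by "$"
after_dollar = re.findall(r'(?<=\$)\d+', text)

# (?!exp) negative lookahead: whole numbers that are not followed by a comma
not_followed_by_comma = re.findall(r'\b\d+\b(?!,)', text)

# (?<!exp) negative lookbehind: whole numbers that are not preceded by "$"
not_after_dollar = re.findall(r'(?<!\$)\b\d+', text)

print(followed_by_comma)      # ['100', '200']
print(after_dollar)           # ['100', '300']
print(not_followed_by_comma)  # ['300']
print(not_after_dollar)       # ['200']
```

The `\b` word boundaries in the two negative cases stop the engine from backtracking into the middle of a number. Without them, `\d+(?!,)` would happily match "10" out of "100,".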

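Note: JavaScript only gained lookbehind support in ES2018, so older engines do not support it.
For example, to get the content between a pair of tags, as in <p>Hello world</p>, you can use (?<=<p>)(.*)(?=</p>).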
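Here is a small sketch of the tag example in Python. The second input with two paragraphs is my own addition, to show that a non-greedy `.*?` is the safer choice when a tag appears more than once.

```python
import re

html = "<p>Hello world</p>"

# The lookbehind and lookahead only check for the tags,
# so the match itself contains just the text in between.
match = re.search(r'(?<=<p>)(.*)(?=</p>)', html)
inner = match.group(0)
print(inner)  # Hello world

# Hypothetical input with two paragraphs: use the non-greedy .*?
# so that each paragraph is matched separately.
two_paragraphs = "<p>a</p><p>b</p>"
parts = re.findall(r'(?<=<p>).*?(?=</p>)', two_paragraphs)
print(parts)  # ['a', 'b']
```

This is fine for quick one-off jobs like this one. For real HTML a proper parser is the more robust tool.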


Now back to the subtitle problem. First, write the regex pattern:

import re

pattern = re.compile(r'(?<=[a-zA-Z])(\n)(?=[a-zA-Z])')

Next, use re.sub() to do the replacement:

convertStr = re.sub(pattern, ' ', str)
print(convertStr)
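One might wonder whether capture groups and backreferences would do the same job. The sketch below compares the two approaches on a deliberately tiny, made-up input. It is not part of the original solution, just a way to show what "zero-width" buys us.

```python
import re

# Hypothetical edge case: single-letter lines, so matches would overlap.
sample = "a\nb\nc"

# Lookaround: the letters are only inspected, so every newline is matched.
with_lookaround = re.sub(r'(?<=[a-zA-Z])\n(?=[a-zA-Z])', ' ', sample)

# Capture groups: the letters are consumed and put back via \1 and \2.
# Once "a\nb" has been matched, "b" cannot start the next match.
with_groups = re.sub(r'([a-zA-Z])\n([a-zA-Z])', r'\1 \2', sample)

print(repr(with_lookaround))  # 'a b c'
print(repr(with_groups))      # 'a b\nc'
```

The capture-group version works for most real subtitle lines, but it misses the second newline here because the letter it needs was already consumed by the previous match.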

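The same approach works for the whole subtitle file: open the file, read its contents, run the replacement, and write the result to a new file (not modifying the original file in place is a good habit). The complete code is as follows: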

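import re

# Match a newline that follows a letter, comma, or period and precedes a letter.
pattern = re.compile(r'(?<=[a-zA-Z,.])(\n)(?=[a-zA-Z])')

# Read the whole subtitle file into memory.
with open('foo.srt', 'r') as f:
    content = f.read()

convertContent = re.sub(pattern, ' ', content)

# Write the result to a new file and leave the original untouched.
with open('bar.srt', 'w') as f:
    f.write(convertContent)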
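Note that the lookbehind in the full script is `[a-zA-Z,.]` rather than the earlier `[a-zA-Z]`. The sketch below shows the difference on a made-up fragment where the first line ends with a period. The sample text and variable names are assumptions for illustration only.

```python
import re

# Hypothetical fragment: the line break falls right after a period.
sample = "This is Zero One Alpha.\nWe have secured the Falcon.\n\n3\n"

letters_only = re.compile(r'(?<=[a-zA-Z])\n(?=[a-zA-Z])')
with_punctuation = re.compile(r'(?<=[a-zA-Z,.])\n(?=[a-zA-Z])')

# The newline is preceded by ".", not a letter, so nothing is joined.
unchanged = letters_only.sub(' ', sample)

# With "," and "." allowed in the lookbehind, the two lines are joined.
joined = with_punctuation.sub(' ', sample)

print(repr(unchanged))
print(repr(joined))
```

Lines that end in other characters, such as "?", "!", or a digit, are still left alone by this pattern, so the character class may need extending for other files.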

References: