16 posts in 'Miscellany'

2008.04.09 A Tao of Regular Expressions
2008.03.26 After Attending TmaxDay 2008
2008.01.21 What American accent do you have?
2007.12.31 Remote Desktop port override
2007.12.31 Striping to Raid1+0 Migration
2007.08.13 Init()

A Tao of Regular Expressions

Miscellany 2008. 4. 9. 18:49

A Tao of Regular Expressions

Steve Mansour
sman@scruznet.com
Revised: June 5, 1999 (copied by jm /at/ jmason.org from http://www.scruz.net/%7esman/regexp.htm, after the original disappeared! )

C O N T E N T S

What Are Regular Expressions
Examples
   Simple
   Medium (Strange Incantations)
   Hard (Magical Hieroglyphics)
Regular Expressions In Various Tools

What Are Regular Expressions

A regular expression is a formula for matching strings that follow some pattern. Many people are afraid to use them because they can look confusing and complicated. Unfortunately, nothing in this write up can change that. However, I have found that with a bit of practice, it's pretty easy to write these complicated expressions. Plus, once you get the hang of them, you can reduce hours of laborious and error-prone text editing down to minutes or seconds. Regular expressions are supported by many text editors, class libraries such as Rogue Wave's Tools.h++, scripting tools such as awk, grep, sed, and increasingly in interactive development environments such as Microsoft's Visual C++.

Regular expression usage is explained by examples in the sections that follow. Most examples are presented as vi substitution commands or as grep file search commands, but they are representative, and the concepts apply equally to tools such as sed, awk, perl and other programs that support regular expressions. Have a look at Regular Expressions In Various Tools for examples of regular expression usage in other tools. A short explanation of vi's substitution command and syntax is provided at the end of this document.

Regular Expression Basics

Regular expressions are made up of normal characters and metacharacters. Normal characters include upper and lower case letters and digits. The metacharacters have special meanings and are described in detail below.

In the simplest case, a regular expression looks like a standard search string. For example, the regular expression "testing" contains no metacharacters. It will match "testing" and "123testing" but it will not match "Testing".

To really make good use of regular expressions it is critical to understand metacharacters. The table below lists metacharacters and a short explanation of their meaning.

*Metacharacter*		*Description*

`.`		Matches any single character. For example the regular expression `r.t` would match the strings rat, rut, r t, but not root.
`$`		Matches the end of a line. For example, the regular expression `weasel$` would match the end of the string "He's a weasel" but not the string "They are a bunch of weasels."
`^`		Matches the beginning of a line. For example, the regular expression `^When in` would match the beginning of the string "When in the course of human events" but would not match "What and When in the".
`*`		Matches zero or more occurrences of the character immediately preceding. For example, the regular expression `.*` means match any number of any characters.
`\`		This is the quoting character; use it to treat the following character as an ordinary character. For example, `\$` is used to match the dollar sign character ($) rather than the end of a line. Similarly, the expression `\.` is used to match the period character rather than any single character.
`[ ]` `[c1-c2]` `[^c1-c2]`		Matches any one of the characters between the brackets. For example, the regular expression `r[aou]t` matches rat, rot, and rut, but not ret. Ranges of characters can be specified by using a hyphen. For example, the regular expression `[0-9]` means match any digit. Multiple ranges can be specified as well. The regular expression `[A-Za-z]` means match any upper or lower case letter. To match any character except those in the range, the complement range, use the caret as the first character after the opening bracket. For example, the expression `[^269A-Z]` will match any character except 2, 6, 9, and upper case letters.
`\< \>`		Matches the beginning (`\<`) or end (`\>`) of a word. For example, `\<the` matches on "the" in the string "for the wise" but does not match "the" in "otherwise". NOTE: this metacharacter is not supported by all applications.
`\( \)`		Treat the expression between `\(` and `\)` as a group. Also, saves the characters matched by the expression into temporary holding areas. Up to nine pattern matches can be saved in a single regular expression. They can be referenced as `\1` through `\9`.
`|`		Or two conditions together. For example `(him|her)` matches the line "it belongs to him" and matches the line "it belongs to her" but does not match the line "it belongs to them." NOTE: this metacharacter is not supported by all applications.
`+`		Matches one or more occurrences of the character or regular expression immediately preceding. For example, the regular expression `9+` matches 9, 99, 999. NOTE: this metacharacter is not supported by all applications.
`?`		Matches 0 or 1 occurrence of the character or regular expression immediately preceding. NOTE: this metacharacter is not supported by all applications.
`\{`i`\}` `\{`i`,`j`\}`		Match a specific number of instances or instances within a range of the preceding character. For example, the expression `A[0-9]\{3\}` will match "A" followed by exactly 3 digits. That is, it will match A123 but not A1234. The expression `[0-9]\{4,6\}` matches any sequence of 4, 5, or 6 digits. NOTE: this metacharacter is not supported by all applications.

The simplest metacharacter is the dot. It matches any one character (excluding the newline character). Consider a file named test.txt consisting of the following lines:

he is a rat

he is in a rut

the food is Rotten

I like root beer

We can use grep to test our regular expressions. Grep uses the regular expression we supply and tries to match it to every line of the file. It prints all lines where the regular expression matches at least one sequence of characters on a line. The command

grep r.t test.txt

searches for the regular expression r.t in each line of test.txt and prints the matching lines. The regular expression r.t matches an r followed by any character followed by a t. It will match rat and rut. It does not match the Rot in Rotten because regular expressions are case sensitive. To match both upper and lower case, the square brackets (character range metacharacters) can be used. The regular expression [Rr] matches either R or r. So, to match an upper or lower case r followed by any character followed by the character t, the regular expression [Rr].t will do the trick.
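
The grep run above can be approximated in a few lines of Python. This is only a sketch: the `grep` helper and the `LINES` list are hypothetical names standing in for the grep program and for test.txt, and Python's `re` module is assumed to treat these simple patterns the same way grep does.

```python
import re

# The four lines of test.txt from the text above.
LINES = [
    "he is a rat",
    "he is in a rut",
    "the food is Rotten",
    "I like root beer",
]

def grep(pattern, lines):
    """Return the lines where pattern matches at least once, as grep would print them."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

lowercase_only = grep(r"r.t", LINES)    # finds rat and rut only
either_case = grep(r"[Rr].t", LINES)    # also finds Rot in Rotten

print(lowercase_only)
print(either_case)
```

Neither pattern picks up "I like root beer", because the dot stands for exactly one character and root has two letters between r and t.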

To match characters at the beginning of a line use the circumflex character (sometimes called a caret). For example, to find the lines containing the word "he" at the beginning of each line in the file test.txt you might first think to use the simple expression he. However, this would also match the "he" inside "the" in the third line. The regular expression ^he only matches he at the beginning of a line.

Sometimes it is easier to indicate what should not be matched rather than all the cases that should be matched. When the circumflex is the first character between the square brackets it means to match any character which is not in the range. For example, to match he when it is not preceded by t or s, the following regular expression can be used: [^st]he.
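
Both uses of the circumflex can be tried out in Python. This is a sketch under assumptions: `LINES` repeats the test.txt lines, and `WORDS` is a made-up word list. One detail worth seeing in action is that `[^st]he` needs some character in front of he, so it never matches he at the very start of a string.

```python
import re

LINES = [
    "he is a rat",
    "he is in a rut",
    "the food is Rotten",
    "I like root beer",
]

# Unanchored: also matches the "he" inside "the" on the third line.
anywhere = [line for line in LINES if re.search(r"he", line)]

# Anchored: only lines that start with "he".
at_start = [line for line in LINES if re.search(r"^he", line)]

# Complemented range: "he" preceded by any character other than s or t.
WORDS = ["the", "she", "ahead", "when"]
not_after_s_or_t = [word for word in WORDS if re.search(r"[^st]he", word)]

print(anywhere)
print(at_start)
print(not_after_s_or_t)
```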

Several character ranges can be specified between the square brackets. For example, the regular expression [A-Za-z] matches any letter in the alphabet, upper or lower case. The regular expression [A-Za-z][A-Za-z]* matches a letter followed by zero or more letters. We can use the + metacharacter to do the same thing. That is, the regular expression [A-Za-z]+ means the same thing as [A-Za-z][A-Za-z]*. Note that the + metacharacter is not supported by all programs that have regular expressions. See Regular Expressions Syntax Support for more details.

To specify the number of occurrences matched, use the braces (they must be escaped with a backslash). As an example, to match all instances of 100 and 1000 but not 10 or 10000 use the following: 10\{2,3\}. This regular expression matches the digit 1 followed by either 2 or 3 0's. A useful variation is to omit the second number. For example, the regular expression 0\{3,\} will match 3 or more successive 0's.
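
A short Python sketch of the brace quantifier. Python writes the braces without backslashes, so `10\{2,3\}` becomes `10{2,3}`. The candidate strings are invented for illustration, and `fullmatch` is used so that each string is tested as a whole; a plain search would find 1000 inside 10000.

```python
import re

bounded = re.compile(r"10{2,3}")      # 1 followed by two or three 0's
candidates = ["10", "100", "1000", "10000"]
whole_matches = [s for s in candidates if bounded.fullmatch(s)]

open_ended = re.compile(r"0{3,}")     # three or more successive 0's
zero_runs = open_ended.findall("10 100 1000 100000")

print(whole_matches)
print(zero_runs)
```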

Simple Examples

Here are a few representative, simple examples.

*vi command*	*What it does*

`:%s/  */ /g`	Change 1 or more spaces into a single space.
`:%s/ *$//`	Remove all spaces from the end of the line.
`:%s/^/ /`	Insert a space at the beginning of every line.
`:%s/^[0-9][0-9]* //`	Remove all numbers at the beginning of a line.
`:%s/b[aeio]g/bug/g`	Change all occurrences of bag, beg, big, and bog to bug.
`:%s/t\([aou]\)g/h\1t/g`	Change all occurrences of tag, tog, and tug to hat, hot, and hut respectively.
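
For readers without vi at hand, the substitutions in this table can be approximated with Python's `re.sub`. This is a sketch, not a drop-in replacement: the sample strings are invented, and Python writes a group with plain parentheses where vi uses backslashed ones.

```python
import re

# One or more spaces become a single space.
squeezed = re.sub(r"  *", " ", "a   b  c")

# Spaces at the end of the line are removed.
trimmed = re.sub(r" *$", "", "end   ")

# A number (and the space after it) at the beginning of the line is removed.
unnumbered = re.sub(r"^[0-9][0-9]* ", "", "42 apples")

# bag, beg, big, and bog all become bug.
bugs = re.sub(r"b[aeio]g", "bug", "bag beg big bog")

# The vowel is captured and reused in the replacement as \1.
hats = re.sub(r"t([aou])g", r"h\1t", "tag tog tug")

print(squeezed, trimmed, unnumbered, bugs, hats, sep="\n")
```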

Medium Examples (Strange Incantations)

Example 1

Change all instances of foo(a,b,c) to foo(b,a,c), where a, b, and c can be any parameters supplied to foo(). That is, we must be able to make changes like the following:

Before		After
`foo(10,7,2)`		`foo(7,10,2)`
`foo(x+13,y-2,10)`		`foo(y-2,x+13,10)`
`foo( bar(8), x+y+z, 5)`		`foo( x+y+z, bar(8), 5)`

The following substitution command will do the trick :

:%s/foo(\([^,]*\),\([^,]*\),\([^)]*\))/foo(\2,\1,\3)/g

Now, let's break this apart and analyze what's happening. The idea behind this expression is to identify invocations of foo() with three parameters between the parentheses. The first parameter is identified by the regular expression \([^,]*\), which we can analyze from the inside out.

`[^,]`		means any character which is not a comma
`[^,]*`		means 0 or more characters which are not commas
`\([^,]*\)`		tags the non-comma characters as `\1` for use in the replacement part of the command
`\([^,]*\),`		means that we must match 0 or more non-comma characters which are followed by a comma. The non-comma characters are tagged.

This is a good time to point out one of the most common problems people have with regular expressions. Why would we use an expression like [^,]*, instead of something more straightforward like .*, to match the first parameter? Consider applying the pattern .*, to the string "10,7,2". Should it match "10," or "10,7," ? To resolve this ambiguity, regular expressions will always match the longest string possible. In this case "10,7," which covers two parameters instead of one parameter like we want. So, by using the expression [^,]*, we force the pattern to match all characters up to the first comma.
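
The longest-match behaviour described above is easy to see in Python, whose `re` module is likewise greedy by default. This is a small sketch on the same "10,7,2" string; the variable names are illustrative only.

```python
import re

text = "10,7,2"

# .* grabs as much as it can, so the match runs to the last comma.
greedy = re.match(r".*,", text).group(0)

# [^,]* cannot cross a comma, so the match stops at the first one.
bounded = re.match(r"[^,]*,", text).group(0)

print(greedy)
print(bounded)
```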

The expression up to this point is: foo(\([^,]*\), and can be roughly translated as "after you find foo( tag all characters up to the next comma as \1". We tag the second parameter just like the first and it can be referenced as \2. The tag used on the third parameter is exactly like the others except that we search for all characters up to the right parenthesis. It may be superfluous to search for the last parameter since we don't have to move it. But this pattern guarantees that we update only those instances of foo() where 3 parameters are specified. In these times of function and method overloading, being explicit often proves to be useful. In the substitution portion of the command, we explicitly enter the invocation of foo() as we want it, referencing the matched patterns in the new order where the first and second parameter have been switched.
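
The same swap can be sketched in Python. The escaping is reversed compared with vi: in Python the literal parentheses of foo( ) need backslashes, while the grouping parentheses do not. `SWAP` and `swap_first_two` are hypothetical names chosen for this illustration.

```python
import re

# Literal "foo(", three tagged parameters, literal ")".
SWAP = re.compile(r"foo\(([^,]*),([^,]*),([^)]*)\)")

def swap_first_two(line):
    """Rewrite foo(a,b,c) as foo(b,a,c) everywhere on the line."""
    return SWAP.sub(r"foo(\2,\1,\3)", line)

for before in ["foo(10,7,2)", "foo(x+13,y-2,10)", "foo( bar(8), x+y+z, 5)"]:
    print(before, "->", swap_first_two(before))
```

A call with fewer than three parameters, such as foo(1,2), is left untouched because the pattern requires two commas before the closing parenthesis.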

Example 2

We have a CSV (comma separated value) file with information we need, but in the wrong format. The columns of data are currently arranged in the following order: Name, Company Name, State, Postal Code. We need to reorganize the data into the following order in order to use it with a particular piece of software: Name, State-Postal Code, Company Name. This means that we must change the order of the columns in addition to merging two columns to form a new column value. The particular piece of software that needs this data will not work if there are any whitespace characters (spaces or tabs) before or after the commas. So we must remove whitespace around the commas.

Here are a few lines from the data we have:

Bill Jones, HI-TEK Corporation , CA, 95011

Sharon Lee Smith, Design Works Incorporated, CA, 95012

B. Amos , Hill Street Cafe, CA, 95013

Alexander Weatherworth, The Crafts Store, CA, 95014

...

We need to transform them to look like this:

Bill Jones,CA 95011,HI-TEK Corporation

Sharon Lee Smith,CA 95012,Design Works Incorporated

B. Amos,CA 95013,Hill Street Cafe

Alexander Weatherworth,CA 95014,The Crafts Store

...

We'll look at two regular expressions to solve this problem. The first moves the columns around and merges the data. The second removes the excess spaces.

Here is the first pass at a substitution command that will solve the problem:

:%s/\([^,]*\),\([^,]*\),\([^,]*\),\(.*\)/\1,\3 \4,\2/

The approach is similar to that of Example 1. The Name is matched by the expression \([^,]*\), that is, all characters up to the first comma. The name can then be referenced as \1 in the replacement pattern. The Company Name and State fields are matched just like the Name field and are referenced as \2 and \3 in the replacement pattern. The last field is matched with the expression \(.*\), which can be translated as "match all characters through the end of the line". The replacement pattern is constructed by calling out each tagged expression in the appropriate order and adding or not adding the delimiter.

The following substitution command will remove the excess spaces:

:%s/[ \t]*,[ \t]*/,/g

To break it down: [ \t] matches a space or tab character; [ \t]* matches 0 or more spaces or tabs; [ \t]*, matches 0 or more spaces or tabs followed by a comma; and finally [ \t]*,[ \t]* matches 0 or more spaces or tabs followed by a comma followed by 0 or more spaces or tabs. In the replacement pattern, we simply replace whatever we matched with a single comma. The optional g parameter is added to the end of the substitution command to apply the substitution to all commas in the line.
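
Here is a Python sketch of the two passes. One deliberate difference from the order given in the text: this version runs the whitespace cleanup first, because when the reorder runs first the State and Postal Code groups capture their leading spaces and carry them into the merged column. The function name `reformat_row` is hypothetical.

```python
import re

ROWS = [
    "Bill Jones, HI-TEK Corporation , CA, 95011",
    "B. Amos , Hill Street Cafe, CA, 95013",
]

def reformat_row(row):
    """Strip whitespace around commas, then reorder and merge the columns."""
    # Pass 1: any spaces or tabs around a comma collapse to the comma alone.
    row = re.sub(r"[ \t]*,[ \t]*", ",", row)
    # Pass 2: Name, Company, State, Postal -> Name, "State Postal", Company.
    return re.sub(r"([^,]*),([^,]*),([^,]*),(.*)", r"\1,\3 \4,\2", row)

for row in ROWS:
    print(reformat_row(row))
```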

Example 3

Suppose you have a multi-character sequence that repeats. For example, consider the following:

Billy tried really hard
Sally tried really really hard
Timmy tried really really really hard
Johnny tried really really really really hard

Now suppose you want to change "really", "really really", and any number of consecutive "really" strings to a single word: "very". The command

:%s/\(really \)\(really \)*/very /

changes the text above to:

Billy tried very hard
Sally tried very hard
Timmy tried very hard
Johnny tried very hard

The expression \(really \)* matches 0 or more sequences of "really ". The sequence \(really \)\(really \)* matches one or more instances of the sequence "really ".
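
In Python the same idea looks like this. It is a sketch: the group-then-starred-group form mirrors the vi command, and the `+` form is shown next to it as the shorter spelling where that metacharacter is available.

```python
import re

LINES = [
    "Billy tried really hard",
    "Timmy tried really really really hard",
]

# One "really " followed by zero or more further "really " sequences.
spelled_out = [re.sub(r"(really )(really )*", "very ", line) for line in LINES]

# The same thing using + on the group.
with_plus = [re.sub(r"(really )+", "very ", line) for line in LINES]

print(spelled_out)
print(with_plus)
```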

Hard Examples (Magical Hieroglyphics)

coming soon.

Regular Expressions In Various Tools

OK, you'd like to use regular expressions, but you can't bring yourself to use vi. Here, then, are a few examples of how to use regular expressions in other tools. Also, I have attempted to summarize the differences in regular expressions you will find between different programs.

You can use regular expressions in the Visual C++ editor. Select Edit->Replace, then be sure to check the checkbox labeled "Regular expression". For vi expressions of the form :%s/pat1/pat2/g, set the Find What field to pat1 and the Replace with field to pat2. To simulate the range (% in this case) and the g option, you will have to use the Replace All button or appropriate combinations of Find Next and Replace.

sed

Sed is a Stream EDitor which can be used to make changes to files or pipes. For complete details, see the man page sed(1).

Here are a few interesting sed scripts. Assume that we're processing a file called price.txt. Note that the edits don't actually happen to the input file; sed simply processes each line of the file with the command you supply and echoes the result to its standard output.

*sed script*		*Description*

`sed '/^$/d' price.txt`		removes all empty lines
`sed '/^[ \t]*$/d' price.txt`		removes all lines containing only whitespace
`sed 's/"//g' price.txt`		removes all quotation marks
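
The three sed scripts can be mimicked in Python to see what each one keeps. This is a sketch: `TEXT` is an invented stand-in for the contents of price.txt, and list comprehensions play the part of sed's line-by-line loop.

```python
import re

# Invented sample: a data line, an empty line, a whitespace-only line, a data line.
TEXT = 'widget "A"\n\n  \t\ngadget "B"\n'
lines = TEXT.splitlines()

# Like /^$/d: drop lines that are completely empty.
no_empty = [line for line in lines if not re.search(r"^$", line)]

# Like /^[ \t]*$/d: drop lines holding nothing but spaces and tabs.
no_blank = [line for line in lines if not re.search(r"^[ \t]*$", line)]

# Like s/"//g: delete every quotation mark, keep every line.
no_quotes = [re.sub(r'"', "", line) for line in lines]

print(no_empty)
print(no_blank)
print(no_quotes)
```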

awk

Awk is a programming language which can be used to perform sophisticated analysis and manipulation of text data. For complete details, see the man page awk(1). Its peculiar name is an acronym made up of the first character of its authors' last names (Aho, Weinberger, and Kernighan).

There are many good awk examples in the book The AWK Programming Language (written by Aho, Weinberger, and Kernighan). Please don't form any broad opinions about awk's capabilities based on the following trivial sample scripts. For purposes of these examples, assume that we're working with a file called price.txt. As with sed, awk simply echoes its output to its standard output.

*awk script*		*Description*

`awk '$0 !~ /^$/' price.txt`		removes all empty lines
`awk 'NF > 0' price.txt`		a better way to remove all empty lines in awk
`awk '$2 ~ /^[JT]/ {print $3}' price.txt`		print the third field of all lines whose second field begins with 'J' or 'T'
`awk '$2 !~ /[Mm]isc/ {print $3 + $4}' price.txt`		for all lines whose second field does not contain 'Misc' or 'misc' print the sum of columns 3 and 4 (assumed to be numbers).
`awk '$3 !~ /^[0-9]+\.[0-9]*$/ {print $0}' price.txt`		print all lines where field 3 is not a number. The number must be of the form: `d.d` or `d.` where `d` is any number of digits from 0 to 9.
`awk '$2 ~ /John|Fred/ {print $0}' price.txt`		print the entire line if the second field contains 'John' or 'Fred'
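
The number test from the table can be checked in isolation with Python. This is a sketch with invented field values; it only exercises the regular expression, not awk's field splitting.

```python
import re

# Digits, a period, then optional digits: the d.d and d. forms.
NUMBER = re.compile(r"^[0-9]+\.[0-9]*$")

FIELDS = ["12.50", "7.", ".5", "abc", "12"]
not_numbers = [field for field in FIELDS if not NUMBER.search(field)]

print(not_numbers)
```

Note that a plain integer such as 12 is reported as "not a number" by this pattern, because the period is required.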

grep

grep is a program used to match regular expressions in one or more specified files or in an input stream. For complete details, see the man page grep(1). Its peculiar name stems from its roots as a command in vi, g/re/p, meaning global regular expression print.

For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

Francis, John 5-3871

Wong, Fred 4-4123

Jones, Thomas 1-4122

Salazar, Richard 5-2522

*grep command*		*Description*

`grep '\t5-...1' phone.txt`		print all the lines in phone.txt where the phone number begins with 5 and ends with 1. Note that the tab character is represented by `\t`.
`grep '^S[^ ]* R' phone.txt`		print lines where the last name begins with S and first name begins with R.
`grep '^[JW]' phone.txt`		print lines where the last name begins with J or W
`grep ', ....\t' phone.txt`		print lines where the first name is 4 characters. The tab character is represented by `\t`.
`grep -v '^[JW]' phone.txt`		print lines that do not begin with J or W
`grep '^[M-Z]' phone.txt`		print lines where the last name begins with any letter from M to Z.
`grep '^[M-Z].*[12]$' phone.txt`		print lines where the last name begins with a letter from M to Z and where the phone number ends with a 1 or 2.
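
A few of these searches, sketched in Python against the phone.txt data. Assumptions: `PHONE` holds the four lines with a real tab before the phone number, `grep` is a hypothetical helper, and the last pattern is anchored with `$` so that "ends with a 1 or 2" is enforced. Python's `re` module does read `\t` as a tab.

```python
import re

PHONE = [
    "Francis, John\t5-3871",
    "Wong, Fred\t4-4123",
    "Jones, Thomas\t1-4122",
    "Salazar, Richard\t5-2522",
]

def grep(pattern, lines):
    """Return the lines that contain a match for pattern."""
    return [line for line in lines if re.search(pattern, line)]

starts_5_ends_1 = grep(r"\t5-...1", PHONE)
s_last_r_first = grep(r"^S[^ ]* R", PHONE)
four_letter_first = grep(r", ....\t", PHONE)
m_to_z_ends_1_or_2 = grep(r"^[M-Z].*[12]$", PHONE)

print(starts_5_ends_1)
print(s_last_r_first)
print(four_letter_first)
print(m_to_z_ends_1_or_2)
```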

egrep

egrep is an extended version of grep. It supports a few more metacharacters in its regular expressions. For the examples below, assume we have the text below in a file named phone.txt. Its format is last name followed by a comma, first name followed by a tab, then a phone number.

Francis, John 5-3871

Wong, Fred 4-4123

Jones, Thomas 1-4122

Salazar, Richard 5-2522

*egrep command*		*Description*

`egrep '(John|Fred)' phone.txt`		print all lines that contain the name John or Fred.
`egrep 'John|22$|^W' phone.txt`		print lines that contain John or that end with 22 or that begin with W.
`egrep 'net(work)?s' report.txt`		print lines in report.txt that contain networks or nets.
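
Alternation and the optional group behave the same way in Python's `re` module, which makes for a quick check. This is a sketch: `PHONE` repeats the phone.txt lines and `WORDS` is an invented list standing in for report.txt.

```python
import re

PHONE = [
    "Francis, John\t5-3871",
    "Wong, Fred\t4-4123",
    "Jones, Thomas\t1-4122",
    "Salazar, Richard\t5-2522",
]

# Either name may appear anywhere on the line.
john_or_fred = [line for line in PHONE if re.search(r"John|Fred", line)]

# "work" is optional, so both nets and networks match.
WORDS = ["nets", "networks", "network", "netbooks"]
nets_or_networks = [word for word in WORDS if re.search(r"net(work)?s", word)]

print(john_or_fred)
print(nets_or_networks)
```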

Regular Expressions Syntax Support

Command or Environment	`.`	`[ ]`	`^`	`$`	`\( \)`	`\{ \}`	`?`	`+`	`|`	`( )`
vi	X	X	X	X	X
Visual C++	X	X	X	X	X
awk	X	X	X	X			X	X	X	X
sed	X	X	X	X	X	X
Tcl	X	X	X	X	X		X	X	X	X
ex	X	X	X	X	X	X
grep	X	X	X	X	X	X
egrep	X	X	X	X	X		X	X	X	X
fgrep	X	X	X	X	X
perl	X	X	X	X	X		X	X	X	X

The vi Substitution Command

Vi's substitution command has the form

:range s/pat1/pat2/g

where

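: begins an ex command.

range is the range of lines the command applies to. The percent sign (%) used throughout this paper means every line in the file. Other ranges include:

10,20 for lines 10 through 20

.,$ for the current line through the last line

.+2,$-5 for two lines after the current line through five lines before the last line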

s is the substitution command.

pat1 is the regular expression to be searched for. This paper is full of examples.

pat2 is the replacement pattern.

g is optional. When present the substitution is made to all matches on the line. When it is not present, the substitution is applied only to the first match on the line.
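
The effect of the g option has a rough counterpart in Python's `re.sub`. This is a sketch with an invented sample string: `re.sub` replaces every match by default, which corresponds to using g, and passing `count=1` corresponds to leaving g off.

```python
import re

line = "banana"

# Like :s/a/o/ without g: only the first match on the line changes.
first_only = re.sub(r"a", "o", line, count=1)

# Like :s/a/o/g: every match on the line changes.
every_match = re.sub(r"a", "o", line)

print(first_only)
print(every_match)
```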

There are many online manuals for vi that provide more complete detail. This page has a number of good vi links and information.

Posted by in0de

After Attending TmaxDay 2008

Miscellany 2008. 3. 26. 14:59

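The ambition is admirable,
but something misses the point.

Why SW Stack

The argument put forward for the so-called 'Tmax Software stack' is this:

The monopoly OS - MS
The technology DB - Oracle

Vendor lock-in leaves customers with no choice but to be pushed around by these vendors.

Accordingly, Tmax will assemble an open SW stack of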

OS
DB
Middleware
Application

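and so become a total solution provider.

Misreading the Situation

However, the claim that, in building such a stack,
Tmax will aim for an open SW stack that still works
when other vendors' products are attached to the core of the Tmax product line
is hard to accept, both circumstantially and from experience.

Unless you are on a mainframe, the choice of solution,
leaving aside pairings such as pSeries-AIX,
is open whatever DB or middleware you put on top, and even today
things have not reached the point of 'degenerating into an organization that serves a particular vendor'.
It would be better to say honestly that they want to take market share.
The problem is that customers do not favor a vendor just because it offers a total package.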

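Openness, Compatibility, and Tmax's Technical Limits

Even setting aside doubts about how they read the situation,
I am also skeptical that Tmax can build a layer or stack generic enough
to substantially reduce the complexity of today's consolidation projects.

That is because Tmax could not even implement the XA protocol correctly in OpenFrame,
and when in-doubt transactions occurred as a result,
even though transaction coordination is the middleware's role,
its answer was "It works fine on Oracle but not on DB2, so go ask IBM."
That is the kind of company it was.

Proframe, billed as a success story at site L,
is being built as a framework at the same time as the project itself (!),
and it is notorious for being built on-site from the start and for differing in version from site to site.
SEs complain that when an error occurs while they are coding,
they cannot even tell whether it is their own mistake, a logic error,
or a bug in the framework.

As for Zeus, which has made something of a name for itself,
it is said to be compatible with a certain middleware product down to the error codes,
so I cannot tell whether this is part of the trend toward openness and compatibility
or the source of the astonishing development speed shown so far.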

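Why Build an OS?

If, as they say, an immeasurable 40% of the market lies in applications,
why round out the lineup with an OS instead of choosing to focus on applications?

Individual users might install TmaxOS once out of curiosity and then remove it,
but in the enterprise market, changing infrastructure is not as easy as it sounds.

Even if they succeed in making it 100% compatible with the W32 API and the POSIX API,
so that it can serve as the bottom layer of the SW stack and its standard interfaces as claimed,
if leaving Tmax is just as easy as moving to Tmax from another vendor's product,
do they really believe that is an advantage that will appeal to buyers?

Or, backed by promotion policies, counting on a research lab that churns out products, and relying on lobbying power to push ahead regardless,
will they force the facts to fit the ready-made proposition that having a full product line makes you competitive?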


Posted by in0de

What American accent do you have?

Miscellany 2008. 1. 21. 03:32

Paa-k the caa- in haa-vid yaa-d!

Your Result: The Midland

"You have a Midland accent" is just another way of saying "you don't have an accent." You probably are from the Midland (Pennsylvania, southern Ohio, southern Indiana, southern Illinois, and Missouri) but then for all we know you could be from Florida or Charleston or one of those big southern cities like Atlanta or Dallas. You have a good voice for TV and radio.
Philadelphia
The South
The West
Boston
The Northeast
The Inland North
North Central
What American accent do you have? Quiz Created on GoToQuiz

Posted by in0de

Remote Desktop port override

Miscellany 2007. 12. 31. 09:51

HKLM-SYSTEM-CurrentControlSet-Control-TerminalServer-Wds-rdpwd-Tds-tcp
PortNumber=80

HKLM-SYSTEM-CurrentControlSet-Control-TerminalServer-WinStations-RDP_Tcp
PortNumber=80

With these set, Remote Desktop can be left open on port 80,
but then IIS has to be configured to use a different port.

Posted by in0de

Striping to Raid1+0 Migration

Miscellany 2007. 12. 31. 09:12

Building Raid 10

Since there is a way to change a stripe-mirror into a mirror-stripe,
I figured stripe → stripe-mirror might somehow work too,
but there seems to be no way to do it on the Highpoint 2320.
In the end I rebuilt the array T_T

Posted by in0de

Init()

Miscellany 2007. 8. 13. 03:28

Hello, world!

Posted by in0de
