(originally posted to the BBEdit-Talk list, but posting here too since the answer might help others)

I’m looking for a regex pattern that will find double-quoted strings but skip any double-quoted string containing one of the following characters: $, ', ", \ (dollar sign, single quote, double quote, backslash).

At first I tried "[^$'"\\]+?" but it was matching the end of one quoted string and the beginning of the next, so I’m clearly missing something.
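To make the failure concrete, here is a small sketch in Python rather than BBEdit (Python's `re` is not identical to BBEdit's PCRE flavour, but the two behave the same for this simple pattern). The sample line and the helper name `naive_matches` are made up for illustration. Once the engine rejects the first string because of the dollar sign, it is free to treat that string's closing quote as a new opening quote.

```python
import re

# The first attempt from above, with the backslash escaped inside the class.
NAIVE = re.compile(r'''"[^$'"\\]+?"''')


def naive_matches(line):
    """Return every substring the naive pattern matches in `line`."""
    return NAIVE.findall(line)


# Hypothetical sample: the first string contains "$" and should be skipped,
# the second string is the one we would like to find.
sample = 'echo "$name" . " is here";'

# The only match is the gap between the two strings: the closing quote of
# the first string, the concatenation operator, and the opening quote of
# the second string.
print(naive_matches(sample))  # ['" . "']
```

The pattern has no notion of which quote opens a string and which one closes it, so any two quotes with "safe" characters between them look like a match.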

Regexes in Depth: Advanced Quoted String Matching was helpful, but didn’t explain how to negate strings containing the characters above.

Strings that should fail to match:

// contains quotes
$str = "`zcol ACOL` NUMBER(32,2) DEFAULT 'The "cow" (and Jim''s dog) jumps over the moon' PRIMARY, INTI INT AUTO DEFAULT 0, zcol2"afs ds";

// contains dollar signs, backslashes and single quotes
ADOConnection::outp( " -- $_SESSION['AVAR']={$_SESSION['AVAR']}",false);

// contain single quotes
if (strncmp($val,"'",1) != 0 && substr($val,strlen($val)-1,1) != "'") {

Strings that should successfully match:

$myvar = "this is my quoeted ".$and_another_var." and another string";

Also, the opening quote of a match should not be preceded by a backslash.

I’ve read and reread the BBEdit docs (which are great) but I’ve been unable to come up with a method that passes all of these tests.

I never had any idea this could be such a complicated problem. Does anyone see what I’m missing?

Update

Matching negative character classes is prone to difficulties because it’s hard to manage what comes before and after the class. That’s why I ended up using the following, which worked more or less well for me and avoided matching properly quoted strings inside HTML.

(?s)(?<!name=|action=|align=|valign=|width=|height= |nowrap=|scope=|class=|id=|style=|type=|value=|method=|border= |cellspacing=|cellpadding=|colspan=|size=|maxlength=|for=|label= |rows=|cols=|wrap=|language=|href=|version=|fuse=|charset=|src= |alt=|title=|xmlns=|http-equiv=|rel=|content=|rowspan=|checked= |accept=|face=)(?<!')(?<!\\)(?<!\?>) "((?!\.|,|, | ,| , |\. | \. |:| :|: | : )[[:alnum:] -_.,:%@<>?()*/]*?(?<!\\))"

Update 2

Give me a break! Here’s the solution to this problem: matching quoted strings.
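For anyone who wants to see the general idea without following the link, here is a rough sketch in Python. It is my own illustration and not necessarily what the linked solution does: first match every double-quoted string as a whole unit, then throw away the ones that contain a forbidden character. The names `QUOTED`, `FORBIDDEN` and `simple_quoted_strings` are invented for this example.

```python
import re

# Match one complete double-quoted string: an opening quote that is not
# preceded by a backslash, then any run of escaped characters or characters
# that are neither a quote nor a backslash, then the closing quote.
QUOTED = re.compile(r'(?<!\\)"(?:\\.|[^"\\])*"', re.DOTALL)

# Characters that disqualify a string: $ ' " and backslash.
FORBIDDEN = set("$'\"\\")


def simple_quoted_strings(source):
    """Return the non-empty double-quoted strings free of FORBIDDEN characters."""
    results = []
    for match in QUOTED.finditer(source):
        body = match.group(0)[1:-1]  # strip the surrounding quotes
        if body and not (FORBIDDEN & set(body)):
            results.append(match.group(0))
    return results


line = '$myvar = "this is my quoted ".$and_another_var." and another string";'
print(simple_quoted_strings(line))
# ['"this is my quoted "', '" and another string"']

print(simple_quoted_strings('echo "$name" . " is here";'))
# ['" is here"']
```

Because every string is consumed from its opening quote to its closing quote before the filter runs, a rejected string can no longer donate its closing quote to the next match. The sketch knows nothing about PHP single-quoted strings, comments or heredocs, so a double quote inside any of those will still throw it off.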

We’re launching a new small business server product in the coming weeks, ideal for small businesses that need automated backups (and restores), shared internet, shared files, and one or two other goodies. The server is only available for rent starting at 200€/month (including maintenance). This product is, to some degree, the culmination of about 3 years of running our own, small, hosting environment which, as far as we can tell, has not (yet) been compromised. I doubt we could keep a determined hacker from getting in but we’ve so far been able to keep the script kiddies at bay. Here are some of the things we’ve learned along the way.

Use a firewall, even a software-based firewall such as the Endian Firewall. You’ll have to work some magic internally if you want to use host-based routing, but more complication just makes hacking more complicated and unless you have a really juicy target, most hackers will go elsewhere (we presume).
Install and configure mod_security (it claims to protect against XSS and many other things automagically). We haven’t been able to verify its functionality, but just knowing there’s another layer there makes us feel better 😀

PHP

  • turn off fopen wrappers
  • turn off register globals
  • turn off expose_php
  • disable unused functions and classes
  • install only the extensions you’re sure you’ll need
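If you want to double-check a php.ini against the list above, something like the following Python sketch can help. It assumes the first three bullets correspond to the `allow_url_fopen`, `register_globals` and `expose_php` directives and that unused functions are listed in `disable_functions`; the parser is deliberately naive (it treats everything after a semicolon as a comment) and the function names are my own.

```python
# Directives that the list above suggests switching off (assumed mapping).
RECOMMENDED_OFF = ("allow_url_fopen", "register_globals", "expose_php")
ON_VALUES = ("on", "1", "true", "yes")


def parse_ini(text):
    """Very small php.ini reader: returns {directive: value}."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop ";" comments
        if not line or line.startswith("[") or "=" not in line:
            continue  # blank line, section header, or not a key=value line
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def audit(text):
    """Return a list of human-readable warnings for the given php.ini text."""
    settings = parse_ini(text)
    warnings = []
    for name in RECOMMENDED_OFF:
        value = settings.get(name)
        if value is None:
            warnings.append(f"{name} is not set explicitly")
        elif value.lower() in ON_VALUES:
            warnings.append(f"{name} is enabled")
    if not settings.get("disable_functions"):
        warnings.append("disable_functions is empty")
    return warnings


sample = """[PHP]
allow_url_fopen = Off
register_globals = On ; legacy app needs this
expose_php = Off
disable_functions = exec,system
"""
print(audit(sample))  # ['register_globals is enabled']
```

A directive that is missing from the file is reported separately, because the built-in default then applies and that default may not be the one you want.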

Disable other server-side scripting engines and CGI (assuming you are running PHP as an Apache module)
Turn off other unused services

  • email
  • telnet
  • ftp
  • ssh
  • etc.

Uninstall unneeded software (such as the whole Gnome interface and anything that requires runlevel 5 to function – this is a server after all). You might even consider building the server from a base install of Debian or Ubuntu Server (both of which fit in 64 MB of memory).
Log everything and increase the log history (double-edged sword).

Don’t expose what web server you are running (or PHP or any other server-side technologies) in HTTP responses. In fact, if possible, alter the server signature (and fingerprint) to something unrecognizable or too generic to be of much help.
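A quick way to see what your server gives away is to look at the response headers. The Python sketch below only inspects a dictionary of headers; which headers count as revealing (`Server` and `X-Powered-By` here) is my assumption, and the function name is invented for the example.

```python
# Header names that commonly advertise the server software (assumed list).
REVEALING_HEADERS = ("server", "x-powered-by")


def revealing_headers(headers):
    """Return the subset of `headers` that names server-side software."""
    leaks = {}
    for name, value in headers.items():
        if name.lower() in REVEALING_HEADERS and value.strip():
            leaks[name] = value
    return leaks


# Hypothetical response headers from an unhardened server.
example = {
    "Server": "Apache/2.2.3 (CentOS)",
    "X-Powered-By": "PHP/5.1.6",
    "Content-Type": "text/html",
}
print(revealing_headers(example))
# {'Server': 'Apache/2.2.3 (CentOS)', 'X-Powered-By': 'PHP/5.1.6'}
```

To check a live site you could pass in `dict(urllib.request.urlopen(url).headers.items())`. A generic value will still be reported, so treat the output as a list of things to review rather than a pass/fail result.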

I’m sure there are more tips I’m forgetting, but these should help you get started. I’d love to hear others’ experiences and tips if you care to share…

Roger Johansson over at 456 Berea Street, reflecting on a series of articles by John Allsopp regarding HTML semantics, asks the question: “Should there be another way of extending and improving the semantics of HTML without requiring the specification to be updated?”

Personally, I think the issue revolves around the misuse of HTML to mark up something other than research papers.

It is my understanding that HTML is an application of SGML, a markup standard used to prepare research papers for mass reproduction on offset printers. As such, the vocabulary (the tags) in HTML reflects the type of data being marked up. Consequently, when HTML is used to mark up documents that are not academic in nature (are not research papers), authors are left cobbling together solutions to retain the semantic value, but that rarely works. For example, if you want to mark up a mathematical equation, you’ll need the MathML specification precisely because HTML doesn’t have the vocabulary necessary for describing the content.

I find it a little ironic that Tim Berners-Lee has basically turned everyone into an academic in some sense, by enabling them to do massive research and post their findings. However, current technology limits us to “browsing” research papers, even though we’ve creatively found ways to publish much, much more than that.

I think the world is missing a browser that is able to render a variety of markup languages (vocabularies), including HTML, MathML, XHTML, XHTML2, XForms, SMIL, and others (although the last two named are not technically markup languages). I can imagine a world in which marketers define their own markup specification for sharing data safely (a problem I think microformats are trying to solve). In fact, markup languages can be defined for nearly any field. The problem is, we don’t have web browsers capable of rendering the data in the source documents in any meaningful fashion because no formatting information is associated with any of the elements of these foreign markup languages. In fact, I find it hard to imagine what a marketing database or recipe list would look like if not some kind of document.

So, in conclusion, I’m not sure if I’ve made my point, but basically I think any semantic improvements in HTML will come from focusing on the domain it was originally intended for (academia) rather than from trying to extend it to other domains that have little or nothing to do with writing research papers.