ChubakPDP11+TakeWithGrainOfSalt@programming.dev to

Programming@programming.dev · 7 months ago

'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

Look, almost none of my work involves HTML (maybe 1-2 percent does); however, about 60% of my work involves regular expressions, grammars, lexical scanning and syntactic parsing, so it still irks me, and will irk me beyond the grave, when people say shit like ‘Don’t parse HTML/Markdown/etc. with regex! Use a parser generator!’

So this is stupid, because most people know that HTML and Markdown are not the type of languages that require a push-down parser, or even a simple LL(1) recursive-descent parser! Unless by ‘parser generator’ they mean ‘lexer generator’ or ‘PEG generator’, they are wrong, or at least, partly incorrect.

Like my diabetes, they are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).

It’s preferable not to do a syntax-directed lexical translation of Markdown or HTML; it’s best to build a tree. I learned that while making Mukette, and I am currently using my implementation of ASDL to build a tree. But the truth is, unlike context-free languages (which covers practically any non-markup language), it is ENTIRELY possible to do a syntax-directed translation of HTML and Markdown using pre-compiled or runtime-compiled regex.

You will have to introduce states to make it a proper automaton, but even that is not required. I once did a syntax-directed translation of Markdown to HTML in AWK, with just one extra state!
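To make the one-extra-state idea concrete, here is a minimal Python sketch. It is not the AWK script mentioned above (that one isn't shown here), and it assumes a tiny Markdown subset: ATX headings, one-line paragraphs, `**bold**`, `*emphasis*` and backtick-fenced code blocks. The names `translate`, `inline` and `in_code` are made up for illustration.

```python
import html
import re

# Pre-compiled patterns for a deliberately small Markdown subset (an assumption).
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
BOLD = re.compile(r"\*\*(.+?)\*\*")
EMPH = re.compile(r"\*(.+?)\*")
FENCE = re.compile(r"^`{3}")  # a line that opens or closes a fenced code block


def inline(text):
    """Escape HTML, then rewrite bold and emphasis spans with regex substitutions."""
    text = html.escape(text, quote=False)
    text = BOLD.sub(r"<strong>\1</strong>", text)
    return EMPH.sub(r"<em>\1</em>", text)


def translate(markdown):
    """Line-by-line, syntax-directed translation to HTML. No tree is built."""
    out = []
    in_code = False  # the single extra state: are we inside a code fence?
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if FENCE.match(line):
            out.append("</code></pre>" if in_code else "<pre><code>")
            in_code = not in_code
        elif in_code:
            out.append(html.escape(line, quote=False))  # no inline markup in code
        elif heading:
            level = len(heading.group(1))
            out.append(f"<h{level}>{inline(heading.group(2))}</h{level}>")
        elif line.strip():
            out.append(f"<p>{inline(line)}</p>")
    return "\n".join(out)
```

Calling `translate(text)` on a string returns the HTML as one string. Lists, block quotes, links and multi-line paragraphs are left out on purpose; each would need more patterns, and probably more state.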

I don’t remember the copypasta that was the talk of the town 10 years ago; I was a kid back then (17), and I could not dig it up. But it’s a troll that has stuck with me ever since.

Maybe, just maybe, a PEG parser generator could have been what they meant. But even then, PEG generators generate a recursive-descent parser most of the time.

In fact, I dare you to use Byacc, Btacc, Bison, Racc, PYLR, ANTLR, peg(1), leg(1), PackCC or any of these LALR, LL or PEG parser generators to parse a markup language. You’ll have a very bad time. It is not impossible; it’s just overkill.

TL;DR: Most markup languages, like HTML or Markdown, are best lexed, not parsed! Even if you wish to make a tree out of it. But for syntax-directed translations, REs would do.
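As a rough illustration of "lexed, not parsed", here is a sketch of a regex-only tokenizer for HTML-like input, written in Python. The token kinds (`COMMENT`, `CLOSE`, `OPEN`, `TEXT`) and the `lex` function are hypothetical names. The pattern is a simplification: it assumes attribute values never contain `<` or `>`, and it does not treat `<script>` or `<style>` bodies specially.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# One alternation with a named group per token kind; the first alternative that matches wins.
TOKEN_RE = re.compile(
    r"(?P<COMMENT><!--.*?-->)"
    r"|(?P<CLOSE></[A-Za-z][A-Za-z0-9]*\s*>)"
    r"|(?P<OPEN><[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>)"
    r"|(?P<TEXT>[^<]+|<)",  # a stray '<' falls back to a one-character TEXT token
    re.DOTALL,
)


def lex(source: str) -> Iterator[Token]:
    """Yield a flat stream of tokens; nesting is not checked or tracked here."""
    for match in TOKEN_RE.finditer(source):
        yield Token(match.lastgroup, match.group())
```

Iterating over `lex('<p class="x">Hi <b>you</b></p>')` yields the kinds OPEN, TEXT, OPEN, TEXT, CLOSE, CLOSE. Whatever consumes the stream can either emit output directly or keep a stack of open tags if it wants a tree.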

Thanks.

PS: If you translate a markup language into a tree, you can translate that tree into other markup languages. That’s what Pandoc does. Pandoc is hands-down the best tool I have laid my hands on.
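The tree idea can be sketched in a few lines of Python. This is not Pandoc's actual document model, only a toy tree with hypothetical node kinds (`doc`, `heading`, `para`, `strong`) and two hand-written renderers, to show that one tree can feed several output formats.

```python
import html
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class Node:
    kind: str  # one of "doc", "heading", "para", "strong" (toy node kinds)
    children: List[Union["Node", Text]] = field(default_factory=list)


def to_html(node):
    """Render the tree as HTML; unknown kinds (such as "doc") add no wrapper."""
    if isinstance(node, Text):
        return html.escape(node.value, quote=False)
    inner = "".join(to_html(child) for child in node.children)
    tag = {"heading": "h1", "para": "p", "strong": "strong"}.get(node.kind)
    return f"<{tag}>{inner}</{tag}>" if tag else inner


def to_markdown(node):
    """Render the same tree as Markdown."""
    if isinstance(node, Text):
        return node.value
    inner = "".join(to_markdown(child) for child in node.children)
    if node.kind == "heading":
        return f"# {inner}\n\n"
    if node.kind == "para":
        return f"{inner}\n\n"
    if node.kind == "strong":
        return f"**{inner}**"
    return inner
```

Adding another output format means adding another function over the same tree; the code that builds the tree does not change.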

DaleGribble88@programming.dev
7 months ago
HTML parsers scare me. I already knew it was a big job, but this blog post sealed the deal that HTML, err… the web’s interpretation of HTML(?), is one heck of a mess.
https://jakearchibald.com/2023/against-self-closing-tags-in-html/
