Representing complex HTML documents will be difficult and full of corner cases, but I just wanted to share a couple of techniques to show how to get this kind of program started. This answer differs in that it uses data abstraction and the toJSON method to recursively build the result.
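In case the toJSON mechanism is unfamiliar: JSON.stringify checks each value it visits for a toJSON method and serializes whatever that method returns, then repeats the check on anything nested inside. Here is a minimal sketch with no DOM involved. Wrapper is a hypothetical name used only for this illustration.

```javascript
// Wrapper is a hypothetical constructor, used only to illustrate toJSON.
// JSON.stringify calls toJSON on each object it visits, including the
// objects nested inside the value a previous toJSON call returned.
const Wrapper = (label, kids = []) => ({
  toJSON: () => ({
    label,
    kids: kids.map(k => Wrapper(k.label, k.kids))
  })
})

const tree = Wrapper('root', [
  { label: 'a', kids: [] },
  { label: 'b', kids: [{ label: 'c', kids: [] }] }
])

const wrapperJson = JSON.stringify(tree)
console.log(wrapperJson)
```

There is no explicit recursive call to JSON.stringify anywhere; the nesting falls out of each Wrapper describing only its own level.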
Below, html2json is a tiny function which takes an HTML node as input and returns a JSON string as the result. Pay particular attention to how the code is quite flat yet still capable of building a deeply nested tree structure. The recursion comes from JSON.stringify calling toJSON on each nested Elem, so very little extra machinery is needed:
// data Elem = Elem Node

const Elem = e => ({
  toJSON : () => ({
    tagName:
      e.tagName,
    textContent:
      e.textContent,
    attributes:
      Array.from(e.attributes, ({name, value}) => [name, value]),
    children:
      Array.from(e.children, Elem)
  })
})

// html2json :: Node -> JSONString
const html2json = e =>
  JSON.stringify(Elem(e), null, ' ')

console.log(html2json(document.querySelector('main')))

<main>
  <h1 class="mainHeading">Some heading</h1>
  <ul id="menu">
    <li><a href="/a">a</a></li>
    <li><a href="/b">b</a></li>
    <li><a href="/c">c</a></li>
  </ul>
  <p>some text</p>
</main>
In the previous example, the textContent gets a little butchered: an element's textContent includes the text of all its descendants, so the same text is repeated at every level of the tree. To remedy this, we introduce another data constructor, TextElem. We'll have to map over the childNodes (instead of children) and choose the correct data type based on e.nodeType, which gets us a little closer to what we might need:
// data Elem = Elem Node | TextElem Node

const TextElem = e => ({
  toJSON: () => ({
    type:
      'TextElem',
    textContent:
      e.textContent
  })
})

const Elem = e => ({
  toJSON : () => ({
    type:
      'Elem',
    tagName:
      e.tagName,
    attributes:
      Array.from(e.attributes, ({name, value}) => [name, value]),
    children:
      Array.from(e.childNodes, fromNode)
  })
})

// fromNode :: Node -> Elem
const fromNode = e => {
  switch (e.nodeType) {
    case 3: return TextElem(e)
    default: return Elem(e)
  }
}

// html2json :: Node -> JSONString
const html2json = e =>
  JSON.stringify(Elem(e), null, ' ')

console.log(html2json(document.querySelector('main')))

<main>
  <h1 class="mainHeading">Some heading</h1>
  <ul id="menu">
    <li><a href="/a">a</a></li>
    <li><a href="/b">b</a></li>
    <li><a href="/c">c</a></li>
  </ul>
  <p>some text</p>
</main>
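One corner case worth knowing about in the second iteration: comment nodes (nodeType 8) fall through to the default branch, and since they have no attributes, Array.from(e.attributes, ...) would throw. Here is a rough sketch of one way to handle that by adding a third constructor. CommentElem and the el, text and comment helpers are my own hypothetical names, and plain objects stand in for DOM nodes so the sketch runs outside a browser.

```javascript
// Plain objects stand in for DOM nodes so this runs without a browser.
// el, text and comment are hypothetical helpers, not DOM APIs.
const ELEMENT_NODE = 1, TEXT_NODE = 3, COMMENT_NODE = 8

const el = (tagName, attributes = [], childNodes = []) =>
  ({ nodeType: ELEMENT_NODE, tagName, attributes, childNodes })
const text = textContent => ({ nodeType: TEXT_NODE, textContent })
const comment = textContent => ({ nodeType: COMMENT_NODE, textContent })

const TextElem = e => ({
  toJSON: () => ({ type: 'TextElem', textContent: e.textContent })
})

// CommentElem is an assumed extra constructor, not part of the code above
const CommentElem = e => ({
  toJSON: () => ({ type: 'CommentElem', textContent: e.textContent })
})

const Elem = e => ({
  toJSON: () => ({
    type: 'Elem',
    tagName: e.tagName,
    attributes: Array.from(e.attributes, ({ name, value }) => [name, value]),
    children: Array.from(e.childNodes, fromNode)
  })
})

// Dispatch on nodeType; comments no longer reach the Elem branch
const fromNode = e => {
  switch (e.nodeType) {
    case TEXT_NODE: return TextElem(e)
    case COMMENT_NODE: return CommentElem(e)
    default: return Elem(e)
  }
}

const sample = el('P', [{ name: 'class', value: 'note' }], [
  text('some text'),
  comment(' todo ')
])

const sampleJson = JSON.stringify(fromNode(sample))
console.log(sampleJson)
```

Other node types could get the same treatment, or be dropped entirely, depending on what the JSON is for.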
Anyway, that's just two iterations on the problem. Of course you'll have to address corner cases as they come up, but what's nice about this approach is that it gives you a lot of flexibility to encode the HTML however you wish in JSON, without introducing too much complexity.
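As one small example of that flexibility, the attributes could be encoded as an object rather than an array of pairs. This is only a sketch: attrsToObject is a hypothetical helper, and the input array mimics the {name, value} entries of an element's attributes collection so it can run without a DOM.

```javascript
// attrsToObject is a hypothetical helper. The input mimics the
// {name, value} entries found in a DOM element's attributes collection.
const attrsToObject = attributes =>
  Object.fromEntries(Array.from(attributes, ({ name, value }) => [name, value]))

const attrs = attrsToObject([
  { name: 'id', value: 'menu' },
  { name: 'class', value: 'nav' }
])

console.log(JSON.stringify(attrs))
```

Inside Elem, this would take the place of the Array.from(e.attributes, ...) expression, and nothing else would need to change.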
In my experience, you could keep iterating with this technique and achieve really good results. If this answer is interesting to anyone and you would like me to expand upon anything, let me know ^_^
Related: Recursive methods using JavaScript: building your own version of JSON.stringify