Performance optimization - post your saves

SoShootMe · Post by **SoShootMe** » Tue Aug 24, 2021 7:19 am

mrvn wrote: ↑
Tue Aug 24, 2021 1:59 am

SoShootMe wrote: ↑
Fri May 07, 2021 4:07 pm

ptx0 wrote: ↑
Fri May 07, 2021 2:38 pm
too bad we can't pin threads to cores
AFAIK you can, and I've often thought it may offer some (small) performance benefit due to caches. But you have to make sure other things are excluded from running on those cores too, so in most cases it seems like micro-management overkill.
There can be some gain with improved cache hits. If you have 2 sockets/cores with separate caches then pinning threads to each set of cores/threads that share caches can be beneficial, even if they get interrupted every now and then.

The reason I wrote "small" was on the basis that it won't take long to "warm up" the cache compared to how long a thread will typically run before being pre-empted, assuming it doesn't call/trap into the kernel. In other words, only a small fraction of potential progress is lost by needing to fill the cache (non-pinned case) or, equivalently, there is only a small gain by not needing to fill the cache (pinned case).

The problem with other threads running on the pinned core is that they will "steal" time from the pinned thread, and may "pollute" the cache, both eroding the gain from pinning. Of course, the smaller (or larger) that gain is, the more (or less) important it is to avoid time stealing/cache pollution.

Since you went on to talk about NUMA... I was considering only a non-NUMA system. Although a cache local to certain core(s) has some similarities to main memory local to certain core(s) in a NUMA system.

mrvn · Post by **mrvn** » Tue Aug 24, 2021 10:49 am

SoShootMe wrote: ↑
Tue Aug 24, 2021 7:19 am

mrvn wrote: ↑
Tue Aug 24, 2021 1:59 am

SoShootMe wrote: ↑
Fri May 07, 2021 4:07 pm

ptx0 wrote: ↑
Fri May 07, 2021 2:38 pm
too bad we can't pin threads to cores
AFAIK you can, and I've often thought it may offer some (small) performance benefit due to caches. But you have to make sure other things are excluded from running on those cores too, so in most cases it seems like micro-management overkill.
There can be some gain with improved cache hits. If you have 2 sockets/cores with separate caches then pinning threads to each set of cores/threads that share caches can be beneficial, even if they get interrupted every now and then.
The reason I wrote "small" was on the basis that it won't take long to "warm up" the cache compared to how long a thread will typically run before being pre-empted, assuming it doesn't call/trap into the kernel. In other words, only a small fraction of potential progress is lost by needing to fill the cache (non-pinned case) or, equivalently, there is only a small gain by not needing to fill the cache (pinned case).

The problem with other threads running on the pinned core is that they will "steal" time from the pinned thread, and may "pollute" the cache, both eroding the gain from pinning. Of course, the smaller (or larger) that gain is, the more (or less) important it is to avoid time stealing/cache pollution.

Since you went on to talk about NUMA... I was considering only a non-NUMA system. Although a cache local to certain core(s) has some similarities to main memory local to certain core(s) in a NUMA system.

I was kind of hinting at a middle ground there.

For example, you pin some threads to cores {0,1,2,3} and some to {4,5,6,7} because sets of 4 CPU cores/threads share L1/L2 caches. But there are 2 L3 caches, so you don't want threads to jump between those two. L3 caches are also the largest, so they take the longest to fill back up after a switch.
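
A minimal sketch of that kind of pinning, assuming Linux and Python's `os.sched_setaffinity` (which only exists on some platforms). The core sets and the helper name `pin_to_cores` are illustrative assumptions; which cores actually share a cache depends on the machine, and none of this is something Factorio exposes.

```python
import os

def pin_to_cores(cores):
    """Restrict the caller to `cores`; return the resulting CPU set.

    Returns None where the affinity API does not exist (e.g. Windows, macOS).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # CPUs we may run on right now
    wanted = set(cores) & allowed       # never request a CPU we do not have
    if wanted:                          # keep the current mask if no overlap
        os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)

# Hypothetical layout: cores 0-3 share one cache domain, cores 4-7 the other.
CACHE_DOMAINS = [{0, 1, 2, 3}, {4, 5, 6, 7}]

if __name__ == "__main__":
    print(pin_to_cores(CACHE_DOMAINS[0]))
```

Pinning alone does not keep other work off those cores, so the time-stealing and cache-pollution concerns raised earlier in the thread still apply.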

OvermindDL1 · Post by **OvermindDL1** » Fri Jan 14, 2022 7:05 pm

As requested by posila at viewtopic.php?p=559889#p559889 posting this save to look at Surface::listFriendlyEntities performance issues:
https://overminddl1.com/Factorio/perf/OverSE-E1-22.zip

ptx0 · Post by **ptx0** » Sun Jan 16, 2022 9:07 pm

mrvn wrote: ↑
Tue Aug 24, 2021 1:59 am
After that it makes a huge difference with NUMA but only if your memory allocation and usage is also NUMA aware. For example you would allocate all inserters in a memory segment in one NUMA domain and pin the thread doing inserter updates to cores in the same NUMA domain. The thread would run a lot faster there. Or split the map into quadrants and place each quadrant into one NUMA domain and pin a thread there. Then each assembler, belt, inserter needs to be allocated in the right memory block and processed by the right thread.

Factorio isn't really designed for that. The core design was really single threaded and since then it only got a few patch ups for special things that could be made multi threaded after the fact. Like the fluid system that could be broken into independent parts. But there isn't an inserter thread that runs in parallel with, for example, an assembler thread. It's not designed that way. Afaik nothing in factorio would be able to optimize for NUMA and pinned threads without a major overhaul. It's just not designed that way and maybe doesn't fit that memory model at all.

Best approach might be to split things by geography. Split the map into largish tiles and everything in that tile gets pinned to one core and its closest memory. Of course that requires extra work at the tile boundaries where threads would collide, but given large enough tiles that can be minimized (overhead wise). But that would probably only help systems with NUMA, as the working set is far too big to make cache effects relevant in that optimization.

one of the devs mentioned that they'd done a small scale experiment where they refactored memory allocation for chunks to use more localised storage. either they didn't do it very well, or missed something, but it had very mild gains in performance.

then again, i've suggested they use mimalloc before and rseding91's system only sees 6% gain, because he insists on testing on Windows.

quyxkh · Post by **quyxkh** » Mon Jan 17, 2022 6:06 pm

ptx0 wrote: ↑
Sun Jan 16, 2022 9:07 pm
one of the devs mentioned that they'd done a small scale experiment where they refactored memory allocation for chunks to use more localised storage. either they didn't do it very well, or missed something, but it had very mild gains in performance.

then again, i've suggested they use mimalloc before and rseding91's system only sees 6% gain, because he insists on testing on Windows.

"Only" sees 6% gain.

If they'd additionally link with libhugetlbfs on linux that'd ... I think the only fair characterization here is "skyrocket". I'd expect >25% speed boost on big maps.

edit: I've seen the effects https://www.reddit.com/r/factorio/comme ... uge_pages/ reports

ptx0 · Post by **ptx0** » Tue Jan 18, 2022 4:13 am

viewtopic.php?p=537779#p537779

it was in this thread, too

MartinG · Post by **MartinG** » Sat Jan 29, 2022 1:33 am

Hello, I described a significant performance drop in this post: viewtopic.php?f=49&t=101353&p=560747#p560747

Anyone care to check it out?

deep_remi · Post by **deep_remi** » Thu Mar 31, 2022 12:42 pm

This circuit is a good stress test for combinators (deep convolutional neural network with ~150k combinators):
viewtopic.php?f=193&p=565099
I feel it could be made faster when it is doing nothing. I tried to power it off when it is not used, but it made no significant difference. Any optimization trick is welcome.

Sopel · Post by **Sopel** » Thu Apr 14, 2022 3:07 pm

I got curious and tried to profile factorio on some 10k spm megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they are helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% runtime each, which is reasonable; a lot of railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter update takes 50% of main thread's runtime! I always thought that the crafting entities would be a bigger offender but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so there surely is some parallelism possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?

Post by **Rseding91** » Fri Apr 15, 2022 2:04 pm

Sopel wrote: ↑
Thu Apr 14, 2022 3:07 pm
I got curious and tried to profile factorio on some 10k spm megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they are helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% runtime each, which is reasonable; a lot of railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter update takes 50% of main thread's runtime! I always thought that the crafting entities would be a bigger offender but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so there surely is some parallelism possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?

You'll notice that the majority of the time is spent inspecting/chasing items on belts. Inserters have 2 purposes: move items between a belt and something else, and move items between 2 inventories. As a player, if you use belts, the majority of the time will be an inserter trying to remove/add an item from/to the belt. This is unavoidable.

Since multiple inserters can interact with belts, circuit networks, inventories, destroy item-entities, create item-entities, go active/inactive, and so on, running them in multiple threads is not viable. The part of the inserter update where it *doesn't* mutate other entities/state is the part you don't see in the profiler because it takes near zero time.

Sopel · Post by **Sopel** » Fri Apr 15, 2022 6:35 pm

Rseding91 wrote: ↑
Fri Apr 15, 2022 2:04 pm

Sopel wrote: ↑
Thu Apr 14, 2022 3:07 pm
I got curious and tried to profile factorio on some 10k spm megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they are helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% runtime each, which is reasonable; a lot of railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter update takes 50% of main thread's runtime! I always thought that the crafting entities would be a bigger offender but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so there surely is some parallelism possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?
You'll notice that the majority of the time is spent inspecting/chasing items on belts. Inserters have 2 purposes: move items between a belt and something else, and move items between 2 inventories. As a player, if you use belts, the majority of the time will be an inserter trying to remove/add an item from/to the belt. This is unavoidable.

Since multiple inserters can interact with belts, circuit networks, inventories, destroy item-entities, create item-entities, go active/inactive, and so on, running them in multiple threads is not viable. The part of the inserter update where it *doesn't* mutate other entities/state is the part you don't see in the profiler because it takes near zero time.

Creation/destruction of item entities is not a common occurrence, so I think it could be tracked per thread and then processed sequentially afterwards. As for the dependencies/interactions, I still believe that can be resolved dynamically pretty well. I coded up a simplified demonstration in Python with a naive algorithm. It randomizes inserter connections (which is the worst case) and then attempts to split the inserters into groups that are independent.

Code:

import random

class InserterEntity:
    '''
    Each state would have a virtual "mutex" that identifies it.
    I presume it would be enough to have one per inventory/belt/circuit.
    Could just use a pointer to the state?
    '''
    def __init__(self, source_state_mutex, destination_state_mutex, circuit_state_mutex):
        self.source_state_mutex = source_state_mutex
        self.destination_state_mutex = destination_state_mutex
        self.circuit_state_mutex = circuit_state_mutex

    def get_dependent_states(self):
        if self.circuit_state_mutex is not None:
            return (self.source_state_mutex, self.destination_state_mutex, self.circuit_state_mutex)
        else:
            return (self.source_state_mutex, self.destination_state_mutex)

def create_some_inserter_entities(num_inserters, num_other_entities, num_circuits, circuit_chance):
    entities = []
    for i in range(num_inserters):
        source_id = random.randrange(num_other_entities)
        destination_id = num_other_entities + random.randrange(num_other_entities)
        circuit_id = num_other_entities * 2 + random.randrange(num_circuits) if random.random() < circuit_chance else None
        entities.append(InserterEntity(source_id, destination_id, circuit_id))
    return entities

def split_into_independent_groups(entities, num_groups):
    '''
    Split into num_groups independent groups and one more group
    that contains items that depend on more than 1 other group.
    '''
    groups = [[] for i in range(num_groups)]
    mutexes_in_groups = [set() for i in range(num_groups)]
    seq_after = []

    for entity in entities:
        deps = entity.get_dependent_states()
        has_dep = [False] * num_groups
        for i, mutexes_in_group in enumerate(mutexes_in_groups):
            for dep in deps:
                if dep in mutexes_in_group:
                    has_dep[i] = True
                    break
        num_deps = has_dep.count(True)
        if num_deps > 1:
            seq_after.append(entity)
        elif num_deps == 1:
            for i in range(num_groups):
                if has_dep[i]:
                    groups[i].append(entity)
                    mutexes_in_groups[i].update(deps)
                    break
        elif num_deps == 0:
            best_group = 0
            for i in range(1, num_groups):
                if len(groups[i]) < len(groups[best_group]):
                    best_group = i
            groups[best_group].append(entity)
            mutexes_in_groups[best_group].update(deps)

    return groups, seq_after

num_groups = 8
num_inserters = 10000
num_unique_connectibles = 3000
num_circuit_networks = 20
inserter_with_circuit_rate = 0.01
entities = create_some_inserter_entities(num_inserters, num_unique_connectibles, num_circuit_networks, inserter_with_circuit_rate)
print('Total: ', len(entities))
groups, seq_after = split_into_independent_groups(entities, num_groups)
for i, group in enumerate(groups):
    print('Group {}: '.format(i), len(group))
print('Seq group: ', len(seq_after))

print('Second step maybe to parallelize leftover?')
groups, seq_after = split_into_independent_groups(seq_after, num_groups)
for i, group in enumerate(groups):
    print('Group {}: '.format(i), len(group))
print('Seq group: ', len(seq_after))

I'll leave it there. I only know the in-game mechanics as far as playing reveals them, so I don't have enough knowledge to discuss this further.

Post by **ssilk** » Sat May 14, 2022 8:15 am

I want to point to this subject
viewtopic.php?f=6&t=102385 [Optimization] Improvement for container insertion logic

Edit: discussion was merged and is now at viewtopic.php?p=567987#p567987

Because it seems to be a problem in megabases:

… the fact that insertion/removal logic behaves exactly as if it's iterating over every slot, and very large containers (warehouse mods, etc) are notorious for causing lag unless used sparingly, suggests it.
There have been threads discussing this and talking about adding indexing of some sort before, though I decided to make a new thread because I'm suggesting a specific indexing method.

It describes a way to add an index to the chest:

if you're just adding or removing the same items from a chest, and you've already found the slot to do so with, there's no reason to iterate over the slots again

Nevertheless it’s worth reading, because it describes the problem very well.
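
A minimal sketch of the "remember the slot" idea from the quote above, assuming a simplified container where each slot holds one item name and a count. The class `IndexedContainer`, its fields and the stack size of 100 are hypothetical; this is not how Factorio stores inventories.

```python
class IndexedContainer:
    """Toy inventory that caches the last slot used for each item name."""

    def __init__(self, num_slots, stack_size=100):
        self.slots = [None] * num_slots   # each slot: None or [name, count]
        self.stack_size = stack_size
        self.hint = {}                    # item name -> last slot index used
        self.scans = 0                    # full scans performed (illustration)

    def _find_slot(self, name):
        # Fast path: the cached slot still holds this item and has room.
        i = self.hint.get(name)
        if i is not None:
            slot = self.slots[i]
            if slot is not None and slot[0] == name and slot[1] < self.stack_size:
                return i
        # Slow path: linear scan, preferring a partial stack over an empty slot.
        self.scans += 1
        empty = None
        for i, slot in enumerate(self.slots):
            if slot is None:
                if empty is None:
                    empty = i
            elif slot[0] == name and slot[1] < self.stack_size:
                return i
        return empty

    def insert(self, name):
        """Insert one item; return False if the container is full."""
        i = self._find_slot(name)
        if i is None:
            return False
        if self.slots[i] is None:
            self.slots[i] = [name, 0]
        self.slots[i][1] += 1
        self.hint[name] = i
        return True
```

In this toy model, repeated insertion of the same item only falls back to a full scan when the cached stack fills up, which is the saving the linked suggestion is after. Removal would need the same kind of hint and is left out here.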

Taneeda · Post by **Taneeda** » Wed May 18, 2022 11:41 am

Here is a save file of my spaghetti megabase, not the biggest but pretty big (~2.9kSpm), so I thought maybe it's interesting to use it for researching possible performance improvements. Let me know how I can help...

The link contains several iterations of the same base. https://drive.google.com/drive/folders/ ... sp=sharing

dasiro · Post by **dasiro** » Sun Jun 05, 2022 3:13 pm

Sopel wrote: ↑
Thu Apr 14, 2022 3:07 pm

how did you make this?

I've got a seablock save that's 200MB COMPRESSED (2.2GB uncompressed) with a few extra mods, but really mega-base sized imho. my UPS is horrible and the game takes up about 30GB+ while running (half is pagefile on my ssd). It takes minutes to load/save and I'm more and more thinking about abandoning it, just because of the bad performance.

my trusty old i5-6600K, 16GB ought to be enough, but my train updates are running wild at around 16 most of the time causing my UPS to drop to 30 :/

If anyone could help me salvage this 960+h save it would be very appreciated
https://www.transferxl.com/download/08jk9yGYQtqvY7
(retention = 1 week so msg me if it's expired)

Post by **Rseding91** » Sun Jun 05, 2022 7:24 pm

dasiro wrote: ↑
Sun Jun 05, 2022 3:13 pm
... I've got a seablock save that's 200MB COMPRESSED (2.2GB uncompressed) with a few extra mods ...

Just so you know, the mod train-log is what's taking 98% of your save file size and runtime memory usage.

dasiro · Post by **dasiro** » Sun Jun 05, 2022 8:53 pm

I suspected LTN but I couldn't find anything that pointed in its direction. Certainly since the UPS returned to 60 regularly but it happened both with a lot of trains in depot and in movement.

I've cleared the history and now my compressed size is only 24MB so the loading and saving is already a HUGE win.

What I suspected to be 1 problem appears to be 2, since the UPS is still unchanged with factorio only taking up 35% CPU and 50% usage in total.

Is there any way I could diagnose this myself, since the saves are serialized?

thanks a lot already and I guess that memory upgrade can wait a bit longer now so I'll get some more merch

Post by **Rseding91** » Sun Jun 05, 2022 10:36 pm

F4 -> show-entity-time-usage for per-entity timings
F4 -> show-time-usage for other timings

Qon · Post by **Qon** » Thu Jun 09, 2022 6:23 pm

Issue 1

I get bad performance (9 UPS) with this somewhat large combinator contraption with ~800 000 combinators (~500 000 not counting constants). I expected that, obviously there's a limit to how many combinators the game can simulate per update. If it can be sped up in some way anyways that would be neat, I get 11 UPS when all the combinators are in a fixed state with no input updates. Might just be the switches and lamps that don't toggle when the circuits don't update that give those extra UPS. Only updating output of combinators that get updated inputs would speed my circuit up immensely. I guess it hasn't been done already because it would reduce performance for circuits that are always changing their inputs?
(Editor Extensions mod is used for power but can be replaced with vanilla interface for testing, Recursive Blueprints+ is needed for the circuit to do anything but isn't needed for performance testing. Other mods are not needed either and don't really affect UPS.)
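
A minimal sketch of the "only update combinators whose inputs changed" idea, assuming a toy model where each node's output on the next tick is a function of its inputs on the current tick. The class `Network` and its methods are hypothetical and do not reflect how the game implements circuit networks.

```python
class Network:
    """Toy combinator network: output at tick t+1 = f(inputs at tick t)."""

    def __init__(self):
        self.func = {}        # node id -> function(list of input values)
        self.inputs = {}      # node id -> list of source node ids
        self.consumers = {}   # node id -> list of node ids reading it
        self.value = {}       # node id -> current output
        self.dirty = set()    # nodes whose inputs changed since last evaluation
        self.evaluations = 0  # how many node updates were actually computed

    def add_constant(self, node, value):
        self.value[node] = value
        self.consumers.setdefault(node, [])

    def add(self, node, func, inputs, initial=0):
        self.func[node] = func
        self.inputs[node] = list(inputs)
        self.consumers.setdefault(node, [])
        for src in inputs:
            self.consumers.setdefault(src, []).append(node)
        self.value[node] = initial
        self.dirty.add(node)  # evaluate once so outputs become consistent

    def set_constant(self, node, new_value):
        """Change a constant source; only its consumers become dirty."""
        if self.value[node] != new_value:
            self.value[node] = new_value
            self.dirty.update(self.consumers[node])

    def tick(self):
        # Read phase uses last tick's values, so evaluation order is irrelevant.
        updates = {}
        for node in self.dirty:
            self.evaluations += 1
            args = [self.value[s] for s in self.inputs[node]]
            updates[node] = self.func[node](args)
        # Write phase: commit changes and wake only the affected consumers.
        self.dirty = set()
        for node, new in updates.items():
            if new != self.value[node]:
                self.value[node] = new
                self.dirty.update(self.consumers[node])
```

Once the network settles, `tick()` does no work at all in this model. The cost is the extra bookkeeping per change, which fits the guess above that circuits whose inputs change every tick could end up slower.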

Issue 2, Maybe this would better fit as a bug report:

Well, the reason I report this is that deconstruction of just a handful of combinators completely freezes the game (marking is fine without instant deconstruction, it freezes when bots start removing or if you have /editor deconstruction). It takes like 10 seconds to remove ~100 (one cell) entities with /editor instant deconstruction where the game is completely frozen. But only from the big connected block, things not connected to the big contraption are removed quickly. This is the same even if mods are disabled. Trying to remove more at once just makes the game freeze longer or permanently, which makes it practically impossible to remove the thousands of cells that exist now. Construction is instant still though. If you manage to remove a handful of cells after waiting for minutes and press Ctrl+Z the entities are back instantly. Is this performance issue fixable?

mrvn · Post by **mrvn** » Mon Aug 29, 2022 9:13 am

Rseding91 wrote: ↑
Fri Apr 15, 2022 2:04 pm

Sopel wrote: ↑
Thu Apr 14, 2022 3:07 pm
I got curious and tried to profile factorio on some 10k spm megabase viewtopic.php?t=90233 (thanks for leaving debug symbols in btw, it's the second time they are helpful). I focused on the main thread, and I'm happy to see that belts and trains account for about 10% runtime each, which is reasonable; a lot of railway and belt logic already gets nicely split and offloaded. What is troubling is that inserter update takes 50% of main thread's runtime! I always thought that the crafting entities would be a bigger offender but it turns out inserters are 10x more costly to run! Is it really not possible to speed this up? Inserter updates don't strike me as particularly dependent on each other, at least not densely, so there surely is some parallelism possible on large bases? Construct a dependency forest based on connections and parallelize from there (the only thing I see that requires mutual exclusion is inventory input/output)? Whatever, just something to think about.

I'm kinda interested in benchmarking individual entities in isolation now, but will probably have to wait until I have more time. In particular, how much better would loaders (especially now with parallelized transport lines) be? Which inserter setup is best per item/s?
You'll notice that the majority of the time is spent inspecting/chasing items on belts. Inserters have 2 purposes: move items between a belt and something else, and move items between 2 inventories. As a player, if you use belts, the majority of the time will be an inserter trying to remove/add an item from/to the belt. This is unavoidable.

Since multiple inserters can interact with belts, circuit networks, inventories, destroy item-entities, create item-entities, go active/inactive, and so on, running them in multiple threads is not viable. The part of the inserter update where it *doesn't* mutate other entities/state is the part you don't see in the profiler because it takes near zero time.

You are ignoring the "Construct a dependency forest based on connections" part.

My guess is that he suggests something like the fluid network, where the game detects fluid networks that don't interact and then runs them in parallel. For inserters a "network" is connected by anything where inserters interact within a tick. For example an inserter moves items from a belt into a chest and a second inserter moves them from the chest into an assembler. That puts the two inserters into the same "network". Two inserters remove items from the same belt ==> same "network". Inserters and boilers chained to pass through fuel => one big "network". I'm not sure about an inserter inserting into an assembler and one removing from an assembler. Afaik the input and output are independent inventories and assemblers cycle at most once per tick. So that could make the inserters unconnected. But a cycle only starts when the output has space (which depends on the inserter removing stuff). And when the cycle starts the inserter inserting the next batch starts. So that could make them connected. Nonetheless I expect there are many small inserter "networks". One thing that could join a lot of inserters would be circuit wires.
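
A minimal sketch of how such "networks" could be detected, using a union-find over the states each inserter touches. The ids and the function `inserter_networks` are hypothetical, and whether an assembler's input and output count as one shared state is left to whoever builds the input, since that question is open above.

```python
def find(parent, x):
    """Return the representative of x, compressing the path as we go."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def inserter_networks(inserters):
    """Group inserters that share any state (belt, chest, circuit, ...).

    `inserters` maps an inserter id to a non-empty sequence of the state ids
    it touches. Inserters in different returned groups share no state.
    """
    parent = {}
    for states in inserters.values():
        for s in states:
            parent.setdefault(s, s)
        first = states[0]
        for s in states[1:]:
            ra, rb = find(parent, first), find(parent, s)
            if ra != rb:
                parent[rb] = ra   # merge the two networks
    groups = {}
    for ins, states in inserters.items():
        groups.setdefault(find(parent, states[0]), []).append(ins)
    return list(groups.values())

example = {
    "i1": ("belt_a", "chest"),        # belt -> chest
    "i2": ("chest", "assembler_1"),   # chest -> assembler, joins i1
    "i3": ("belt_b", "assembler_2"),  # unrelated block
    "i4": ("belt_a", "assembler_3"),  # same belt as i1
}
print(inserter_networks(example))
```

Each resulting group could in principle be updated on its own thread. A single circuit wire or shared belt merges groups, so how much parallelism survives depends on the factory layout.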

But factories do tend to have blocks of assemblers doing one thing, like producing electronic circuit boards. The block is fed by belts, bots or trains and the output again goes to belts, bots or trains. At the very least those would be the border for inserter/assembler "networks". If the game detected such blocks and then processed each block in parallel, there should be a noticeable speed increase. Allocating entities for each block in a block of memory could help cache efficiency and, more importantly, prevent false sharing of cache lines. Having entities from different blocks share cache lines easily negates multithreading and often makes it slower than serial execution.

Maybe the fact that inserters take up 50% of the time and it's mostly chasing items on belts is something to learn from. Would you design a system like that in reality? No way. Better to have a splitter-like construction that shunts an item to a pickup area whenever that area is empty. Then the inserter can pick up the item from there knowing it won't disappear. It would also be stationary and potentially you could rotate it to a preset orientation, much easier to pick up that way. So make a belt with an output inventory: the inserter's stack size limits the size of the inventory, the inserter's filter sets the inventory's filter, and then the belt fills the inventory.

Now there are some problems with that idea. Inserters can be removed or filters can change. In that case the inventory would have to drain back onto the belt. Also multiple inserters can pick up from the same belt, so the belt needs multiple inventories and fill them round-robin. All of this only removes the "chasing" items part.

Alternatively, and probably far easier to implement, would be to make the inserters smarter. Don't go chasing after items trying to overtake the belt speed or hoping the belt will backlog and the item won't escape. Move the inserter to the last point where it can pick up the item and hover there. Then go chase towards the item. Also tell every other inserter when you picked an item to chase so they don't steal it. I imagine there are quite a lot of failed chases in large factories that could all be avoided.

PS: Is there any way to get a count of successful and failed chases for inserters? What is the miss rate there?

mrvn · Post by **mrvn** » Thu Sep 01, 2022 12:46 pm

I've made an elevation function for an H-tree fractal using some simple if-else-chains and recursion to produce this:

[Attachment: h-tree.png]

The problem is that this covers "only" a 16384 x 16384 area, taking ~3.5 min (~24 min CPU time) on my 8 core system for a map preview. Vanilla takes ~3s. When using RSO (Resource Spawner Overhaul) the time goes down to 45s (~2s vanilla mapgen). This makes creating a new map take quite a while and the game freezes noticeably whenever a new chunk is generated (just playable with RSO). Making this any larger only gets slower.

Why does RSO make this run 5 times faster? Does each ore call the elevation function again instead of computing it once and passing it to each ore?

Given the recursive nature this generates 13 nested if-else-chain constructs with mirroring (abs function) and rotation at each level. I optimized the main mirroring+rotation in the recursion already with noise.delimit_procedure. I wonder if there is more I can do to speed this up or if this is something the game could optimize better?

One problem is that the recursion is not really a recursion, the code is generated recursively in Lua to generate a huge nested expression. I know I can call a named noise function in another noise function but there doesn't seem to be a way to include passing transformed x/y coordinates to that named noise function, or even better a custom parameter depth. Am I missing something there?

You can test this using https://mods.factorio.com/mod/FractalMaps but for the sake of the discussion I've included the code below. And yes, apart from the locale that is all the code.

Code:

local noise = require("noise")
local tne = noise.to_noise_expression
local var_x = noise.var("x")
local var_y = noise.var("y")
local abs = noise.absolute_value

local square = function(x, y)
  return noise.if_else_chain(
    noise.less_than(80, abs(x)), -1,
    noise.less_than(80, abs(y)), -1,
    1
  )
end

local path1 = function(x, y)
  return noise.if_else_chain(
    noise.less_than(48, abs(y)), square(x, abs(y) - tne(128)),
    noise.less_than(abs(x), 16), 1,
    noise.less_than(abs(y), 16), x,
    -1
  )
end

local function make(x, y, depth)
  if depth == 0 then
    return path1(x, y)
  else
    local t = noise.delimit_procedure(tne(128 * 2 ^ (math.floor(depth / 2))) - abs(y))
    return noise.if_else_chain(
      noise.less_than(48, abs(y)), make(t, x, depth - 1),
      noise.less_than(abs(x), 16), 1,
      noise.less_than(abs(y), 16), x,
      -1
    )
  end
end

local depth = 11
local off = 2048 + 512 + 128

data:extend{
  {
    type = "noise-expression",
    name = "h-tree",
    intended_property = "elevation",
    expression = noise.clamp(make(var_x - off, var_y - off, depth), -1, 0)
  }
}
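
For checking the pattern outside the game, the generated expression can be mirrored as ordinary recursion. A minimal Python transliteration of the Lua above, assuming `noise.less_than(a, b)` means `a < b`, `noise.if_else_chain` takes condition/result pairs followed by a default, and `noise.clamp(v, lo, hi)` clamps `v` into `[lo, hi]`. It says nothing about how the game itself evaluates or optimizes the expression.

```python
def square(x, y):
    # -1 outside the 160 x 160 square, 1 inside
    if 80 < abs(x) or 80 < abs(y):
        return -1
    return 1

def path1(x, y):
    if 48 < abs(y):
        return square(x, abs(y) - 128)
    if abs(x) < 16:
        return 1
    if abs(y) < 16:
        return x
    return -1

def make(x, y, depth):
    if depth == 0:
        return path1(x, y)
    if 48 < abs(y):
        # mirror (abs) and rotate (swap arguments) before descending a level
        t = 128 * 2 ** (depth // 2) - abs(y)
        return make(t, x, depth - 1)
    if abs(x) < 16:
        return 1
    if abs(y) < 16:
        return x
    return -1

def elevation(x, y, depth=11, off=2048 + 512 + 128):
    # same clamp to [-1, 0] as the noise expression
    return max(-1, min(0, make(x - off, y - off, depth)))
```

Evaluated this way, a single point follows one branch per level, so at most `depth + 1` calls are made per coordinate.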
