nobodyinperson
Contributor

@nobodyinperson nobodyinperson commented Sep 6, 2024

This commit adds the tlsMode, tlsPort and conflictingServices options to security.acme.[...] to allow using it even if port 80 is blocked and DNS access is impossible.

Initial work based on this commit by @jeschmidt.

I tested these changes successfully (backported to 24.05 though), config here.

Description of changes

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Sep 6, 2024
@nobodyinperson nobodyinperson marked this pull request as draft September 6, 2024 16:36
@nobodyinperson
Contributor Author

I saw the error Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp 192.168.1.4:443: connect: connection refused while running the acme NixOS tests, not sure what to do with that. Log below.

`nix-build -A nixosTests.acme` output
...
subtest: Check account hashing compatibility with pre-24.05 settings
webserver: must succeed: rm -rf /var/lib/acme/.lego/accounts/*
(finished: must succeed: rm -rf /var/lib/acme/.lego/accounts/*, in 0.04 seconds)
webserver # stopping the following units: [  416.352081] nixos[7049]: switching to system configuration /nix/store/f0rwd6sn6n54gpb95q02cbbq75h2xyhn-nixos-system-webserver-test
webserver # acme-account-d590213ed52603e9128d.target, acme-example.test.timer, acme-finished-example.test.target, acme-finished-httpd-dns.example.test.target, acme-finished-httpd-http.example.test.target, acme-fixperms.service, acme-httpd-dns.example.test.timer, acme-httpd-http.example.test.timer, httpd.service, systemd-tmpfiles-resetup.service, test-renew-httpd.target
webserver # [  416.358732] systemd[1]: acme-httpd-dns.example.test.timer: Deactivated successfully.
webserver # [  416.360637] systemd[1]: Stopped Renew ACME Certificate for httpd-dns.example.test.
webserver # [  416.362111] systemd[1]: Stopped target acme-finished-example.test.target.
webserver # [  416.364438] systemd[1]: Stopped target Reactivate sysinit units.
webserver # [  416.366216] systemd[1]: systemd-tmpfiles-resetup.service: Deactivated successfully.
webserver # [  416.367187] systemd[1]: Stopped Re-setup tmpfiles on a system that is already running..
webserver # [  416.369216] systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dresetup.service.mount: Deactivated successfully.
webserver # [  416.372638] systemd[1]: Stopped target acme-account-d590213ed52603e9128d.target.
webserver # [  416.376560] systemd[1]: Stopped target acme-finished-httpd-dns.example.test.target.
webserver # [  416.379499] systemd[1]: Stopping Apache HTTPD...
webserver # [  416.381274] systemd[1]: Stopped target Remote File Systems.
webserver # [  416.384070] systemd[1]: Stopped target acme-finished-httpd-http.example.test.target.
webserver # [  416.386173] systemd[1]: acme-example.test.timer: Deactivated successfully.
webserver # [  416.387903] systemd[1]: Stopped Renew ACME Certificate for example.test.
webserver # [  416.390074] systemd[1]: acme-httpd-http.example.test.timer: Deactivated successfully.
webserver # [  416.391380] systemd[1]: Stopped Renew ACME Certificate for httpd-http.example.test.
webserver # [  416.394913] systemd[1]: Stopped target Local File Systems.
webserver # [  416.397638] systemd[1]: acme-fixperms.service: Deactivated successfully.
webserver # [  416.398541] systemd[1]: Stopped Fix owner and group of all ACME certificates.
webserver # [  416.405094] systemd[1]: Stopped target test-renew-httpd.target.
webserver # [  416.483942] systemd[1]: httpd.service: Deactivated successfully.
webserver # [  416.488366] systemd[1]: Stopped Apache HTTPD.
webserver # [  416.489245] systemd[1]: httpd.service: Consumed 238ms CPU time, 15.2M memory peak, 3.8K incoming IP traffic, 6.4K outgoing IP traffic.
webserver # activating the configuration...
webserver # removing group ‘wwwrun’
webserver # removing user ‘wwwrun’
webserver # removing obsolete symlink ‘/etc/httpd/httpd.conf’...
webserver # [  416.841478] systemd[1]: Reload requested from client PID 7049 ('.switch-to-conf') (unit backdoor.service)...
webserver # [  416.842736] systemd[1]: Reloading...
webserver # [  417.337875] systemd[1]: Reloading finished in 493 ms.
webserver # restarting sysinit-reactivation.target
webserver # [  417.362665] systemd[1]: Starting Re-setup tmpfiles on a system that is already running....
webserver # [  417.473428] systemd[1]: Finished Re-setup tmpfiles on a system that is already running..
webserver # [  417.475740] systemd[1]: Reached target Reactivate sysinit units.
webserver # reloading the following units: dbus.service
webserver # [  417.481218] systemd[1]: Reloading D-Bus System Message Bus...
webserver # [  417.502419] dbus-daemon[650]: Unknown username "systemd-timesync" in message bus configuration file
webserver # [  417.519278] dbus-daemon[650]: [system] Reloaded configuration
webserver # [  417.520238] dbus-send[7115]: method return time=1725641407.234364 sender=org.freedesktop.DBus -> destination=:1.28 serial=3 reply_serial=2
webserver # [  417.525466] dbus-daemon[650]: Unknown username "systemd-timesync" in message bus configuration file
webserver # [  417.542755] dbus-daemon[650]: [system] Reloaded configuration
webserver # [  417.544067] systemd[1]: Reloaded D-Bus System Message Bus.
webserver # starting the following units: acme-fixperms.service, systemd-tmpfiles-resetup.service
webserver # [  417.549775] systemd[1]: Reached target Remote File Systems.
webserver # [  417.560763] systemd[1]: Starting Fix owner and group of all ACME certificates...
webserver # [  417.562193] systemd[1]: Generate self-signed certificate authority was skipped because of an unmet condition check (ConditionPathExists=!/var/lib/acme/.minica/key.pem).
webserver # [  417.569732] systemd[1]: Make TPM PCR Policy was skipped because of an unmet condition check (ConditionSecurity=measured-uki).
webserver # [  417.576452] systemd[1]: Starting Load Kernel Module efi_pstore...
webserver # [  417.582618] systemd[1]: Starting Create SUID/SGID Wrappers...
webserver # [  417.583869] systemd[1]: File System Check on Root Device was skipped because of an unmet condition check (ConditionPathIsReadWrite=!/).
webserver # [  417.585950] systemd[1]: Reached target Local File Systems.
webserver # [  417.587308] systemd[1]: Update Boot Loader Random Seed was skipped because no trigger condition checks were met.
webserver # [  417.589027] systemd[1]: Clear Stale Hibernate Storage Info was skipped because of an unmet condition check (ConditionPathExists=/sys/firmware/efi/efivars/HibernateLocation-8cf2644b-4b0b-428f-9387-6d876050dc67).
webserver # [  417.597464] systemd[1]: File System Check on Root Device was skipped because of an unmet condition check (ConditionPathIsReadWrite=!/).
webserver # [  417.635995] systemd[1]: modprobe@efi_pstore.service: Deactivated successfully.
webserver # [  417.640709] systemd[1]: Finished Load Kernel Module efi_pstore.
webserver # [  417.645727] systemd[1]: Platform Persistent Storage Archival was skipped because of an unmet condition check (ConditionDirectoryNotEmpty=/sys/fs/pstore).
webserver # [  417.675485] systemd[1]: Finished Fix owner and group of all ACME certificates.
webserver # [  417.909085] systemd[1]: suid-sgid-wrappers.service: Deactivated successfully.
webserver # [  417.910432] systemd[1]: Finished Create SUID/SGID Wrappers.
webserver # [  417.914622] systemd[1]: Started Renew ACME Certificate for http.example.test.
webserver # [  417.915621] systemd[1]: Generate self-signed certificate for http.example.test was skipped because of an unmet condition check (ConditionPathExists=!/var/lib/acme/http.example.test/key.pem).
webserver # [  417.919638] systemd[1]: Starting Renew ACME certificate for http.example.test...
webserver # [  417.967923] acme-http.example.test-start[7195]: Waiting to acquire lock /run/acme/1.lock
webserver # [  417.971353] acme-http.example.test-start[7195]: Acquired lock /run/acme/1.lock
webserver # [  417.972305] acme-http.example.test-start[7195]: + set -euo pipefail
webserver # [  417.973076] acme-http.example.test-start[7195]: + echo 78c80081fedd8a7ae50d
webserver # [  417.973904] acme-http.example.test-start[7195]: + cmp -s domainhash.txt certificates/domainhash.txt
webserver # [  417.977936] acme-http.example.test-start[7195]: + lego --accept-tos --path . -d http.example.test --email hostmaster@example.test --key-type ec256 --http --http.port :80 run
webserver # [  418.010330] acme-http.example.test-start[7198]: 2024/09/06 16:50:07 No key found for account hostmaster@example.test. Generating a P256 key.
webserver # [  418.011983] acme-http.example.test-start[7198]: 2024/09/06 16:50:07 Saved key to accounts/acme-v02.api.letsencrypt.org/hostmaster@example.test/keys/hostmaster@example.test.key
webserver # [  418.017216] acme-http.example.test-start[7198]: 2024/09/06 16:50:07 Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp 192.168.1.4:443: connect: connection refused
webserver # [  418.022460] acme-http.example.test-start[7195]: + echo Failed to fetch certificates. This may mean your DNS records are set up incorrectly. Selfsigned certs are in place and dependant services will still start.
webserver # [  418.024547] acme-http.example.test-start[7195]: Failed to fetch certificates. This may mean your DNS records are set up incorrectly. Selfsigned certs are in place and dependant services will still start.
webserver # [  418.026510] acme-http.example.test-start[7195]: + exit 10
webserver # [  418.028532] systemd[1]: acme-http.example.test.service: Main process exited, code=exited, status=10/n/a
webserver # [  418.029943] systemd[1]: acme-http.example.test.service: Failed with result 'exit-code'.
webserver # [  418.032243] systemd[1]: Failed to start Renew ACME certificate for http.example.test.
webserver # [  418.033355] systemd[1]: Dependency failed for acme-finished-http.example.test.target.
webserver # [  418.034511] systemd[1]: acme-finished-http.example.test.target: Job acme-finished-http.example.test.target/start failed with result 'dependency'.
webserver # [  418.036569] systemd[1]: acme-http.example.test.service: Consumed 62ms CPU time, 10.4M memory peak, 444B incoming IP traffic, 230B outgoing IP traffic.
webserver # the following new units were started: acme-http.example.test.timer
webserver # warning: the following units failed: acme-http.example.test.service
webserver # [  418.387356] nixos[7049]: switching to system configuration /nix/store/f0rwd6sn6n54gpb95q02cbbq75h2xyhn-nixos-system-webserver-test failed (status 4)
webserver: waiting for unit acme-finished-http.example.test.target
webserver: must succeed: ls /var/lib/acme/.lego/accounts/
(finished: must succeed: ls /var/lib/acme/.lego/accounts/, in 0.02 seconds)
Account hash: 1ccf607d9aa280e9af00

(finished: subtest: Check account hashing compatibility with pre-24.05 settings, in 2.72 seconds)
(finished: run the VM test script, in 419.85 seconds)
test script finished in 420.01s
cleanup
kill machine (pid 10)
qemu-kvm: terminating on signal 15 from pid 7 (/nix/store/pgb120fb7srbh418v4i2a70aq1w9dawd-python3-3.12.5/bin/python3.12)
kill machine (pid 31)
qemu-kvm: terminating on signal 15 from pid 7 (/nix/store/pgb120fb7srbh418v4i2a70aq1w9dawd-python3-3.12.5/bin/python3.12)
kill machine (pid 52)
qemu-kvm: terminating on signal 15 from pid 7 (/nix/store/pgb120fb7srbh418v4i2a70aq1w9dawd-python3-3.12.5/bin/python3.12)
kill machine (pid 73)
qemu-kvm: terminating on signal 15 from pid 7 (/nix/store/pgb120fb7srbh418v4i2a70aq1w9dawd-python3-3.12.5/bin/python3.12)
(finished: cleanup, in 0.03 seconds)
kill vlan (pid 8)
/nix/store/dgjdjjsfis71h6385d815b18dhsya7ym-vm-test-run-acme

@nobodyinperson nobodyinperson marked this pull request as ready for review September 6, 2024 17:06
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. labels Sep 6, 2024
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/prs-ready-for-review/3032/4659

@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nginx-acme-what-am-i-missing/31636/18

Contributor

@HannesGitH HannesGitH left a comment

lgtm

@nix-owners nix-owners bot requested a review from m1cr0man October 18, 2024 06:42
@nobodyinperson
Contributor Author

I don't know if this is related to my changes here or something else with my setup, but I occasionally see the following systemd error after nixos-rebuild switch:

Job for acme-finished-myfqdn.target canceled.
the following new units were started: libvirtd.service
warning: the following units failed: acme-myfqdn.service

× acme-myfqdn.service - Renew ACME certificate for myfqdn
     Loaded: loaded (/etc/systemd/system/acme-myfqdn.service; enabled; preset: enabled)
     Active: failed (Result: signal) since Tue 2024-10-22 10:37:54 CEST; 227ms ago
TriggeredBy: ● acme-myfqdn.timer
    Process: 109294 ExecStart=/nix/store/bvssciwgsrvs45pl61399pyyi0vqgd2w-unit-script-acme-myfqdn-start/bin/acme-myfqdn-start (code=exited, status=0/SUCCESS)
    Process: 109446 ExecStartPost=/nix/store/dd99d0baghb411xlfavqkka18csi2glf-acme-postrun (code=killed, signal=TERM)
   Main PID: 109294 (code=exited, status=0/SUCCESS)
         IP: 5.2K in, 1.1K out
        CPU: 133ms

Okt 22 10:37:54 yann-desktop-nixos acme-myfqdn-start[109294]: + mv domainhash.txt certificates/
Okt 22 10:37:54 yann-desktop-nixos acme-myfqdn-start[109294]: + chown acme:nginx certificates/domainhash.txt certificates/myfqdn.crt certificates/myfqdn.issuer.crt certificates/myfqdn.json certificates/myfqdn.key
Okt 22 10:37:54 yann-desktop-nixos acme-myfqdn-start[109294]: + cmp -s certificates/myfqdn.crt out/fullchain.pem
Okt 22 10:37:54 yann-desktop-nixos acme-myfqdn-start[109294]: + chmod 640 out/cert.pem out/chain.pem out/fullchain.pem out/full.pem out/key.pem
Okt 22 10:37:54 yann-desktop-nixos acme-myfqdn-start[109294]: + echo 'Releasing lock /run/acme/1.lock'
Okt 22 10:37:54 yann-desktop-nixos acme-myfqdn-start[109294]: Releasing lock /run/acme/1.lock
Okt 22 10:37:54 yann-desktop-nixos systemd[1]: acme-myfqdn.service: Control process exited, code=killed, status=15/TERM
Okt 22 10:37:54 yann-desktop-nixos systemd[1]: acme-myfqdn.service: Failed with result 'signal'.
Okt 22 10:37:54 yann-desktop-nixos systemd[1]: Stopped Renew ACME certificate for myfqdn.
Okt 22 10:37:54 yann-desktop-nixos systemd[1]: acme-myfqdn.service: Consumed 133ms CPU time, received 5.1K IP traffic, sent 1.1K IP traffic.
warning: error(s) occurred while switching to the new configuration

The interesting part seems to be acme-myfqdn.service: Control process exited, code=killed, status=15/TERM.

This only happens sometimes, feels like a race condition somewhere. When I restart the service manually, no error. 🤷

@nobodyinperson
Contributor Author

nobodyinperson commented Oct 23, 2024

I pushed a solution for the systemd service race condition.

As there is apparently no built-in systemd mechanism to pause certain services for the time another one runs, workarounds need to be employed. One way is to execute systemctl restart ... in ExecStartPost or ExecStopPost (see e.g. 156884b), but this requires root privileges and is prone to races, especially in combination with Conflicts=...: if a conflicting unit is restarted prematurely (for whatever reason), the invoking service is canceled, and systemd is very loud and unhelpful about it on the CLI.

Another approach (thanks @Atemu) is to set Conflicts, After (needed?), OnSuccess and OnFailure, each pointing at the service to pause. This works and in my testing has been perfectly robust, with the exception that systemd now complains with the warning "multiple trigger source candidates for exit status propagation (myservice.service, myservice.sna94.uni-tuebingen.de.service), skipping.". So systemd dislikes that both success and failure trigger a restart, but in this case it should be fine and not do any harm. In my testing this has been a much better experience than the Exec*Post stunts.
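
For readers unfamiliar with the pattern, the unit relationships described here could be sketched in NixOS terms roughly as follows. This is an illustration only, not the code from 56f61e3: the certificate name example.test and the choice of nginx.service as the service to pause are assumptions.

```nix
{
  # Sketch only: "example.test" and nginx.service are placeholder names.
  systemd.services."acme-example.test" = {
    # Starting the renewal stops nginx, which frees the port for lego.
    conflicts = [ "nginx.service" ];

    # Whether the renewal succeeds or fails, start nginx again afterwards.
    onSuccess = [ "nginx.service" ];
    onFailure = [ "nginx.service" ];

    # Mentioned above as possibly unnecessary, so left commented out here.
    # after = [ "nginx.service" ];
  };
}
```

Because the restart is triggered by systemd itself once the renewal unit has finished, nothing has to call systemctl from inside the unit, which is where the Exec*Post variant needed root privileges.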

With these changes (56f61e3) I also don't need to touch the pre-existing ExecStartPost-code anymore, so I reverted my added bits there.

@m1cr0man
Contributor

This looks good, but I just had one question..

no built-in systemd mechanism to pause certain services for the time another one runs

This is by definition what Before and After do in the context of Type=oneshot services. Looking at the first iteration of the commit, I'm wondering did you get a chance to test one of those options before going in on OnSuccess/OnFailure?

@Atemu
Member

Atemu commented Oct 28, 2024

This is by definition what Before and After do in the context of Type=oneshot services.

That'd be news to me. AFAIK Before and After only concern the order in which services are started if they're started at the same time (i.e. to reach some target), not what happens when the services are already running. The only point where they become relevant again is when both services are stopped simultaneously where the reverse order is applied.

What we need here is a method to pause a service after it has started until a certain unit has finished. Conflicts + OnFailure/OnSuccess is a hack to achieve this and inherently has race conditions but it affects the service while it is running.

@nobodyinperson
Contributor Author

I'm wondering did you get a chance to test one of those options before going in on OnSuccess/OnFailure?

Yes, I tested many variations with Exec*Post and as I wrote in the commit messages and the PR description, my findings are that in combination with Conflicts these appear to be very fragile and have races: When systemd executes any Exec*Post just a tad earlier than the service enters finished state, it recognizes 'oh, a conflicting service has started, we have to abort the acme service immediately!' (which is very stupid). In no situation have I noticed such behavior with the OnFailure|OnSuccess hack. I have been using the latest variant since then and no systemd errors have appeared anymore.

I'm very perplexed as to why systemd doesn't have a mechanism for pausing a service. This must be a common problem.

Contributor

@m1cr0man m1cr0man left a comment

Ok, that all makes sense. I'm happy to approve this now, but I would be pretty adamant that a test is written up for this feature + the edge cases it handles (one would suffice). I'm happy to help.

@wegank wegank added the 12.approvals: 2 This PR was reviewed and approved by two persons. label Oct 29, 2024
@nobodyinperson
Contributor Author

that a test is written up for this feature + the edge cases

Yes, I'd also like to see this tested, but I have no experience with NixOS tests. If you can help me get started with the following, I can give it a try:

  • what file/expression do I use as start? (probably the existing acme tests)
  • how exactly would I run that test?
  • how do I simulate the test being open to the internet and the whole letsencrypt thing? 🤔

@Atemu
Member

Atemu commented Oct 30, 2024

I've been meaning to do so for a while but finally wrote up: systemd/systemd#34954

@Atemu
Member

Atemu commented Oct 30, 2024

The maintainer approved and I don't see any obvious issues with the diff. I think we're good to merge.

@nobodyinperson could you clean up the git commits into sensible steps and make their messages conform with the commit message conventions?

@m1cr0man
Contributor

These are great questions @nobodyinperson 😄 let me see what I can do to answer them..

  • what file/expression do I use as start? (probably the existing acme tests)

Yes, the existing acme tests can be extended to cover this use case.

Some basics on NixOS tests: The main docs are in the NixOS manual. It is missing a couple of things though, notably a high level overview and instructions on different execution modes.

  • In a NixOS test, you define 1 or more NixOS VMs (nodes) and a testScript Python program to execute. When built, the script will be executed and, if successful, will result in a store path.
  • Technical detail: NixOS tests are derivations based on the runTest module. A Python program guides the test execution, managing VMs and performing the specified assertions on them. Using specialisations provides a way to create multiple node configurations without creating multiple full-fledged VMs.
  • To build (AKA run) a NixOS test in a flake-like manner, you can use something like nix build --print-build-logs .#nixosTests.acme (assuming working directory == nixpkgs).
  • If you need to step through the test suite or interrogate the nodes, you can run the test interactively by first building the interactive driver with nix build .#nixosTests.acme.driverInteractive and then running result/bin/nixos-test-driver.
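
To make the points above concrete, a stand-alone test could look roughly like the sketch below. It is not taken from the ACME test suite; the file name, test name and node name are made up for illustration.

```nix
# minimal-test.nix (hypothetical file name). It could be built with something like:
#   nix-build -E 'with import <nixpkgs> { }; testers.runNixOSTest ./minimal-test.nix'
{
  name = "minimal-example";

  # One VM ("node") with nginx enabled.
  nodes.machine = { ... }: {
    services.nginx.enable = true;
  };

  # Python script run by the test driver against the node(s).
  testScript = ''
    machine.start()
    machine.wait_for_unit("nginx.service")
  '';
}
```

If the script finishes without a failed assertion, the build succeeds and produces a store path.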

This leads nicely into question 2 + 3...

  • how exactly would I run that test?
  • how do I simulate the test being open to the internet and the whole letsencrypt thing? 🤔

The ACME test script creates a few nodes to get the job done:

  • acme: This is our mock LetsEncrypt server, powered by pebble.
  • dnsserver: A mock DNS server used to test DNS-01 resolution, utilising pebble-challtestsrv.
  • webserver: Houses integration tests for most of the module's features + use cases, most notably integration with HTTP servers - nginx, httpd and Caddy. Note the extensive use of specialisations.
  • client: A simple node used to test the validity of configured certs on the webserver services.

This results in a fully simulated ACME world, where we have a trusted certificate authority that can generate signed certs using any challenge mechanism.

The testScript is also fairly extensive, as it has to step through each specialisation and perform assertions on the state of the certificates in each configuration. There is lots of error handling helper code to deal with the fact that it's nigh impossible to synchronise configuration switches + readiness across multiple NixOS VMs.

With all this in mind, adding a test for TLS ALPN should be relatively straightforward. You need to add a specialisation to the webserver which configures the challenge as appropriate, then add an assertion in the testScript to ensure it works.
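
As a very rough sketch of what such an addition might look like: the fragment below is not from the actual test suite. The specialisation name tls-alpn and the certificate name tls.example.test are invented, and while tlsMode and conflictingServices are the option names this PR introduces, the values assigned to them here are guesses rather than something taken from the diff.

```nix
{
  nodes.webserver = { ... }: {
    # Hypothetical specialisation and certificate names.
    specialisation.tls-alpn.configuration = {
      security.acme.certs."tls.example.test" = {
        # Option names are from this PR; the values are assumed, not
        # copied from the actual module definition.
        tlsMode = true;
        conflictingServices = [ "nginx.service" ];
      };
    };
  };

  testScript = ''
    # Switch the running VM to the specialisation, then wait for the
    # target that signals the certificate run has finished.
    webserver.succeed(
        "/run/current-system/specialisation/tls-alpn/bin/switch-to-configuration test"
    )
    webserver.wait_for_unit("acme-finished-tls.example.test.target")
  '';
}
```

A real test would live inside the existing ACME test, next to the acme, dnsserver and client nodes, and would reuse its helper code for switching configurations instead of calling switch-to-configuration directly.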

Other maintainers/readers: Would there be value in putting the contents of this comment on the wiki somewhere? It is a hodgepodge of testing info, but perhaps others may find it useful.

@m1cr0man
Contributor

Thanks for the update! I will review this as soon as I can. I'm really intrigued that you found and fixed a deadlock. No worries on the tests - I am rewriting the whole test suite as part of #374792

@nobodyinperson
Contributor Author

I'm really intrigued that you found and fixed a deadlock.

Well, it was a deadlock possibility that this PR would have introduced itself, so yeah... 😅 But still good that it should be fixed now.

@m1cr0man
Contributor

I think it may be very related to a deadlock (or at least a race condition) with HTTP-01 which stems from an architectural issue in the module. Maybe I'm wrong though, but I'll know once I review.

@wegank wegank removed the 12.approvals: 2 This PR was reviewed and approved by two persons. label Jan 23, 2025
@nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/using-changes-from-a-nixpkgs-pr-in-your-flake/60948/1

@m1cr0man
Contributor

Hey there. We had a big problem with the test suite that just got solved on Monday. This opens up my own time to help you get this in now 🙂 I'll try review it tonight, should have a moment.

Contributor

@m1cr0man m1cr0man left a comment

This looks good, but you will need to rebase now (my fault, sorry ;) )

@Skyb0rg007
Contributor

Currently it seems like the strategy is to disable Nginx in order to serve the response via the lego server, with that server taking over port 443. I think this can be avoided in proxy servers like Nginx by recognizing the acme-tls ALPN protocol and conditionally proxying the request to the lego server on a different port -- no need to disable nginx (just restart after the cert is changed). I don't have the exact config, but something like this could potentially work:

stream {
  map $ssl_preread_alpn_protocols $backend {
    ~\bacme-tls/1\b acme-alpn-01;
    default https-server;
  }

  # ...
  # Just need to `proxy_pass localhost:port` if $backend is acme-alpn-01,
  # otherwise serve HTTPS like normal.
}
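
Spelled out a little further and expressed through the NixOS option services.nginx.streamConfig, the idea might look like the sketch below. It is untested. Both ports are assumptions: 10443 stands in for wherever lego would listen for the TLS-ALPN-01 challenge, and 8443 for wherever nginx's regular HTTPS virtual hosts would have to be moved. It also relies on nginx being built with the stream and ssl_preread modules.

```nix
{
  services.nginx.streamConfig = ''
    # Route on the client's ALPN protocol list without terminating TLS.
    map $ssl_preread_alpn_protocols $backend {
      ~\bacme-tls/1\b 127.0.0.1:10443; # assumed port of lego's TLS-ALPN-01 listener
      default         127.0.0.1:8443;  # assumed port where nginx serves HTTPS
    }

    server {
      listen 443;
      ssl_preread on;
      proxy_pass $backend;
    }
  '';
}
```

With a setup like this, only challenge connections would reach lego, so nginx itself would not need to be stopped during a renewal.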

@nobodyinperson
Contributor Author

Here someone put SSL on a different port, but putting lego on a different port sounds way more elegant. That'll be --tls.port, right?

https://serverfault.com/a/958868

@m1cr0man
Contributor

Here someone put SSL on a different port, but putting lego on a different port sounds way more elegant. That'll be --tls.port, right?

https://serverfault.com/a/958868

That would be used in combo with what @Skyb0rg007 suggested. LetsEncrypt is only going to try port 443, so you would proxy from the web server to Lego.

What I would say is: this is already a problem for HTTP-01. Users decide between having acme run on port 80 or another web server which proxies requests to it. The same should be true for TLS-ALPN-01.

We would likely want to update the nginx and Httpd modules to provide the option to use tls-alpn, opening up the possibility for not listening on port 80 at all. Whether you do that in this PR or not is up to you, but that will absolutely need accompanying tests to merge.

I still don't mind the conflictingServices option. It seems low enough overhead to have it there and give users the option. We don't have to gate keep how a user sets things up by not adding it.

@nobodyinperson
Contributor Author

We would likely want to update the nginx and Httpd modules to provide the option to use tls-alpn, opening up the possibility for not listening on port 80 at all.

Would be very cool if nginx wouldn't need to be shut down. But this PR provides a working solution for the problem of having port 80 blocked and not controlling DNS, so I'd say let's first finish this and improve later separately.

you will need to rebase now

Done. Couldn't test it on my live NixOS 24.11 deployment though because the PR patch doesn't apply properly anymore. But the squashed patch before rebasing works fine, so I guess it should be okay.

On that note, there's still no tests.

@m1cr0man
Contributor

On that note, there's still no tests

That's fine for now, I'll try work on a test suite once this is merged. It is much easier to do now.

Would you be interested in contributing the components for the web server proxying of tls-alpn?

@nobodyinperson
Contributor Author

Would you be interested in contributing the components for the web server proxying of tls-alpn?

maybe I'll find the time, if it annoys me too much that I have to shut down nginx for the certificates 😉 no promises yet

@m1cr0man
Contributor

m1cr0man commented Mar 2, 2025

maybe I'll find the time, if it annoys me too much that I have to shut down nginx for the certificates 😉 no promises yet

That's fine - I was considering doing it myself, so I didn't want to jump the gun if you were already on it.

@wegank wegank added the 2.status: merge conflict This PR has merge conflicts with the target branch label Apr 2, 2025
@nobodyinperson
Contributor Author

Wow, something happened and introduced a massive merge conflict. 😮‍💨

@nobodyinperson
Contributor Author

I managed to rebase onto master - Not manually (no chance), not with several versions of manually applying nix-shell --run 'treefmt <prfiles>', not with the nixfmt merge driver (choked on some syntax problem), but with the help of mergiraf, which didn't break a sweat. Invaluable tool! ✨

@nobodyinperson
Contributor Author

Something isn't right. I pushed ca82750 to this branch but it doesn't appear here. Is it a thing that GitHub disconnects forks somehow? 🤔

@m1cr0man
Contributor

m1cr0man commented May 7, 2025

Hm that's really weird, I've never seen this before. Want to try amending the head of the branch then force push? Might wake GitHub up

This commit adds the `tlsMode`, `tlsPort` and `conflictingServices`
options to `security.acme.[...]` to allow using it even if port 80 is
blocked and DNS access is impossible. Integration with nginx's
`enableACME = true` is adjusted accordingly.
@ofborg ofborg bot removed the 2.status: merge conflict This PR has merge conflicts with the target branch label May 7, 2025
@nobodyinperson
Contributor Author

Amending and force-pushing seems to have done it, good tip! 👍

@m1cr0man
Contributor

m1cr0man commented May 7, 2025

Would you be willing to extend the logic added to nginx into the httpd module also? We tend to maintain feature parity for ACME between the two and it would be nice to have there also.

@nobodyinperson
Contributor Author

@m1cr0man Here you have a commit on top/outside this PR that attempts to do this: nobodyinperson@7c6753c

I didn't test it as I don't have experience with httpd and the examples in the manual seem needlessly complicated with apache httpd. I also didn't find any apache/httpd/acme nixos tests to run quickly and gave up as I don't have time for this now.

@duament
Contributor

duament commented May 11, 2025

https://github.com/acmesh-official/acme.sh/wiki/TLS-ALPN-without-downtime#nginx

This guide might help us to implement TLS-ALPN without downtime.

@nixpkgs-ci nixpkgs-ci bot added the 12.approvals: 2 This PR was reviewed and approved by two persons. label Jun 25, 2025
@nixpkgs-ci nixpkgs-ci bot added the 2.status: merge conflict This PR has merge conflicts with the target branch label Aug 16, 2025
Labels
2.status: merge conflict This PR has merge conflicts with the target branch 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin. 10.rebuild-linux: 1-10 This PR causes between 1 and 10 packages to rebuild on Linux. 12.approvals: 2 This PR was reviewed and approved by two persons.

8 participants