Skip to content

Conversation

thomasqueirozb
Copy link
Contributor

Summary

cargo test would hang/fail on master both on MacOS and Linux.

Vector configuration

NA

How did you test this PR?

cargo test

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

@thomasqueirozb thomasqueirozb requested a review from a team as a code owner August 4, 2025 17:45
@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Aug 4, 2025
@thomasqueirozb thomasqueirozb changed the title fix(tests): Make tests pass with cargo test fix(tests): Make tests pass with cargo test Aug 4, 2025
@thomasqueirozb thomasqueirozb added the no-changelog Changes in this PR do not need user-facing explanations in the release changelog label Aug 4, 2025
@thomasqueirozb
Copy link
Contributor Author

@jorgehermo9 this touches some tests you added in case you want to take a look

@@ -329,6 +329,7 @@ async fn test_drop_receiver() {

#[tokio::test]
#[cfg(unix)]
#[cfg_attr(target_os = "macos", ignore)] // Flaky when running `cargo test`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know the reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately not, tried to debug briefly but didn't get anywhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok no worries, we should create an issue and add a link to it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment wasn't addressed but the thread was resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was, I just never linked the issue to the PR. #23672

@jorgehermo9
Copy link
Contributor

jorgehermo9 commented Aug 4, 2025

I didn't have any problem while running those tests on linux, could you point me some reproducible example, CI fail or something?

It is very weird that we have this problem with next_addr_any as it reuses the same functionality of next_addr. Does this happen with other tests that use next_addr?

@pront
Copy link
Member

pront commented Aug 4, 2025

I didn't have any problem while running those tests on linux, could you point me some reproducible example, CI fail or something?

It is very weird that we have this problem with next_addr_any as it reuses the same functionality of next_addr. Does this happen with other tests that use next_addr?

I wonder if the utility itself hits a race condition.

@thomasqueirozb
Copy link
Contributor Author

thomasqueirozb commented Aug 4, 2025

could you point me some reproducible example, CI fail or something?

This is not failing on CI since CI is using cargo nextest. However, if you run cargo test instead off of master right now on linux you will see the it hang on a multicast test

It is very weird that we have this problem with next_addr_any as it reuses the same functionality of next_addr. Does this happen with other tests that use next_addr?

Not sure. I think we're hitting this issue since next_addr_for_ip is trying to use the same port for 2 tests running on separate threads but I could be wrong

@jorgehermo9
Copy link
Contributor

jorgehermo9 commented Aug 4, 2025

This is not failing on CI since CI is using cargo nextest. However, if you run cargo test instead off of master right now on linux you will see the it hang on a multicast test

image

I can't reproduce it, every test is working for me. Executed it a few times and didn't fail... Running it on Linux 6.15.6-arch1-1

If the problem is the utility, maybe we should include the lock in lib/portpicker/src/lib.rs?

The usual way of atomically bind to a port is using the port number 0. But the portpicker lib ensures that both the udp and tcp port are free. Usually, we don't need both ports so a refactor to use port number 0 is feasible... I dug into that code when doing those multicast tests, but preferred to keep using next_addr_any.

Binding with a port number of 0 will request that the OS assigns a port to this listener. The port allocated can be queried via the TcpListener::local_addr method.

https://doc.rust-lang.org/std/net/struct.TcpListener.html#method.bind

However, I'm hesitant of fixing it the way you did, with the lock placed there inside the tests. Maybe we should add a test_util primitive that both gets the address with next_addr_for_ip and binds to the socket atomically with that global lock.

@jorgehermo9
Copy link
Contributor

jorgehermo9 commented Aug 4, 2025

Alternative, I think we should have next_addr_any, next_addr and next_addr_for_ip return the same lock guard you are using here alongside with the SocketAddr. this way, the caller can get any behaviour atomically and only one test that calls next_addr would be running at the same time

Something like

    static ADDR_LOCK: LazyLock<Mutex<()>> = LazyLock::new(|| Mutex::new(()));
pub async fn next_addr_for_ip(ip: IpAddr) -> (MutexGuard,SocketAddr) {
    let guard = ADDR_LOCK.lock().await;
    let port = pick_unused_port(ip);
    (guard,SocketAddr::new(ip, port))
}

But I think tokio's MutexGuard contains references so I'm not sure if you can return the guard like this

We can use instead lock_owned https://docs.rs/tokio/latest/tokio/sync/struct.Mutex.html#method.lock_owned to return that guard

@thomasqueirozb
Copy link
Contributor Author

This is not failing on CI since CI is using cargo nextest. However, if you run cargo test instead off of master right now on linux you will see the it hang on a multicast test
I can't reproduce it, every test is working for me. Executed it a few times and didn't fail... Running it on Linux 6.15.6-arch1-1

Sorry, I thought the issue was reproducible in Linux but it isn't. Going to run the test suite on Linux before merging this just to make sure there are no issues

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. We can introduce a generic test coordinator component in the future to clean this up.

Waiting for an ubuntu run...

@thomasqueirozb
Copy link
Contributor Author

cargo test --all still fails but that's a fight for another day. Merging

@thomasqueirozb thomasqueirozb added this pull request to the merge queue Aug 28, 2025
Merged via the queue into master with commit e071b23 Aug 28, 2025
100 checks passed
@thomasqueirozb thomasqueirozb deleted the fix-cargo-test branch August 28, 2025 21:19
@pront
Copy link
Member

pront commented Sep 3, 2025

cargo test --all still fails but that's a fight for another day. Merging

Worth creating an issue so we can keep track of it. And as a FYI to the community.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: sources Anything related to the Vector's sources no-changelog Changes in this PR do not need user-facing explanations in the release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cargo test sometimes fails to run on master
3 participants