Support Configurable Large Messages (> 1 MiB) via Zero-Copy Vectored I/O#198
Support Configurable Large Messages (> 1 MiB) via Zero-Copy Vectored I/O#198abhishek10004 wants to merge 4 commits into
Conversation
This introduces support for vectored reads (reading FUSE requests from the device via readv into non-contiguous block buffers, and passing them up to the filesystem via DstBufs). Includes multi-block allocation infrastructure in InMessage, platform-specific support for FuseT (contiguous fallback), and configuration.
This introduces support for vectored writes, bypassing copying write payload bytes into a single contiguous slice in WriteFileOp.Data, instead providing the raw non-contiguous blocks in WriteFileOp.DataBlocks. Includes the optimization and implementation in MemFS, along with wirelog, debug, and test updates.
1891bb9 to
8005ed3
Compare
| } | ||
|
|
||
| var BlockPool1M = newBlockPool(48, func() []byte { | ||
| return make([]byte, MiB) |
There was a problem hiding this comment.
I've not used mmap here because currently I'm overflowing to a syncPool and hence if there is a lot of parallelism, we would not be paying the allocation penalty again and again.
In case we take over the memory allocation using mmap and bypass the go runtime, then there would be 2 options:
a) fixed size buffer pool but that would mean constant allocation/deallocation in case parallelism is higher than the configured limits
b) dynamic pool that keeps growing/shrinking but this would be a slightly larger change & would need more testing.
Hence, I've parked it for later, either as a separate commit or a new change.
vadlakondaswetha
left a comment
There was a problem hiding this comment.
yet to review the testcases and samples
| initOp.MaxReadahead = maxReadahead | ||
| initOp.MaxWrite = buffer.MaxWriteSize | ||
|
|
||
| maxPayload := c.inMessageSize - buffer.GetPageSize() |
There was a problem hiding this comment.
do we anticipate different sizes for read and write. if not can we have just one variable which tells size of the request for both reads and writes.
| } | ||
|
|
||
| // NewInMessage creates a new InMessage. | ||
| func NewInMessage(size int) *InMessage { |
There was a problem hiding this comment.
why are you taking a size parameter if its not used.
| err = nil | ||
| continue | ||
| } | ||
| if errors.Is(err, syscall.ENODEV) { |
There was a problem hiding this comment.
we are removing a typecasting here? Are these changes intentiontal? If yes, how did they work earlier?
There was a problem hiding this comment.
The old code was the idiomatic way to handle nested error before Errors.Is was released.
When reading from a device or a file, the error returned is typically wrapped in PathError. Hence, we were unwrapping it earlier and then comparing it. With Errors.Is, this is no longer needed.
| var err error | ||
| if fusekernel.IsPlatformFuseT { | ||
| n, err = m.ReadSingle(r) | ||
| if len(m.blocks) == 1 { |
There was a problem hiding this comment.
as discussed please remove all changes for MAC and throw not supported exception when messageSize is bigger. Lets not checkin changes which are not reviewed.
| return pageSize | ||
| } | ||
|
|
||
| type blockPool struct { |
There was a problem hiding this comment.
Please create a seperate file for blockPool
| block := BlockPool1M.Get() | ||
| m.borrowedBlocks = append(m.borrowedBlocks, block) | ||
| allocSize := MiB | ||
| if remaining < allocSize { |
There was a problem hiding this comment.
what happens if you remove this check?
There was a problem hiding this comment.
Then the final block won't be truncated to the actual remaining requested size and we'd be passing a buffer which is larger than the requested data.
| // Since n doesn't fit in block 0, and block 0 has size 1MB + pageSize, | ||
| // n is necessarily larger than 1MB (assuming typical small offset like | ||
| // sizeof(ReadIn)). Thus we always allocate directly on the heap. | ||
| return make([]byte, n) |
There was a problem hiding this comment.
in the existing code, if we dont required buffer, we are returning nil vs here we are creating a new buffer.
Also i didnt understand in what scenarios would it cross 1MB?
There was a problem hiding this comment.
I'm doing a new allocation here since the read request can go beyond 1MB reads and if the user has not enabled vectored reads, than we'd have to pass a single buffer of requested size to the user.
Regarding earlier case, in that scenario we had a 1MiB+page size buffer and the header is less than page size, so the original buffer would always have 1MiB space & hence the condition "n > len(m.storage)-m.size" would never be true.
| } | ||
| // Use part of the incoming message storage as the read buffer. | ||
| to.Dst = inMsg.GetFree(int(in.Size)) | ||
| if config.EnableVectoredReads && int(in.Size) > buffer.MiBPlusPageSize { |
There was a problem hiding this comment.
Brainstorming a bit here. why do we need to support both vectoredReads and non vectoredReads. Can we just pass 2-D array always. It would be a minor change on the GCSFuse side. How big of a change will it be on GCSFuse side? I am guessing we can just pick the first block from the array and pass it downstream when messageSize is 1MB?
There was a problem hiding this comment.
I kept support for non-vectored I/O as well so that it doesn't become a breaking change. Otherwise, anyone updating the library would be forced to make changes to their code, even though those changes would mostly be minor.
| var buf []byte | ||
| var dataBlocks [][]byte | ||
|
|
||
| if config.EnableVectoredWrites && inMsg.Len() > uintptr(buffer.MiBPlusPageSize) { |
There was a problem hiding this comment.
by moving everything to vectoredReads/writes we need not do if-else every where. the code becomes much simpler.
There was a problem hiding this comment.
reduces the number of configs too.
There was a problem hiding this comment.
yes, it would become easier & cleaner as well but as mentioned above it'd be a breaking change.
| // In production, any spanning allocation is larger than 1MB (since block 0 | ||
| // is 1MB + pageSize and fits all normal headers/payloads). Thus we always | ||
| // allocate directly from the heap. | ||
| res := make([]byte, n) |
There was a problem hiding this comment.
Same as reads? why would we overflow here and not earlier?
There was a problem hiding this comment.
This new allocation will be used when the user has not enabled vectored writes and increased the fuse max pages limit beyond 1M. In that case, we'd have to allocate a single buffer which can contain the whole content to be written.
|
Hey @abhishek10004. Thanks for putting this together! I know that I advocated for the approach of using readv() to allow large messages (e.g. 16MB reads which is what we need on GCS Standard), while at the same time not having to make all messages 16MB. However, after reviewing this PR in some detail, I am not very comfortable with the complexity of what we're adding and the size of this change. I wonder if instead we need to do something simpler: add a MaxPages config option to MountConfig. If set, that's what we pass in Connection::Init. If unset, we keep the current value based on MaxRead/WriteSize. Note that we'd keep MaxWrite as it is right now. I think the result would be:
What do you think? Would this work? |
We can set MaxPages & MaxWrite to different values. In this case, while the reads would be limited by MaxPages value (which means we can get read requests larger than 1MB), the write requests would be limited by the minimum of what MaxPages & MaxWrite evaluates to. Hence, if we keep setting MaxWrite to 1M, we'd not need to use readv and vectored io at all. This would significantly reduce the amount of change. Only issue would be that in case of writes, we'd not get larger requests but that would be okay because we still buffer the writes locally (in streaming writes, we buffer in the in-memory block till it's full whereas in staging writes we buffer full writes on disk). Hence, smaller requests would not make that much of difference as the major bottleneck in writes would actually be the upload time. |
Exactly. Writes will always have to be buffered /somewhere/ (in an actual buffer, or in a network stream) due to the semantics of object storage that don't allow you to just do random overwrites. So I think that we don't really need large requests for writes. Or at least, we don't need them in the near future. |
Overview
Previously, this library had a hardcoded FUSE message buffer size of
1 MiB + pageSize(corresponding to the standard 1 MiB FUSE payload limit). Because of this hardcoded limit, large reads and writes (> 1 MiB) were not supported at all.This PR adds support for large I/O operations by introducing a configurable
MaxMessageSizeinMountConfig, allowing daemons to read and write messages larger than 1 MiB.To prevent the severe performance regressions, memory fragmentation, and garbage collection (GC) pressure that would arise from allocating giant contiguous buffers (e.g., 4 MiB, 8 MiB, or 16 MiB) on the heap for every request, this support is implemented via Vectored I/O:
readvsystem call. The filesystem can then write the read payload directly into these blocks viaReadFileOp.DstBufs(Zero-Copy Vectored Reads).WriteFileOp.DataBlocks(Zero-Copy Vectored Writes).Key Benefits
BlockPool1MandBlockPool1MPlusPage) and thereadvsystem call.DstandData) for reads/writes under 1 MiB or when vectored I/O is disabled, ensuring existing filesystem implementations continue to work out-of-the-box.Commit-by-Commit Walkthrough
1. Enable Setting FUSE Buffer Sizes Dynamically
Commit:
ddc386cMaxMessageSize uint32to mount_config.go.c.inMessageSizein connection.go based onMaxMessageSize(or defaults to the maximum ofMaxReadSize/MaxWriteSize+ 1 page).Initcall) to announce this dynamically calculated size asMaxWriteandMaxPagesto the FUSE kernel driver.InMessagein internal/buffer/in_message.go to accept a dynamic allocation size instead of using a global staticbufSize.2. Refactor Error Handling using standard
errors.IsCommit:
c263028Connection.readMessage()in connection.go to use standarderrors.Is(err, syscall.ENODEV)anderrors.Is(err, syscall.EINTR)checks instead of type-casting to*os.PathErrorand inspecting inner fields. This ensures compatibility in case errors are wrapped.3. Implement Zero-Copy Vectored Reads Support
Commit:
72acb69ReadFileOp.DstBufsto support large reads (> 1 MiB) without heap thrashing.BlockPool1M(1 MiB blocks) andBlockPool1MPlusPage(1 MiB + hardware page size) in internal/buffer/in_message.go to recycle buffers and avoid heap thrashing.InMessageto allocate non-contiguous blocks (blocks [][]byte) rather than a single contiguous slice. Block 0 is always sized1 MiB + pageSize(holding headers and small payloads), and additional 1 MiB blocks are allocated to satisfy larger message limits.readv):SYS_READVsyscall, converting block slices tounix.Iovecpointers to perform a single-system-call read into multiple non-contiguous memory segments.FuseT) by implementing a contiguous fallback pool (fuseTContiguousPool).EnableVectoredReadstoMountConfig.DstBufs [][]byteto fuseops/ops.go. When enabled and the read size is larger than block 0,Dstis set tonilandDstBufsis populated with the block-sliced buffers, allowing the filesystem to write read payloads directly into the FUSE message blocks.4. Implement Zero-Copy Vectored Writes Support & MemFS Optimization
Commit:
8005ed3memfsmemory filesystem.EnableVectoredWritestoMountConfig.DataBlocks [][]byteand aTotalSize() inthelper method to fuseops/ops.go.convertInMessageslices the write payload directly intoDataBlocks(usingConsumeVector), completely avoiding copying the payload into a single contiguous slice.WriteBlocksAt(blocks [][]byte, off int64)to samples/memfs/inode.go to copy block-by-block directly into the inode's storage slice.WriteFilein samples/memfs/memfs.go to leverageWriteBlocksAtwhenDataBlocksis populated.debug.goandwirelog.goto use the newWriteFileOp.TotalSize()helper method.VectoredWritesTestin samples/memfs/memfs_test.go.Architectural Design: Why Vectored I/O?
To support message sizes larger than 1 MiB, allocating contiguous buffers dynamically (e.g., a single 8 MiB buffer for an 8 MiB read/write) is highly inefficient due to severe GC pressure and heap fragmentation.
Instead, the library now implements a non-contiguous, block-based architecture:
1. Zero-Copy Reads
When a FUSE read request is received:
readvsyscall on Linux to read data directly from the/dev/fusedescriptor into the pooled blocks, avoiding any kernel-to-user memory copy.ReadFileOp.DstBufswith these blocks.DstBufs, requiring zero extra allocations or copy operations.2. Zero-Copy Writes
When a FUSE write request is received:
WriteFileOp.DataBlocks.memfsusingWriteBlocksAt) can consume these blocks directly.Configuration Reference
Three new fields are introduced in
MountConfigto manage the large message and vectored I/O behavior:MaxMessageSizeuint32EnableVectoredReadsboolReadFileOp.Dstand instead populateReadFileOp.DstBufs.EnableVectoredWritesboolWriteFileOp.Dataand instead populateWriteFileOp.DataBlocks.Important
For
MaxMessageSizevalues greater than 1 MiB, enabling bothEnableVectoredReadsandEnableVectoredWritesis highly recommended to avoid significant performance regressions due to large heap allocations and contiguous copies.Testing & Verification
internal/buffer/in_message_test.goverifies all edge cases of multi-block message parsing (consuming across block boundaries, shrinking, and block recycling).samples/memfs/memfs_test.gocontains test cases ensuring that MemFS correctly handles vectored write operations whenEnableVectoredWritesis enabled.internal/buffer/out_message_test.gohave been updated and verified to measure reset, growth, and shrink performance accurately.